This article provides a critical evaluation of the accuracy of SMILES and SELFIES molecular representations for AI-driven drug discovery. It covers the foundational principles of these string-based formats, explores their methodological application in machine learning models like transformers, and addresses key troubleshooting and optimization strategies to overcome inherent limitations. A central focus is the comparative validation of their performance in real-world tasks such as property prediction and synthesis planning, synthesizing recent research to offer practical guidance for researchers and scientists on selecting and implementing the most effective representation for their specific objectives.
In the field of AI-driven drug discovery and materials science, representing complex molecular structures in a format that computers can understand is a foundational challenge. Molecular string representations serve as this crucial bridge, translating the intricate topology of atoms and bonds into linear sequences that machine learning models, particularly language models, can process. The two predominant languages in this domain are the Simplified Molecular Input Line Entry System (SMILES) and the more recent Self-Referencing Embedded Strings (SELFIES). The choice between them significantly impacts the performance, robustness, and applicability of AI models in molecular property prediction, virtual screening, and de novo molecular design. This guide provides a comparative analysis of SMILES and SELFIES, grounded in recent experimental data, to inform researchers and developers in selecting the optimal representation for their specific applications.
Introduced in 1988, SMILES is a widely adopted notation that encodes a molecular graph into a linear string of ASCII characters [1]. It uses atomic symbols (e.g., C, O, N), bond symbols (e.g., -, =, # for single, double, and triple bonds), and parentheses to represent branches. Ring structures are indicated by matching numbers placed after the involved atoms [2]. For example, benzene is represented as c1ccccc1 [2]. Its key advantage is simplicity and human-readability, making it a staple in chemical databases and a natural fit for early NLP-based AI approaches [3].
However, SMILES has critical limitations. A single molecule can have multiple valid SMILES strings depending on the starting atom and traversal order, leading to ambiguity [1] [4]. More critically, its grammar does not inherently enforce chemical validity. When used in generative AI models, this often results in a high percentage of syntactically or semantically invalid strings that represent impossible molecules, hampering automated design workflows [1] [5].
SELFIES was developed specifically to address the robustness issues of SMILES. Its fundamental innovation is a formal grammar that guarantees 100% syntactic and semantic validity [2] [5]. Every possible SELFIES string corresponds to a valid molecule. It achieves this by representing complex spatial features like rings and branches with single, dedicated symbols (e.g., [Ring1], [Branch1]), with the length of the feature explicitly encoded [2]. This makes SELFIES particularly powerful for generative tasks, as random mutations or model outputs always produce viable molecules, leading to a denser and more explorable chemical latent space [1].
The theoretical advantages of SELFIES translate into measurable differences in model performance across various tasks. The following tables consolidate key experimental findings from recent literature.
Table 1: Performance Comparison on MoleculeNet Benchmark Tasks (ROC-AUC)
| Model / Representation | HIV | Tox21 | SIDER | BBBP | Notes | Source |
|---|---|---|---|---|---|---|
| SMILES (ChemBERTa-2) | 0.780 | 0.839 | 0.835 | Info Missing | Reported as competitive | [6] |
| SELFIES (SELFormer) | Info Missing | Info Missing | 0.810 | 0.950 | Outperformed GNNs & SMILES | [7] |
| SELFIES (Augmented) | 0.844 (Avg.) | Info Missing | 0.844 (Avg.) | Info Missing | 5.97% improvement over SMILES in classical models | [2] |
| SMILES (Domain-Adapted to SELFIES) | Info Missing | Info Missing | Info Missing | Info Missing | Matched/exceeded SMILES baseline | [7] |
Table 2: Performance on Quantum Chemistry and Physicochemical Property Prediction
| Model / Representation | ESOL (RMSE↓) | FreeSolv (RMSE↓) | Lipophilicity (RMSE↓) | QM9 (MAE↓) | Notes | Source |
|---|---|---|---|---|---|---|
| SMILES (ChemBERTa-77M-MLM) | ~1.10 (est.) | ~2.70 (est.) | ~0.80 (est.) | Baseline | Trained on 77M molecules | [7] |
| SELFIES (SELFormer) | 0.53 | Info Missing | Info Missing | Info Missing | >15% improvement over GEM GNN | [7] |
| SELFIES (Domain Adapted) | 0.944 | 2.511 | 0.746 | Competitive | Adapted with only 700K molecules | [7] |
Key Insights from Data:
A critical differentiator in modern chemical language models is the tokenization philosophy.
A significant challenge in evaluating chemical language models (ChemLMs) is determining whether they have learned underlying chemical principles or are merely memorizing textual patterns. The Augmented Molecular Retrieval (AMORE) framework was developed to address this [4].
Hypothesis: A robust ChemLM should recognize that different, valid SMILES or SELFIES strings representing the same molecule are semantically equivalent. Its internal embeddings for these synonymous strings should be very similar.
Methodology:
Experiments using AMORE have revealed that many state-of-the-art ChemLMs are not robust to different SMILES representations, highlighting a key area for future improvement [4]. The following diagram illustrates the AMORE workflow.
Training a transformer model on SELFIES from scratch is computationally expensive. A resource-efficient alternative is Domain-Adaptive Pre-Training (DAPT). A recent study demonstrated that a SMILES-pretrained model (ChemBERTa) can be successfully adapted to SELFIES without changing its tokenizer or architecture [7].
Experimental Protocol:
Results: The domain-adapted model matched or exceeded the performance of the original SMILES-based model and even the larger ChemBERTa-77M-MLM on most targets, despite a 100x smaller pre-training corpus. This provides a practical, low-cost pathway for researchers to leverage SELFIES [7]. The workflow is summarized below.
Table 3: Key Software and Datasets for Molecular Representation Research
| Tool / Resource | Type | Primary Function | Relevance to SMILES/SELFIES |
|---|---|---|---|
| RDKit | Software Library | Cheminformatics and molecule manipulation | Industry-standard for converting between molecular formats (e.g., SMILES to 2D image), calculating descriptors, and processing SELFIES [8]. |
| SELFIES Python Library | Software Library | SELFIES encoding/decoding | Essential for converting SMILES to SELFIES and vice-versa, enabling their use in machine learning pipelines [7]. |
| MoleculeNet | Benchmark Dataset | Curated collection for molecular ML | Standard benchmark for fair comparison of model performance across tasks like HIV, Tox21, ESOL, etc. [2] [7] |
| PubChem | Chemical Database | Public repository of chemical molecules | Primary source for large-scale, diverse molecular structures for pre-training chemical language models [8] [7]. |
| ZINC | Chemical Database | Commercial compounds for virtual screening | Another major source of millions of purchasable compounds, often used for pre-training models like ChemBERTa [4]. |
| Hugging Face Transformers | Software Library | NLP model architecture & pre-trained models | Provides the core transformer implementation (BERT, RoBERTa) and framework for building models like ChemBERTa and MolBERT [1] [6]. |
The choice between SMILES and SELFIES is not a simple verdict of one being universally superior. Instead, it is a strategic decision based on the specific application, available resources, and desired model properties.
The future of molecular representations lies in hybrid and adaptive approaches. As demonstrated by domain adaptation, it is possible to leverage the vast existing investment in SMILES-based models while gaining the benefits of SELFIES. Emerging tokenization methods like APE show that customizing the entire NLP pipeline for chemical language can yield significant performance gains. As AI continues to reshape drug discovery and materials science, the evolution of molecular representations will undoubtedly remain a vibrant and critical area of research.
The Simplified Molecular Input Line Entry System (SMILES) is a cornerstone of computational chemistry, providing a compact, human-readable ASCII string for representing molecular structures [9]. Developed in the 1980s by David Weininger, SMILES functions as a linearized, serialized representation of molecular graphs, encoding atoms, bonds, branching, ring closures, and stereochemistry in a one-dimensional format [10] [9]. This representation has become indispensable for chemical databases, machine learning applications, and drug discovery pipelines, enabling efficient storage, retrieval, and computational analysis of chemical information [1] [11] [9].
Despite its widespread adoption, SMILES exhibits inherent limitations that pose significant challenges for computational applications. The notation permits multiple valid string representations for the same molecule, creating redundancy that can impair machine learning model performance [12]. Furthermore, SMILES strings can be syntactically valid yet chemically impossible, leading to semantic errors that complicate generative modeling [1] [13]. These shortcomings have motivated the development of alternative representations including SELFIES, DeepSMILES, and t-SMILES, each designed to address specific limitations while maintaining the sequence-based paradigm favorable for natural language processing techniques [1] [14].
This guide provides a comprehensive technical analysis of SMILES syntax, encoding rules, and common pitfalls, framed within an empirical comparison of contemporary molecular representations. By examining experimental data across key performance metrics including syntactic validity, semantic correctness, distribution learning, and goal-directed optimization, we establish a rigorous framework for evaluating representation efficacy in cheminformatics applications.
SMILES encodes molecular structures through a systematic grammar that translates molecular graphs into linear strings. The basic syntax elements include:
Atomic Representation: Standard atomic symbols (C, N, O, P, S, F, Cl, Br, I) constitute the "organic subset" and can be written without brackets, with implied hydrogen atoms added to satisfy valence requirements [9]. Elements outside this subset must be enclosed in square brackets (e.g., [Na+] for sodium cation), as must atoms with specified isotopes, charges, or unusual valence (e.g., [13C], [Fe+2]) [11] [9].
Bond Notation: Bonds are represented with specific symbols: single bonds as - (often omitted), double bonds as =, triple bonds as #, and aromatic bonds as : (typically implied in aromatic systems using lowercase atomic symbols) [9]. For example, ethene is written as C=C, while ethyne is C#C.
Branching Representation: Branched structures are denoted using parentheses, with the branch attached to the atom immediately preceding the parentheses [11]. For example, isopropanol is represented as CC(O)C, where the hydroxyl group is branched from the central carbon. Multiple or nested branches are permitted, such as CC(O)(Cl)C for a central carbon bearing both hydroxy and chloro substituents [10].
Ring Closures: Cyclic structures are formed by breaking one bond in the ring and assigning matching numerical labels to the atoms involved in the ring closure [9]. For example, cyclohexane is represented as C1CCCCC1, where the terminal '1' indicates closure back to the first carbon. When more than nine ring-closure labels are needed simultaneously, two-digit labels preceded by % are used (e.g., %10) [11].
Aromaticity Indication: Aromatic systems are represented using lowercase atomic symbols (e.g., c, n, o), with aromatic bonds typically implied rather than explicitly stated [9]. Benzene is canonically represented as c1ccccc1, where the alternating single and double bonds of Kekulé structures are replaced by delocalized aromatic bonds.
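The syntax elements above can be recognized with a small regular-expression tokenizer. The sketch below covers only the constructs discussed here (bracket atoms, the two-letter halogens, one-letter organic-subset atoms, bonds, branches, dots, and ring-closure labels); a production tokenizer such as RDKit's handles many more cases.

```python
import re

# Minimal SMILES tokenizer (sketch): bracket atoms first, then the
# two-letter atoms Cl/Br, then one-letter atoms, bonds/branches, and
# ring-closure labels (single digit or %NN).
TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [Na+], [13C]
    r"|Cl|Br"                # two-letter organic-subset atoms
    r"|[BCNOPSFI]|[bcnops]"  # one-letter atoms (lowercase = aromatic)
    r"|[-=#:/\\.()]"         # bonds, branches, dot disconnection
    r"|%\d{2}|\d"            # ring-closure labels
)

def tokenize(smiles):
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Note the ordering of the alternatives: Cl and Br must be tried before the one-letter atoms, otherwise Cl would tokenize as a carbon followed by an unrecognized character.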
SMILES supports precise specification of molecular stereochemistry through specialized notation:
Tetrahedral Chirality: Chiral centers are indicated using @ and @@ symbols to specify anticlockwise and clockwise ordering of substituents, respectively [9]. For example, the two enantiomers of alanine are distinguished as N[C@H](C)C(=O)O and N[C@@H](C)C(=O)O.
Double Bond Stereochemistry: The configuration around double bonds is specified using / and \ symbols to indicate directional bonds [9]. For example, C/C=C/C represents the E-isomer (trans configuration) of 2-butene, while C/C=C\C represents the Z-isomer (cis configuration).
Advanced Stereocenters: SMILES supports more complex stereochemistry through specialized notation for allene-type (@AL1, @AL2), square-planar (@SP1-3), trigonal bipyramidal (@TB1-20), and octahedral (@OH1-30) stereocenters [9].
Despite its systematic grammar, SMILES presents several common pitfalls that can lead to invalid representations:
Ambiguous Ring Labels: Opening a ring-closure digit that is never matched creates an ambiguous, unresolvable connection, as in CC1CC1C1, where the final '1' opens a ring bond that is never closed [11].
Unmatched Parentheses: Failure to properly close branches with parentheses results in invalid syntax, such as C(C(C which lacks necessary closing parentheses [11].
Invalid Bonding Patterns: Strings may specify chemically impossible bonding, such as CO=CC where the oxygen atom would need to form three bonds—exceeding the typical valence for neutral oxygen [14].
Aromaticity Mismatches: Incorrect specification of aromatic systems can lead to valence violations, particularly when mixing explicit bond symbols with lowercase aromatic atoms [10].
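All but the chemical pitfalls above can be caught by a lightweight syntactic pre-filter. The sketch below checks only parenthesis balance and ring-label pairing (%NN labels are ignored); it deliberately accepts chemically impossible strings such as CO=CC, since valence checking requires a real cheminformatics toolkit.

```python
# Syntactic pre-filter (sketch): parenthesis balance + ring-label pairing.
# Content inside bracket atoms is skipped so that digits in [13C] or
# [Fe+2] are not mistaken for ring-closure labels. No chemistry checked.
def syntax_ok(smiles):
    depth, in_bracket = 0, False
    open_rings = set()
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:          # ')' with no matching '('
                return False
            depth -= 1
        elif ch.isdigit():
            open_rings ^= {ch}      # first sighting opens a ring bond, second closes it
    return depth == 0 and not in_bracket and not open_rings

assert syntax_ok("C1CCCCC1")      # cyclohexane
assert not syntax_ok("C(C(C")     # unmatched parentheses
assert not syntax_ok("CC1CC1C1")  # dangling ring label
assert syntax_ok("CO=CC")         # syntactically fine, chemically impossible
```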
The following diagram illustrates the SMILES generation process and its relationship to common pitfalls:
Figure 1: SMILES Generation Workflow and Common Pitfalls. The process transforms a molecular structure into a valid SMILES string through sequential steps, with potential failure points leading to common syntax errors.
To objectively compare molecular representations, researchers employ standardized evaluation metrics and experimental protocols:
Syntactic Validity: Percentage of generated strings that conform to the representation's grammatical rules, regardless of chemical feasibility [14] [13]. For SMILES, this includes proper parentheses matching and ring closure numbering.
Semantic Validity: Percentage of syntactically valid strings that correspond to chemically feasible molecules with proper valences and bonding patterns [14]. This metric specifically addresses chemical plausibility beyond mere syntax.
Novelty: Percentage of generated molecules not present in the training dataset, measuring the model's ability to explore new chemical space rather than memorizing training examples [13].
Uniqueness: Percentage of distinct (non-duplicate) molecules within generated outputs, indicating diversity of sampling [14] [13].
Fréchet ChemNet Distance (FCD): Measures similarity between the distributions of generated molecules and training set molecules in a learned chemical space, with lower values indicating better distribution learning [13].
Goal-Directed Performance: Success rate in generating molecules satisfying specific property objectives, typically evaluated in benchmark tasks like penalized logP optimization or drug-likeness (QED) improvement [14].
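The first four metrics above compose naturally from set operations. The sketch below computes validity, uniqueness, and novelty for a batch of generated strings; `is_valid` is a hypothetical stand-in for a real checker (e.g., RDKit parsing), and the example strings are toy data. FCD is omitted, as it requires a trained ChemNet model.

```python
# Distribution-learning metrics (sketch). `is_valid` is injected so the
# same code works with any representation's validity checker.
def evaluate(generated, training_set, is_valid):
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}

stats = evaluate(
    generated=["CCO", "CCO", "C1CC1", "C(("],        # toy batch
    training_set={"CCO"},                            # toy training set
    is_valid=lambda s: s.count("(") == s.count(")"), # toy checker
)
# validity 3/4, uniqueness 2/3, novelty 1/2 on this toy batch
```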
Comparative studies typically utilize established chemical databases to ensure reproducible evaluation:
ChEMBL: A manually curated database of bioactive molecules with drug-like properties, containing approximately 2 million compounds [14] [13].
ZINC: A commercially available database of over 230 million purchasable compounds for virtual screening, frequently used for pre-training molecular generative models [14].
GDB-13: An enumerated database of nearly 1 billion small organic molecules containing up to 13 atoms of C, N, O, S, and Cl, following simple chemical stability and synthetic feasibility rules [13].
QM9: A comprehensive dataset of 134k stable small organic molecules with up to 9 heavy atoms (C, O, N, F) optimized at DFT level, providing quantum chemical properties [14].
Standardized experimental protocols enable fair comparison across molecular representations:
Data Preprocessing: All datasets are standardized by removing duplicates, invalid structures, and inorganic compounds. For SMILES, canonicalization may be applied to ensure consistent representation [11].
Model Architecture Consistency: Identical neural network architectures (e.g., LSTM, Transformer) are used across representations, with only the tokenization and embedding layers modified to accommodate different syntax [1] [13].
Training-Testing Splits: Standardized data splits (typically 80-10-10 for training-validation-testing) are used with random sampling to ensure comparable evaluation conditions [14].
Hyperparameter Optimization: Grid or random search is performed for each representation to identify optimal training parameters, acknowledging that different representations may require distinct hyperparameter configurations [1].
Statistical Significance Testing: Multiple training runs with different random seeds are conducted, with performance metrics reported as mean ± standard deviation to account for training stochasticity [13].
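A reproducible 80-10-10 split as described above takes only a few lines; seeding the shuffle ensures that repeated runs, and every representation under comparison, see identical partitions. The function name and default fractions here are illustrative.

```python
import random

def split_dataset(molecules, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle deterministically, then cut into train/validation/test."""
    mols = list(molecules)
    random.Random(seed).shuffle(mols)  # seeded: reproducible across runs
    n_train = int(frac[0] * len(mols))
    n_valid = int(frac[1] * len(mols))
    return (mols[:n_train],
            mols[n_train:n_train + n_valid],
            mols[n_train + n_valid:])

train, valid, test = split_dataset([f"mol_{i}" for i in range(1000)])
```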
The following table summarizes key experimental parameters used in comparative studies of molecular representations:
Table 1: Standard Experimental Parameters for Molecular Representation Comparison
| Experimental Parameter | Typical Configuration | Variants | Purpose |
|---|---|---|---|
| Model Architecture | LSTM with 3 layers, 512 hidden units | Transformer, GRU, VAE | Ensure architecture consistency across representations |
| Training Dataset Size | 50k-250k molecules | 1M+ for pre-training | Balance computational cost and statistical power |
| Tokenization Method | Byte Pair Encoding (BPE) | Atom Pair Encoding (APE), Regex | Optimize sequence segmentation for each representation |
| Evaluation Metrics | Validity, Novelty, Uniqueness, FCD | Scaffold Similarity, Property Statistics | Comprehensive performance assessment |
| Goal-Directed Tasks | Penalized logP, QED, DRD2 | Multi-property optimization | Measure utility for practical molecular design |
The limitations of standard SMILES have motivated diverse approaches to improve molecular representation:
Canonical SMILES: Implements a deterministic algorithm to generate a unique SMILES string for each molecule, ensuring consistent representation across databases and software platforms [9]. While eliminating redundancy, canonicalization imposes an arbitrary traversal order that may not reflect chemical similarity.
DeepSMILES: Simplifies SMILES syntax by using postfixed ring closures and branch endings to reduce parentheses nesting, addressing common grammatical errors in generated strings [12] [14]. However, it still permits semantically invalid structures with incorrect atom valences [14].
SELFIES (Self-Referencing Embedded Strings): A robust representation where every possible string corresponds to a valid molecular graph through formal grammar that enforces chemical constraints [1] [13]. SELFIES achieves nearly 100% semantic validity by design but may produce less compact representations [13].
t-SMILES (tree-based SMILES): A fragment-based representation that describes molecules using SMILES-type strings obtained through breadth-first traversal of full binary trees derived from fragmented molecular graphs [14]. This approach introduces only two additional symbols (& and ^) while enabling multi-scale molecular description.
Recent systematic evaluations provide empirical data on the comparative performance of molecular representations:
Table 2: Performance Comparison of Molecular Representations on Standard Benchmarks
| Representation | Syntactic Validity (%) | Semantic Validity (%) | Novelty (%) | Uniqueness (%) | FCD (↓) | Goal-Directed Performance |
|---|---|---|---|---|---|---|
| SMILES | 96.8 ± 2.1 | 90.2 ± 3.5 | 99.3 ± 0.5 | 94.7 ± 2.3 | 1.24 ± 0.31 | Baseline |
| DeepSMILES | 98.5 ± 1.3 | 91.7 ± 2.8 | 98.9 ± 0.7 | 95.2 ± 1.9 | 1.31 ± 0.28 | -5% to +3% vs SMILES |
| SELFIES | 100 ± 0.0 | 100 ± 0.0 | 99.1 ± 0.6 | 93.8 ± 2.5 | 1.89 ± 0.42 | -12% to -27% vs SMILES |
| t-SMILES | 99.3 ± 0.7 | 98.5 ± 1.2 | 98.5 ± 0.9 | 97.4 ± 1.4 | 0.87 ± 0.19 | +15% to +38% vs SMILES |
Data compiled from comparative studies [14] [13] using ChEMBL and ZINC datasets evaluated under identical model architectures (LSTM with 512 hidden units) and training regimes.
The empirical data reveals fundamental trade-offs in molecular representation design:
Validity vs. Exploration: Representations guaranteeing 100% validity (SELFIES) demonstrate impaired distribution learning, as measured by higher FCD values, suggesting that the ability to generate invalid structures provides a self-corrective mechanism that filters low-likelihood samples [13]. Invalid SMILES are sampled with significantly lower likelihoods than valid ones, effectively serving as a built-in quality filter.
Fragmentation vs. Expressivity: Fragment-based approaches like t-SMILES reduce the search space and improve performance in goal-directed tasks but require predefined fragmentation rules that may introduce biases [14]. The optimal fragmentation strategy varies across chemical domains and optimization objectives.
Syntax Complexity vs. Learning Efficiency: Simplified grammars (DeepSMILES) reduce syntactic errors but may obscure meaningful chemical patterns, potentially explaining their inconsistent performance across tasks [14].
The following diagram illustrates the conceptual trade-offs between major molecular representations:
Figure 2: Trade-offs in Molecular Representation Design. Different representations balance chemical constraints, exploration capacity, and performance objectives differently, leading to distinct strength and weakness profiles.
SMILES alignment guidance addresses the non-uniqueness problem through techniques that standardize or leverage multiple SMILES representations:
SMILES Enumeration: Generating multiple non-canonical SMILES strings for each molecule through randomized atom ordering, effectively augmenting training datasets and improving model robustness [15]. Studies demonstrate this approach can yield 130-fold dataset expansion with corresponding improvements in property prediction accuracy (R² increase from 0.56 to 0.66) [15].
Root-Aligned SMILES (R-SMILES): Designating a common root atom for reactant and product SMILES in reaction prediction tasks, reducing edit distance by over 50% and concentrating model attention on the actual reaction center [15].
Latent Space Alignment: Encoding multiple SMILES representations per molecule using parallel RNNs followed by atom-level pooling to create nearly bijective latent representations, enabling more effective property optimization [15].
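The enumeration technique in the first bullet can be sketched with RDKit (an assumption: RDKit is installed; doRandom is an option of Chem.MolToSmiles that starts the graph traversal at a random atom). Every enumerated variant canonicalizes back to the same string, which is what makes this augmentation label-preserving.

```python
# SMILES enumeration sketch (requires RDKit; guarded so the snippet
# degrades gracefully when the library is absent).
try:
    from rdkit import Chem

    def enumerate_smiles(smiles, n=10):
        mol = Chem.MolFromSmiles(smiles)
        # doRandom=True randomizes the traversal start, yielding
        # different but equivalent strings for the same molecule
        variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)}
        # all variants canonicalize back to one string
        canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(v)) for v in variants}
        return variants, canonical
except ImportError:
    enumerate_smiles = None  # RDKit not installed
```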
SMILES Arbitrary Target Specification (SMARTS) extends SMILES for substructural pattern matching rather than complete molecular representation [10]. Key features include:
Atomic Primitives: Enhanced atomic specification using symbols like [C] for aliphatic carbon, [#6] for any carbon by atomic number, and [c] for aromatic carbon [10].
Logical Operators: Boolean logic for atomic properties, including ! (negation), & (conjunction), and , (disjunction) to build complex queries [10].
Recursive SMARTS: Self-referential patterns using [$(*C)] to identify atoms connected to specific substructures, enabling sophisticated chemical environment queries [10].
TokenSMILES introduces a grammatical framework that standardizes SMILES into structured sentences composed of context-free words [12]. By applying five syntactic constraints (branch limitations, balanced parentheses, aromaticity exclusion), TokenSMILES minimizes redundant enumerations while maintaining valence compliance through semantic parsing rules [12]. Implemented in the open-source SmilX tool, this approach generates valid SMILES with accuracy comparable to existing implementations for molecules with low hydrogen deficiency (HDI ≤ 4) [12].
Table 3: Essential Software Tools and Resources for Molecular Representation Research
| Tool/Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| RDKit | Open-source cheminformatics library | SMILES validation, manipulation, and visualization | Fundamental tool for molecular I/O operations and structure depiction |
| SmilX | Open-source tool | TokenSMILES implementation and grammatical validation | Grammar-based SMILES standardization and analysis |
| SELFIES Library | Python library | SELFIES encoding/decoding with guaranteed validity | Robust molecular generation with 100% validity guarantee |
| t-SMILES Framework | Python implementation | Fragment-based molecular representation | Multi-scale molecular description and generation |
| ChemBERTa | Pre-trained transformer model | Chemical language understanding | SMILES-based property prediction and embedding generation |
| ChEMBL Database | Curated chemical database | Bioactive molecule data with associated properties | Training data for predictive and generative models |
| ZINC Database | Commercial compound database | Purchasable compounds for virtual screening | Real-world chemical space for practical applications |
| Daylight Toolkit | Commercial cheminformatics toolkit | SMARTS pattern matching and chemical computation | Industrial-strength substructure searching and analysis |
This systematic analysis reveals that molecular representation selection involves nuanced trade-offs rather than universal superiority. SMILES remains widely adopted due to its balance of simplicity, expressivity, and compatibility with natural language processing techniques, despite its well-documented validity limitations [11] [9]. The perception of invalid SMILES as a critical shortcoming requires reconsideration in light of evidence that this property provides a self-corrective mechanism that filters low-likelihood samples and enhances distribution learning [13].
Emerging representations address specific SMILES limitations while introducing new considerations. SELFIES guarantees validity but may constrain chemical space exploration [13], while t-SMILES demonstrates superior performance in goal-directed tasks through fragment-based representation but requires careful fragmentation scheme selection [14]. TokenSMILES and grammatical approaches offer promising directions for standardized, interpretable representation [12].
Future research directions include hybrid representation systems that leverage complementary strengths, dynamic representation selection based on specific tasks, and integration of three-dimensional structural information currently absent from SMILES-based representations [14]. As chemical language models continue to evolve, the optimal molecular representation may ultimately depend on the specific application context, with different representations excelling in distribution learning, constrained optimization, or interpretability tasks.
In computational chemistry and AI-driven drug discovery, how molecules are represented as computer-readable strings is foundational. For decades, the Simplified Molecular Input Line Entry System (SMILES) has been the dominant language, providing a concise text-based representation of molecular structures [1]. However, SMILES has a significant weakness: its complex grammar means that randomly generated or machine-learned SMILES strings often represent chemically invalid or impossible molecules. This flaw hinders automated molecular design [16]. SELFIES (SELF-referencing Embedded Strings), a newer representation developed in 2019, was designed specifically to solve this problem. Its key innovation is a 100% robustness guarantee; every possible SELFIES string corresponds to a valid molecule, making it a powerful tool for generative AI models in chemistry and material science [16].
This guide objectively compares SELFIES against SMILES and other representations, providing researchers with the experimental data and methodological context needed to select the right tool for their computational workflows.
The fundamental difference between SELFIES and SMILES lies in their underlying grammar and how they handle complex molecular features.
SELFIES is based on a formal grammar (Chomsky type-2). One can think of a SELFIES string as a small computer program that a compiler runs to build a molecular graph. This derivation process uses a "state" or minimal memory that tracks valences and bonding patterns. This stateful derivation prevents physical impossibilities, such as an oxygen atom forming four bonds, which a stateless SMILES string might inadvertently allow [16].
The diagram below visualizes the derivation process of a SELFIES string into a molecular graph.
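A toy decoder makes the stateful derivation concrete. This is an illustration only, not the real selfies library: each token requests an element and a bond order, and the decoder clamps that order to the valence still free on both atoms, so no token stream can produce an over-bonded atom.

```python
# Toy SELFIES-style stateful derivation (illustrative; NOT the selfies
# library). The derivation state is the unused valence per atom.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def derive(tokens):
    """Decode (element, requested_bond_order) tokens into (atoms, bonds),
    clamping each bond to the valence both partners still have free."""
    atoms, bonds, remaining = [], [], []
    for elem, want in tokens:
        atoms.append(elem)
        remaining.append(VALENCE[elem])
        if len(atoms) > 1:
            prev = len(atoms) - 2
            order = min(want, remaining[prev], remaining[-1])
            if order > 0:
                bonds.append((prev, len(atoms) - 1, order))
                remaining[prev] -= order
                remaining[-1] -= order
    return atoms, bonds

# A "[C][=O][=C]"-like stream: the second double bond is impossible
# (the oxygen has no valence left), so it is clamped away, not rejected.
atoms, bonds = derive([("C", 1), ("O", 2), ("C", 2)])
```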
Multiple studies have quantitatively evaluated SELFIES and SMILES across various AI model types and tasks. The following tables summarize key experimental findings.
Table 1: Performance in Molecular Property Prediction Tasks (ROC-AUC)
| Model / Task | SMILES (with Augmentation) | SELFIES (with Augmentation) | Performance Delta | Notes & Source |
|---|---|---|---|---|
| QK-LSTM on SIDER Dataset (Classical) | 0.821 (Baseline) | 0.869 | +5.97% | [2] |
| QK-LSTM on SIDER Dataset (Hybrid Quantum) | 0.819 (Baseline) | 0.868 | +5.91% | [2] |
| BERT-based Model (HIV, Tox, BBBP) | Varies by tokenizer | Varies by tokenizer | N/A | APE tokenizer with SMILES often led [1] |
Table 2: Performance in Generative Modeling and De Novo Design
| Metric / Model Type | SMILES | SELFIES | Notes & Source |
|---|---|---|---|
| Validity Rate (General) | Often <80% in early generative models | 100% (by design) | [16] |
| Validity Rate (Diffusion Model) | High (Metric-dependent) | High (Metric-dependent) | Performance is task-dependent [17] |
| Novelty (Diffusion Model) | High | Moderate | IUPAC led in novelty [17] |
| Diversity / QED Score (Diffusion Model) | Good (SAscore leader) | Best (Tied with SMARTS) | [17] |
| Application in STONED Algorithm | Limited by low validity | Enabled highly efficient exploration | [16] |
Table 3: Direct Comparison of Core Features and Attributes
| Feature | SMILES | SELFIES |
|---|---|---|
| Core Principle | Graph traversal string | Formal grammar-based derivation |
| Robustness Guarantee | No | Yes |
| Handling of Aromaticity | Lowercase atoms / ':' symbol | Symbol-based, localized |
| Representation of Rings | Two numbers (non-local) | One symbol + length (local) |
| Representation of Branches | Parentheses (non-local) | One symbol + length (local) |
| Human Readability | High (for trained chemists) | Moderate |
| Machine Learnability | Complex grammar can be challenging | Evidence suggests it is easier for models [16] |
To critically assess the data in the tables, it is essential to understand the experimental designs that generated them.
A 2025 study provided the first analysis of data-augmented SELFIES for molecular property prediction [2].
A 2025 comparative study benchmarked four representations (SMILES, SELFIES, SMARTS, IUPAC) using a state-of-the-art diffusion model [17].
The workflow of this comparative study is summarized in the diagram below.
For researchers interested in implementing SELFIES in their projects, the following tools and resources are essential.
Table 4: Key Research Reagents and Computational Tools
| Item / Resource | Function / Purpose | Availability / Installation |
|---|---|---|
| selfies Python Library | The core library for converting between SMILES and SELFIES, and for manipulating SELFIES strings. | pip install selfies [16] |
| SELFIES Documentation | Comprehensive guide on syntax, API, and use cases. | GitHub Repository [16] |
| Standard Cheminformatics Libraries | Used in conjunction with SELFIES for molecule handling and validation (e.g., RDKit, OpenBabel). | - |
| Benchmark Datasets (e.g., MoleculeNet) | Standardized datasets like SIDER for training and fairly comparing model performance. | MoleculeNet [2] |
| Pre-trained Models (e.g., ChemBERTa) | Transformer models pre-trained on large molecular corpora (often in SMILES) that can be fine-tuned. | Hugging Face / Literature [1] |
The experimental evidence clearly demonstrates that SELFIES is not just a theoretical improvement but a practical solution to the robustness problem that plagued SMILES in generative AI. Its guaranteed validity simplifies model architectures, enables new combinatorial and evolutionary algorithms like STONED, and shows competitive, if not superior, performance in key predictive and generative tasks.
However, the "best" representation is context-dependent. While SELFIES is unparalleled for robustness and is a top choice for generative design, other representations like SMILES (with modern tokenizers) can excel in specific predictive tasks, and IUPAC can offer advantages in novelty [1] [17]. The future lies in developing representation-agnostic models and hybrid approaches that can leverage the unique strengths of each molecular language to accelerate the discovery of new functional molecules and materials.
Molecular representations form the foundational layer of computational chemistry and drug discovery, serving as the critical bridge between a chemical structure and its digital interpretation by algorithms. The accurate encoding of molecular features—especially complex structural aspects like rings, branches, and aromaticity—is paramount for predicting biological activity, physicochemical properties, and ultimately, the success of AI-driven drug design campaigns. This guide objectively compares the performance of three predominant molecular representation methods: the Simplified Molecular-Input Line-Entry System (SMILES), SELFIES (SELF-referencing Embedded Strings), and graph-based representations. Framed within a broader thesis on evaluating the accuracy of these representations, this analysis synthesizes current research to illustrate how each method handles key structural elements, with a particular focus on their implications for real-world research and development applications. The choice of representation directly influences a model's ability to navigate chemical space, perform valid scaffold hopping, and generate novel therapeutic compounds, making this comparison essential for researchers and drug development professionals.
Introduced in 1988, SMILES is a line notation method that describes the structure of a chemical species using short ASCII strings. It encodes a molecular graph as a sequence of characters representing atoms, bonds, brackets for branches, and numbers for ring closures. For instance, the SMILES for benzene is simply c1ccccc1, denoting a six-membered aromatic ring. Its widespread adoption is due to its human-readability and compact form. However, a significant limitation is that a single molecule can have multiple valid SMILES strings, which can complicate learning for AI models. Furthermore, SMILES does not inherently guarantee syntactic or semantic validity; a small change in the string can lead to an invalid molecular structure, posing a challenge for generative models [3].
SELFIES is a more recent string-based representation (a successor to SMILES) designed specifically to address the issue of validity in generative AI models. The key innovation of SELFIES is its syntax, which ensures that every possible string corresponds to a valid molecular graph. It uses a grammar of derived rules that self-reference the growing molecular structure, preventing the formation of impossible bonds or atoms. This makes SELFIES exceptionally robust for de novo molecular design and machine learning applications, as it virtually eliminates the generation of invalid structures during the exploration of chemical space, thereby improving the efficiency of AI-driven discovery pipelines [3].
Graph-based representations model a molecule directly as a mathematical graph, where atoms are represented as nodes and bonds as edges. This approach most closely mirrors the actual topological structure of a molecule. In computational models, particularly Graph Neural Networks (GNNs), this native representation allows for the direct propagation and aggregation of information between connected atoms. This inherent structural fidelity enables GNNs to excel at capturing complex relational patterns and physical inductive biases within a molecule, leading to state-of-the-art performance on many predictive tasks. Unlike string-based methods, graphs are inherently invariant to the ordering of atoms, which can simplify the learning process [3].
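The graph view described here can be sketched with nothing more than an adjacency structure (an illustrative toy, not a cheminformatics library):

```python
# A molecule as a graph: atoms are nodes, bonds are edges -- the native
# input form for GNNs. Benzene's ring is simply a 6-cycle.
benzene_atoms = ["C"] * 6
benzene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # ring closes

def adjacency(n_atoms, bonds):
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj

adj = adjacency(len(benzene_atoms), benzene_bonds)

# The ring is inherently valid: every atom has exactly two neighbours, with
# no ring-closure digits or bracket syntax to get wrong.
assert all(len(neighbours) == 2 for neighbours in adj.values())

# Relabelling atoms (the graph analogue of writing a different SMILES for the
# same molecule) leaves graph invariants such as the degree sequence unchanged.
relabelled = [(5 - a, 5 - b) for a, b in benzene_bonds]
adj2 = adjacency(6, relabelled)
assert sorted(len(v) for v in adj.values()) == sorted(len(v) for v in adj2.values())
print("degree sequence:", sorted(len(v) for v in adj.values()))
```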
The following tables summarize the performance of SMILES, SELFIES, and graph-based representations across various benchmark tasks and their specific handling of key structural features.
Table 1: Overall Performance on Benchmark Molecular Property Prediction Tasks (MoleculeNet)
| Representation | Average AUC (k=4) | Average AUC (k=6) | Handling of Syntactic Validity | Interpretability |
|---|---|---|---|---|
| SMILES | 0.811 | 0.798 | Low | Medium |
| SELFIES | 0.819 | 0.805 | High (100%) | Medium |
| Molecular Graph | 0.832 | 0.821 | Inherent | High |
| Multi-View (MoL-MoE) | 0.847 | 0.839 | Varies by component | Medium |
Table 2: Handling Key Structural Features - A Qualitative and Functional Comparison
| Structural Feature | SMILES | SELFIES | Graph-Based |
|---|---|---|---|
| Ring Systems | Encoded via ring closure numbers (e.g., C1CCCC1 for cyclopentane). Can be ambiguous and invalid. | Encoded with guaranteed validity; derived rules prevent ring errors. | Directly represented as cycles in the graph; inherently valid. |
| Branching | Handled with parentheses (e.g., CC(O)C for isopropanol). Syntax errors can create invalid branches. | Robust branching via a self-referencing grammar that ensures correctness. | Directly represented as tree-like node connections; no syntax issues. |
| Aromaticity | Implicitly denoted by lowercase atom symbols (e.g., c for aromatic carbon). Must be perceived by the model. | Similar implicit encoding as SMILES, but with validity guarantees. | Often treated as a formal bond type or as a node/edge feature; explicit. |
| Representation Invariance | Low (single molecule has multiple valid strings). | Low (similar to SMILES). | High (inherently invariant to atom ordering). |
| Primary Strength | Human-readable, widespread use. | Guaranteed molecular validity. | Native structural representation, high predictive accuracy. |
The data in Table 1, derived from studies on Multi-View Mixture-of-Experts models, demonstrates that while graph-based representations often lead in predictive accuracy on tasks like toxicity and solubility prediction, SELFIES provides a crucial advantage in validity-guaranteed generation. The MoL-MoE framework, which integrates SMILES, SELFIES, and graph views, consistently achieves superior performance by leveraging the complementary strengths of each representation [18]. Table 2 highlights the core structural differences: SMILES and SELFIES are sequential and can struggle with invariance, while graphs are topological and naturally invariant. SELFIES' main contribution is its robust handling of ring and branch syntax to ensure validity, whereas graphs excel at natively representing complex interconnected structures.
A pivotal experiment in evaluating molecular representations is the implementation of a Multi-View Mixture-of-Experts (MoL-MoE) framework, in which SMILES, SELFIES, and graph inputs are routed through dedicated encoders and a trainable gating network weights the resulting expert features for each prediction.
Experimental protocols for evaluating how well representations capture aromaticity often rely on quantum chemical calculations, which serve as ground truth.
Table 3: Essential Software Tools for Molecular Representation and Analysis
| Tool Name | Type/Function | Application in Representation Research |
|---|---|---|
| RDKit | Cheminformatics Toolkit | The de facto standard for converting between SMILES, SELFIES, and graph representations, calculating molecular descriptors, and generating fingerprints. |
| PyMOL | Molecular Visualization System | Used for creating publication-quality images of molecular structures, validating ring systems, and aromaticity visualization [20]. |
| VMD | Visual Molecular Dynamics | A platform for displaying, animating, and analyzing large biomolecular systems, useful for analyzing dynamic molecular behavior [21]. |
| ChimeraX | Next-Gen Molecular Modeling | An interactive system for analysis and presentation graphics of molecular structures and related data, with advanced visualization features [20]. |
| Deep Learning Frameworks (PyTorch, TensorFlow) | ML Model Development | Essential for building and training Graph Neural Networks, Transformers, and other models that learn from molecular representations. |
| MoleculeNet | Benchmark Dataset Collection | A standardized set of molecular property datasets used to fairly evaluate and compare the performance of different representation methods. |
The following diagram illustrates the workflow of a Multi-View Mixture-of-Experts (MoL-MoE) system, which integrates multiple molecular representations to enhance predictive performance.
This workflow demonstrates how the MoL-MoE framework processes SMILES (yellow), SELFIES (red), and molecular graphs (blue) through dedicated encoders. The gating network (red diamond) then dynamically weights the contributions of the various experts (green), which process features from one or more representations, to produce a final, more accurate prediction. This architecture leverages the complementary strengths of each representation, allowing the model to, for instance, use the graph's topological accuracy for ring-based properties and SELFIES' validity for generative tasks [18].
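The gating computation described above can be sketched as follows (a toy illustration with invented numbers, not the MoL-MoE implementation):

```python
import math

# Mixture-of-experts combination: a gating vector softmax-weights the
# per-representation expert outputs into a single prediction.
def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-expert predictions for one molecule, in the order
# (SMILES expert, SELFIES expert, graph expert).
expert_outputs = [0.62, 0.58, 0.71]
gate_logits = [0.2, 0.1, 1.5]        # here the gate favours the graph expert

weights = softmax(gate_logits)
prediction = sum(w * y for w, y in zip(weights, expert_outputs))

assert abs(sum(weights) - 1.0) < 1e-9            # weights form a distribution
assert min(expert_outputs) <= prediction <= max(expert_outputs)
print([round(w, 3) for w in weights], round(prediction, 3))
```

The final prediction is a convex combination of the experts, so the gate can lean on the graph view for topology-sensitive properties while still drawing on the string views.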
The evaluation of SMILES, SELFIES, and graph-based representations reveals a landscape defined by trade-offs. No single representation is universally superior; each excels in different facets of the molecular modeling pipeline. SMILES remains a versatile and widely supported standard but is hampered by validity issues. SELFIES presents a powerful solution for generative chemistry by guaranteeing molecular validity, ensuring that AI models explore syntactically correct regions of chemical space. Graph-based representations, by most closely mirroring the true structure of a molecule, consistently deliver high predictive accuracy for property forecasting.
The most promising future direction, as evidenced by the superior performance of the Multi-View MoL-MoE model, lies in hybrid approaches. By integrating the unique strengths of each representation—the sequential patterns in SMILES, the syntactic robustness of SELFIES, and the topological fidelity of graphs—researchers can build more powerful, accurate, and generalizable AI tools for drug discovery. This multi-faceted approach will be crucial for tackling complex challenges such as accurate scaffold hopping, de novo molecular design, and the reliable prediction of complex ADMET properties, ultimately accelerating the development of new therapeutics.
In the field of computational chemistry and drug discovery, machines do not perceive molecules as physical structures but as specialized languages. Molecular string representations like SMILES (Simplified Molecular-Input Line-Entry System) and SELFIES (SELF-referencing Embedded Strings) have become the predominant alphabets for this dialogue [1]. However, for machine learning models to comprehend these strings, they must first be broken down into digestible pieces through a process called tokenization. This process is far from trivial; the choice of how a molecule is split into tokens can significantly influence a model's ability to predict drug toxicity, biological activity, or optimize for a desired property [22] [23].
This guide objectively compares the performance of leading tokenization techniques applied to SMILES and SELFIES representations. By synthesizing recent research and experimental data, we provide a clear framework for researchers and drug development professionals to select the most appropriate tokenization strategy for their specific applications.
Before delving into tokenization, it is essential to understand the two primary chemical string representations they are designed to process.
The following workflow outlines the typical process of using these representations in a machine learning pipeline, from raw molecule to model prediction:
Tokenization is the crucial bridge between a raw chemical string and a machine-learning model. It defines the model's basic vocabulary and influences how it "understands" chemical grammar. The table below summarizes the core tokenization methods evaluated in recent literature.
| Tokenization Method | Core Principle | Key Advantages | Key Limitations |
|---|---|---|---|
| Atom-wise [22] [25] | Splits string into atoms, bonds, and rings using rules/regular expressions. | Chemically intuitive; simple to implement; widely used. | Treats identical atoms the same regardless of environment; may not capture context [23]. |
| Byte Pair Encoding (BPE) [1] | Data-driven; iteratively merges most frequent character pairs. | Reduces vocabulary size; can capture common substructures. | Can create chemically ambiguous tokens; may split atoms unnaturally [25]. |
| Atom Pair Encoding (APE) [1] [26] | Starts with atom-wise tokens, then learns to merge frequent atom pairs. | Preserves chemical integrity better than BPE; enhances contextual relationships [1]. | A closed-vocabulary method; performance depends on training data. |
| Atom-in-SMILES (AIS) [23] | Represents each atom by its local chemical environment (like a circular fingerprint). | Highly chemically accurate; reduces token ambiguity and repetition. | Increases sequence length and vocabulary size; more complex to implement. |
| Smirk [25] | Fully decomposes complex atoms (e.g., [C@@H]) into constituent glyphs. | Guarantees complete coverage of the SMILES specification; no unknown tokens. | Can lead to long sequence lengths; may increase computational cost. |
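The atom-wise scheme in the table above can be sketched with a single regular expression (a simplified variant of the pattern widely used for SMILES models; it omits some rarer symbols):

```python
import re

# Atom-wise tokenization: bracket atoms, two-letter elements (Br, Cl),
# two-digit ring labels (%NN), single atoms, bonds, branches and ring
# digits each become exactly one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-+\\/:~@\.\(\)]|\d)"
)

def atomwise_tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(atomwise_tokenize("CC(=O)Oc1ccccc1"))   # phenyl acetate
# -> ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```

Note how `Cl` and `Br` are matched before single letters, which is exactly the chemical integrity that character-level BPE can violate.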
Numerous studies have systematically evaluated how these tokenization strategies, combined with different molecular representations, impact model performance on standard chemical tasks.
The following table consolidates quantitative results from key studies, measured by the Area Under the Receiver Operating Characteristic curve (ROC-AUC) for classification tasks and Root Mean Square Error (RMSE) for regression tasks. Higher AUC and lower RMSE indicate better performance.
Table: Performance Comparison of Tokenization Schemes on MoleculeNet Benchmarks (ROC-AUC / RMSE)
| Model / Tokenizer | HIV | BBBP | Tox21 (SR-p53) | ClinTox | ESOL (RMSE) |
|---|---|---|---|---|---|
| SMILES + BPE [1] [22] | 0.769 | 0.859 | 0.817 | 0.913 | ~1.0 |
| SMILES + Atom-wise [22] | 0.771 | 0.899 | 0.824 | 0.919 | ~1.0 |
| SMILES + APE [1] | 0.784 | 0.914 | 0.841 | - | - |
| SELFIES + BPE [1] [22] | 0.763 | 0.858 | 0.815 | 0.915 | ~1.0 |
| SELFIES + Atom-wise [22] | 0.772 | 0.897 | 0.823 | 0.918 | ~1.0 |
| SELFIES + APE [1] | 0.778 | 0.909 | 0.838 | - | - |
| SMILES + AIS [23] | - | - | - | - | 0.88 |
The experimental workflows cited in this guide rely on a suite of software tools and databases that form the essential toolkit for modern computational chemistry research.
Table: Key Research Tools and Databases for Chemical Language Modeling
| Item | Function | Relevance to Tokenization Research |
|---|---|---|
| RDKit [27] [22] | Open-source cheminformatics toolkit. | Used for generating canonical SMILES, drawing molecular images, calculating descriptors, and validating chemical structures. |
| Hugging Face Transformers/Tokenizers [1] [25] | Library for state-of-the-art NLP. | Provides implementations of transformer models (BERT, RoBERTa) and tokenizers (BPE, SentencePiece), which are adapted for chemical language. |
| PubChem [22] [25] | NIH's database of chemical molecules and their activities. | A primary source for large-scale, unlabeled molecular data used for pre-training chemical language models. |
| ChEMBL [27] [24] | Manually curated database of bioactive molecules with drug-like properties. | Commonly used for supervised fine-tuning on tasks related to drug discovery and toxicology. |
| MoleculeNet [27] [22] | A benchmark suite for molecular machine learning. | Provides standardized train/validation/test splits for fair evaluation of model performance on diverse chemical tasks. |
| SELFIES Python Library [22] | Official library for the SELFIES representation. | Enables conversion between SMILES and SELFIES, ensuring all strings represent valid molecules. |
The "tokenization challenge" in chemical language modeling does not have a single, universal solution. The optimal choice is a strategic decision that balances performance, interpretability, and computational cost.
As the field progresses, the synergy between chemically intelligent representations like SELFIES and advanced, context-aware tokenizers like APE and AIS will continue to push the boundaries of what's possible in accelerated drug discovery and materials science.
In computational chemistry, machine learning models rely on tokenizers to convert molecular string representations, such as SMILES and SELFIES, into manageable subunits. The choice of tokenization strategy significantly influences a model's ability to understand chemical structure and predict properties accurately. Byte Pair Encoding (BPE) is a widely adopted subword tokenization method in natural language processing that has been transferred to chemical language models [1]. However, its limitations in capturing chemical semantics have spurred the development of domain-specific alternatives like Atom Pair Encoding (APE) [1]. This guide objectively compares BPE and APE, providing experimental data to help researchers select the optimal tokenization strategy for molecular representation tasks within drug development pipelines.
BPE is a data compression technique that constructs a vocabulary by iteratively merging the most frequent pairs of characters or bytes [1]. In chemical language processing, BPE is typically applied to SMILES or SELFIES strings after an initial pre-tokenization step that often uses regular expressions to split the string into smaller pretokens [28]. A significant limitation of this approach is that pre-tokenization forces most common words (or molecular fragments) to be represented as single tokens, resulting in a skewed token distribution where frequent tokens dominate, and the vocabulary additions from larger sizes provide diminishing returns [29]. When applied to molecular representations, BPE often fails to capture the contextual relationships between chemical elements, as it operates purely on character frequency without chemical intelligence [1].
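The merge loop at the heart of BPE can be sketched in a few lines (a toy character-level trainer, not a production tokenizer):

```python
from collections import Counter

# Toy BPE training: repeatedly merge the single most frequent adjacent pair.
# The loop is purely frequency-driven -- it has no notion of chemistry, so it
# could just as happily split a two-letter element like 'Cl' into 'C' + 'l'.
def bpe_train(corpus, num_merges):
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))   # count adjacent token pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        for i, seq in enumerate(corpus):      # apply the merge corpus-wide
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(seq[j] + seq[j + 1])
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            corpus[i] = out
    return merges

corpus = [list("CCO"), list("CCN"), list("CC(C)O")]
print(bpe_train(corpus, 2))   # first merge fuses the ubiquitous ('C', 'C') pair
print(corpus)
```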
Atom Pair Encoding (APE) is a novel tokenization method specifically designed for chemical languages [1]. APE positions itself as a fusion of atom-wise tokenization and BPE principles [25]. Instead of starting from raw characters, APE begins with atom-level tokens and learns to merge adjacent pairs that frequently co-occur, forming chemically meaningful fragments [1]. This approach preserves the integrity and contextual relationships among chemical elements more effectively than BPE, as the merges are informed by the underlying chemical structure rather than just character statistics [1]. APE was developed to address BPE's limitations in capturing the structural nuances of molecules, thereby enhancing model performance on chemical property prediction tasks [1].
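The APE idea can be illustrated by moving the same frequency counting onto atom-level pretokens (the published APE is research code; this sketch only shows why atom-level pretokenization prevents chemically meaningless splits such as breaking `Cl` apart):

```python
import re
from collections import Counter

# Atom-level pretokenization: merges learned on top of these tokens can only
# ever fuse whole chemical units, never fragments of an atom symbol.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#\-+\(\)\d]")

def most_frequent_atom_pair(smiles_corpus):
    pairs = Counter()
    for s in smiles_corpus:
        toks = ATOM.findall(s)
        pairs.update(zip(toks, toks[1:]))
    return pairs.most_common(1)[0][0]

corpus = ["CCO", "CC(=O)O", "CCN", "ClCCl"]
print(most_frequent_atom_pair(corpus))   # the first APE merge candidate
```

An APE vocabulary is then built by repeating this count-and-merge step, so frequent fragments such as carbonyl or phenyl groups become single tokens while every token boundary stays chemically meaningful.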
Experimental studies have systematically compared BPE and APE tokenization within BERT-based models on standardized molecular property prediction tasks. Performance is typically evaluated using the ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) metric on benchmark datasets like HIV, toxicology (Tox21), and blood-brain barrier penetration (BBBP) from MoleculeNet [1] [26].
Table 1: Performance Comparison (ROC-AUC) of BPE vs. APE with SMILES and SELFIES Representations
| Dataset | Tokenization | Molecular Representation | ROC-AUC Score |
|---|---|---|---|
| HIV | BPE | SMILES | Baseline |
| HIV | APE | SMILES | Significant Improvement [1] |
| Toxicology (Tox21) | BPE | SMILES | Baseline |
| Toxicology (Tox21) | APE | SMILES | Significant Improvement [1] |
| Blood-Brain Barrier (BBBP) | BPE | SMILES | Baseline |
| Blood-Brain Barrier (BBBP) | APE | SMILES | Significant Improvement [1] |
| Various | BPE | SELFIES | Comparable to SMILES+BPE [1] |
| Various | APE | SELFIES | Improved over SELFIES+BPE [1] |
The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE across these tasks [1]. The performance advantage is attributed to APE's ability to preserve the integrity and contextual relationships among chemical elements, thereby enhancing the model's classification accuracy [1]. While SELFIES representations offer the advantage of guaranteed molecular validity, models using SELFIES with BPE achieved results comparable to, but not superior to, SMILES with BPE [1]. APE also demonstrated improved performance with SELFIES compared to using BPE with SELFIES [1].
To ensure reproducibility and provide clarity on the data presented, this section outlines the standard methodologies used in the key experiments comparing BPE and APE.
The following diagram illustrates the comparative workflow for tokenizing a molecular string using BPE and APE:
The following table details key resources, including tokenizers, molecular representations, and datasets, essential for conducting experiments in chemical language modeling.
Table 2: Key Reagents and Resources for Chemical Tokenization Research
| Category | Item | Function & Description | Example Sources / libraries |
|---|---|---|---|
| Molecular Representations | SMILES (Simplified Molecular Input Line Entry System) | Linear string notation representing molecular structure using ASCII characters [1]. | RDKit, OpenSMILES |
| | SELFIES (Self-Referencing Embedded Strings) | A robust molecular representation guaranteed to produce valid molecules from every string, addressing SMILES validity issues [1] [2]. | SELFIES library |
| Tokenization Algorithms | Byte Pair Encoding (BPE) | A general-purpose subword tokenization algorithm that merges frequent character pairs [1]. | Hugging Face Tokenizers, SentencePiece |
| | Atom Pair Encoding (APE) | A chemistry-specific tokenizer that merges frequent adjacent atom-level tokens to form meaningful fragments [1]. | Custom implementations (research code) |
| | Smirk | An atomically complete tokenizer that fully decomposes SMILES for maximum coverage and open-vocabulary modeling [28] [25]. | Smirk (GitHub) |
| Model Architectures | BERT-based Models | Transformer-based encoder models effective for pre-training on unlabeled data and fine-tuning for classification tasks [1]. | Hugging Face Transformers, ChemBERTa |
| Benchmarking Datasets | MoleculeNet | A benchmark collection of molecular datasets for evaluating machine learning models [2]. | HIV, Tox21, BBBP, SIDER |
| Software Libraries | Hugging Face Ecosystem | Provides open-source implementations of tokenizers, models, and training utilities, standardizing the experimental pipeline [1]. | Tokenizers, Transformers, Datasets |
The experimental evidence demonstrates that Atom Pair Encoding (APE) provides a significant performance advantage over generic Byte Pair Encoding (BPE) for molecular property prediction tasks. APE's core strength lies in its domain-specific design, which preserves chemical context and structural relationships more effectively than frequency-based BPE.
For researchers and scientists in drug development, the choice of tokenizer should align with project goals. APE is recommended when maximizing predictive accuracy on tasks like toxicity, activity, or permeability prediction is critical. Its ability to form chemically meaningful tokens directly enhances model comprehension. BPE may still be suitable for more general-purpose molecular modeling or when leveraging pre-existing, well-supported NLP pipelines, though with an expected performance trade-off. Furthermore, the emergence of open-vocabulary tokenizers like Smirk presents a promising alternative, ensuring coverage for diverse and complex molecules, including organometallics and inorganics, which are often poorly handled by closed-vocabulary approaches [28] [25].
In the field of computational chemistry and drug discovery, converting molecular structures into machine-readable numerical embeddings represents a foundational step for any AI-driven research pipeline. The choice of molecular representation directly influences the performance of downstream models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers, in predicting critical properties such as drug efficacy, toxicity, and metabolic characteristics [30]. This guide provides an objective comparison of the predominant string-based representations—SMILES and SELFIES—focusing on their experimental performance across different neural architectures. With the rising adoption of AI in pharmaceutical research, understanding the empirical strengths and limitations of these representations has become essential for building effective predictive models [31]. The following analysis synthesizes recent benchmarking studies to guide researchers in selecting optimal molecular embeddings for their specific applications.
String-based representations allow complex molecular structures to be processed by sequence models originally developed for natural language. The two primary formats offer different advantages and limitations.
SMILES (Simplified Molecular-Input Line-Entry System): This notation represents molecular structures using a compact string of ASCII characters, where atoms are denoted by elemental symbols, bonds by specific characters (e.g., '-', '=', '#'), branches by parentheses, and rings by numerical markers [2]. While widely adopted, SMILES has inherent limitations: it can generate multiple valid strings for the same molecule, lacks strict valency checks, and often produces syntactically or chemically invalid strings when processed by generative models [7].
SELFIES (Self-Referencing Embedded Strings): Developed to address SMILES limitations, SELFIES uses a robust grammar where every possible string corresponds to a valid molecular structure [2] [7]. Instead of complex grammar for rings and branches, SELFIES uses single symbols to represent these structural elements, explicitly encoding their length or size. This guarantee of validity makes SELFIES particularly valuable for generative tasks and applications requiring robust automated processing [2].
Recent studies have quantitatively evaluated how SMILES and SELFIES representations perform across different machine learning models and tasks. The table below summarizes key experimental findings from benchmark studies.
Table 1: Performance Comparison of SMILES vs. SELFIES Across Different Models and Tasks
| Model Architecture | Dataset/Task | Representation | Performance Metric | Result | Key Finding |
|---|---|---|---|---|---|
| QK-LSTM [2] | Molecular Property Prediction | Augmented SMILES | ROC-AUC | Baseline | Augmenting SELFIES yielded a 5.97% improvement in classical models and a 5.91% improvement in hybrid quantum-classical models compared to augmented SMILES. |
| QK-LSTM [2] | Molecular Property Prediction | Augmented SELFIES | ROC-AUC | +5.97% (Classical), +5.91% (Hybrid) | |
| Domain-Adapted Transformer [7] | ESOL (Solubility) | SMILES (ChemBERTa-zinc) | RMSE | ~1.130 (estimated from baseline) | A SMILES-pretrained model adapted to SELFIES matched or exceeded its original SMILES performance, achieving an RMSE of 0.944 on ESOL without architectural changes. |
| Domain-Adapted Transformer [7] | ESOL (Solubility) | SELFIES (Adapted) | RMSE | 0.944 | |
| Domain-Adapted Transformer [7] | FreeSolv (Hydration) | SELFIES (Adapted) | RMSE | 2.511 | The domain-adapted SELFIES model demonstrated effective knowledge transfer from SMILES, performing well on multiple physicochemical property prediction tasks. |
| Domain-Adapted Transformer [7] | Lipophilicity | SELFIES (Adapted) | RMSE | 0.746 | |
| SELFormer [7] | ESOL (Solubility) | SELFIES | RMSE | ~15% improvement over GEM (GNN) | A transformer pretrained from scratch on SELFIES (SELFormer) showed significant gains over a geometry-based graph neural network. |
| SELFormer [7] | SIDER (Side Effects) | SELFIES | ROC-AUC | ~10% improvement over MolCLR (GNN) | SELFormer also increased ROC-AUC by 10% on a key biomedical benchmark dataset. |
A comprehensive benchmarking study of 25 pretrained molecular embedding models across 25 datasets revealed a surprising result: nearly all neural models showed negligible or no improvement over the traditional ECFP (Extended Connectivity Fingerprint) molecular fingerprint [32]. Only one model, CLAMP (which also leverages fingerprint-based inputs), performed statistically significantly better than alternatives. This finding highlights that while modern neural representations are promising, traditional fingerprints remain strong baselines due to their computational efficiency and proven performance [32].
To ensure reproducible and fair comparisons, researchers have developed standardized protocols for evaluating molecular representations. This section details two key methodological approaches used in recent studies.
The Augmented Molecular Retrieval (AMORE) framework provides a zero-shot method to assess the chemical understanding of language models by testing their robustness to different SMILES representations of the same molecule [4]. The methodology is visualized below.
Diagram 1: AMORE framework for evaluating representation robustness.
Experimental Protocol:
SMILES Augmentation: Generate multiple valid SMILES strings for each molecule in the dataset through permutations such as randomized atom order, different branch arrangements, and varying ring labels [4]. These are identity transformations that do not change the underlying chemical structure.
Embedding Generation: Encode both original and augmented SMILES strings using the chemical language model being evaluated to produce vector embeddings for each representation [4].
Similarity Calculation: Compute distances (e.g., cosine similarity or Euclidean distance) between embeddings of different SMILES representations of the same molecule [4].
Robustness Assessment: A robust model should produce similar embeddings for all SMILES variants of the same molecule. If the nearest embedding to an augmented SMILES is not from the original molecule, it indicates the model may be overfitting to textual patterns rather than learning chemical semantics [4].
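Steps 3 and 4 can be sketched as follows (the embedding vectors here are invented placeholders standing in for real model outputs):

```python
import math

# Robustness check: a model that has learned chemistry (not string patterns)
# should embed every SMILES variant of a molecule to nearly the same vector.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Placeholder embeddings (in practice these come from the chemical LM).
embeddings = {
    "Cc1ccccc1": [0.90, 0.10, 0.40],   # toluene, canonical SMILES
    "c1ccccc1C": [0.88, 0.12, 0.41],   # toluene, alternative atom ordering
    "CCO":       [0.10, 0.85, 0.30],   # ethanol, a different molecule
}

same = cosine(embeddings["Cc1ccccc1"], embeddings["c1ccccc1C"])
diff = cosine(embeddings["Cc1ccccc1"], embeddings["CCO"])

# Variants of one molecule should sit closer together than distinct molecules;
# if this fails, the model is overfitting to textual patterns.
assert same > diff
print(round(same, 3), round(diff, 3))
```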
Experiments using AMORE have indicated that many chemical language models are still not robust to different SMILES representations, highlighting a significant gap in true chemical understanding [4].
Domain-adaptive pretraining enables efficient adaptation of existing SMILES-based models to process SELFIES representations without expensive pretraining from scratch [7].
Table 2: Key Experimental Parameters for Domain Adaptation to SELFIES
| Parameter | Specification | Rationale |
|---|---|---|
| Base Model | ChemBERTa-zinc-base-v1 | A transformer pretrained on SMILES strings from the ZINC database [7]. |
| Adaptation Data | ~700,000 molecules from PubChem | Sufficient data to learn SELFIES syntax while maintaining computational efficiency [7]. |
| Data Format | SELFIES strings | Target representation format for model adaptation [7]. |
| Training Objective | Masked Language Modeling (MLM) | Standard self-supervised objective where the model learns to predict randomly masked tokens [7]. |
| Computational Resource | Single NVIDIA A100 GPU | Enables completion within 12 hours, making the approach accessible [7]. |
| Tokenizer | Original SMILES tokenizer | No vocabulary changes, testing adaptability despite syntactic differences [7]. |
Methodology:
Tokenizer Feasibility Check: Verify that the original SMILES tokenizer can process SELFIES strings without excessive unknown tokens ([UNK]) or truncation, as SELFIES shares much of its atomic vocabulary with SMILES [7].
Continued Pretraining: Perform masked language modeling on the SELFIES-formatted PubChem dataset using the original model architecture and tokenizer [7].
Evaluation: Assess the adapted model on downstream tasks using both embedding-level analysis (e.g., t-SNE visualization, property prediction with frozen embeddings) and full fine-tuning on benchmark datasets like ESOL, FreeSolv, and Lipophilicity [7].
This protocol demonstrates that a SMILES-pretrained transformer can be successfully adapted to SELFIES, achieving competitive performance with significantly reduced computational cost compared to training models from scratch [7].
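The masked language modeling objective at the heart of the continued-pretraining step can be illustrated with a minimal, stdlib-only masking function. This is a sketch of the objective only, not the study's training pipeline (which uses a transformer model and tokenizer); the SELFIES token list is an example input.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Return (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere, as in standard MLM training."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)      # ignored by the loss
    return masked, labels

# Benzene expressed as SELFIES tokens.
tokens = ["[C]", "[=C]", "[C]", "[=C]", "[C]", "[=C]", "[Ring1]", "[=Branch1]"]
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

During adaptation, the model minimizes the prediction loss over the masked positions only, which is how it learns SELFIES syntax without any labeled data.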
Table 3: Key Research Tools and Resources for Molecular Representation Experiments
| Resource Name | Type | Primary Function | Relevance to Representation Learning |
|---|---|---|---|
| MoleculeNet [2] [7] [4] | Benchmark Dataset Collection | Standardized evaluation datasets and metrics for molecular machine learning. | Provides consistent benchmarks (e.g., ESOL, SIDER, FreeSolv) for fair comparison of different representations and models. |
| SIDER [2] | Specialized Dataset | Curates data on drug side effects organized by organ class. | Critical for evaluating performance on real-world biomedical prediction tasks relevant to drug safety. |
| RDKit [7] | Cheminformatics Toolkit | Open-source software for cheminformatics and molecular manipulation. | Converts molecules between formats, computes descriptors, and handles SMILES/SELFIES conversion for data preprocessing. |
| PubChem [7] | Chemical Database | Public repository of chemical substances and their biological activities. | Source of large-scale, diverse molecular structures for pretraining and domain-adaptive pretraining. |
| ZINC-15 [4] | Commercial Compound Database | Curated collection of commercially available compounds for virtual screening. | Common source of millions of SMILES strings for large-scale pretraining of chemical language models. |
| ECFP Fingerprints [32] | Traditional Molecular Representation | Circular fingerprints encoding molecular substructures into fixed-length bit vectors. | Strong traditional baseline for comparing the performance of modern neural embedding approaches. |
| AMORE Framework [4] | Evaluation Framework | Zero-shot method for assessing representation robustness using SMILES augmentations. | Tests whether models learn true chemical semantics or merely memorize textual patterns in SMILES strings. |
Experimental evidence indicates that SELFIES representations consistently match or surpass SMILES across various model architectures, including LSTMs and Transformers, particularly when proper data augmentation or domain adaptation strategies are employed [2] [7]. The guaranteed validity of SELFIES strings offers significant advantages for generative applications and robust property prediction. However, traditional fingerprints like ECFP remain surprisingly competitive baselines that should not be overlooked in benchmarking studies [32].
Future research directions include developing more sophisticated cross-modal representations that integrate string-based, graph-based, and 3D structural information [31] [33]. Methods like OmniMol, which uses a hypergraph structure to model relationships between molecules and multiple properties, show promise for handling imperfectly annotated data and improving model explainability [33]. As the field progresses, standardized evaluation frameworks like AMORE [4] and comprehensive benchmarking [32] will be crucial for accurately measuring progress in molecular representation learning for drug discovery.
The accurate prediction of molecular properties is a critical step in accelerating drug discovery, particularly for complex targets like the Human Immunodeficiency Virus (HIV), toxicological endpoints, and the blood-brain barrier (BBB). The choice of molecular representation—how a chemical structure is converted into a computable format—fundamentally shapes the performance of these predictive models. Simplified Molecular Input Line Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) are two prominent string-based representations. This guide provides an objective comparison of their performance, supported by experimental data, to help researchers select the optimal representation for their specific property prediction tasks. Within the broader thesis of evaluating molecular representations, the evidence indicates that while SMILES, when paired with advanced tokenization, can achieve high accuracy, SELFIES offers inherent advantages in robustness for generative tasks.
Quantitative data from recent studies provide a direct comparison of model performance using SMILES and SELFIES representations across key property prediction tasks. The following table summarizes the core findings from a benchmark study that evaluated these representations using different tokenization strategies in BERT-based models.
Table 1: Performance Comparison (ROC-AUC) of SMILES and SELFIES with Different Tokenization Methods [1]
| Property Prediction Task | Representation | Tokenization Method | ROC-AUC Score |
|---|---|---|---|
| HIV | SMILES | Atom Pair Encoding (APE) | 0.826 |
| HIV | SMILES | Byte Pair Encoding (BPE) | 0.813 |
| HIV | SELFIES | Byte Pair Encoding (BPE) | 0.810 |
| Toxicity (Tox21) | SMILES | Atom Pair Encoding (APE) | 0.856 |
| Toxicity (Tox21) | SMILES | Byte Pair Encoding (BPE) | 0.842 |
| Toxicity (Tox21) | SELFIES | Byte Pair Encoding (BPE) | 0.840 |
| Blood-Brain Barrier (BBB) Penetration | SMILES | Atom Pair Encoding (APE) | 0.951 |
| Blood-Brain Barrier (BBB) Penetration | SMILES | Byte Pair Encoding (BPE) | 0.944 |
| Blood-Brain Barrier (BBB) Penetration | SELFIES | Byte Pair Encoding (BPE) | 0.938 |
The comparative data presented in Table 1 were generated through a standardized experimental protocol. The following workflow details the key steps involved in such a benchmarking study.
Figure 1: Workflow for benchmarking molecular representations.
Dataset Curation: The study utilized three publicly available benchmark datasets covering HIV inhibition, Tox21 toxicity endpoints, and blood-brain barrier penetration.
Molecular Representation and Tokenization: Each molecule was encoded as a SMILES or SELFIES string and tokenized with either Atom Pair Encoding (APE) or Byte Pair Encoding (BPE).
Model Training and Evaluation: BERT-based models were trained on the tokenized strings and evaluated on each classification task using ROC-AUC.
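Since ROC-AUC is the benchmark's scoring metric, it is worth recalling what it measures: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A compact rank-based (Mann-Whitney) sketch, shown stdlib-only for illustration; in practice one would use a library implementation such as scikit-learn's:

```python
def roc_auc(labels, scores):
    """ROC-AUC = P(score of a random positive > score of a random negative),
    counting ties as 0.5. labels are 0/1, scores are model outputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(roc_auc(labels, scores), 3))  # → 0.889
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the BBB scores above 0.93 in Table 1 indicate strong discriminative power.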
To implement the experimental protocols described, researchers rely on a suite of computational tools, datasets, and platforms. The following table outlines key resources for conducting research in this field.
Table 2: Essential Research Toolkit for Molecular Property Prediction
| Tool/Resource Name | Type | Primary Function | Relevance to Research |
|---|---|---|---|
| SMILES | Molecular Representation | A line notation for representing molecular structures using ASCII strings [1]. | The established standard for string-based molecular input; used as a baseline for comparison. |
| SELFIES | Molecular Representation | A 100% robust molecular representation where every string is syntactically and semantically valid [1]. | Critical for generative models and de novo molecular design to avoid invalid outputs. |
| Atom Pair Encoding (APE) | Tokenization Method | A novel tokenizer that breaks down SMILES/SELFIES by preserving atom-bond pairs [1]. | Enhances model performance by maintaining chemical context, as shown in benchmark studies. |
| Byte Pair Encoding (BPE) | Tokenization Method | A sub-word tokenization algorithm that iteratively merges frequent character sequences [1]. | A standard, general-purpose tokenizer used as a baseline for comparing tokenization strategies. |
| ToxCast Database | Toxicological Database | One of the largest public toxicity databases, providing high-throughput screening data for thousands of chemicals [34]. | A primary data source for training and validating predictive toxicology models. |
| RDKit | Cheminformatics Software | An open-source toolkit for cheminformatics, including computation of molecular descriptors and fingerprinting [36]. | Used for fundamental tasks like converting molecular formats, calculating descriptors, and handling chemical data. |
| BERT (Bidirectional Encoder Representations from Transformers) | Deep Learning Architecture | A transformer-based model pre-trained using a masked language modeling objective, adaptable to chemical languages [1]. | Serves as the backbone deep learning model for many state-of-the-art property prediction tasks. |
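Tokenization schemes like APE and BPE build larger units on top of an atom-level token stream. As a hedged point of reference (the cited studies' exact tokenizers may differ), the following sketch uses the widely adopted atom-level SMILES regular expression from the reaction-prediction literature:

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter halogens, organic
# subset atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/bond/ring tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1"))  # aspirin-like ester fragment
```

Sub-word methods such as BPE then merge frequent adjacent tokens into larger units, while APE constrains merges to preserve chemically meaningful atom-bond pairs.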
The choice between SMILES and SELFIES for molecular property prediction is not a simple binary decision. For supervised learning tasks like classification (e.g., predicting HIV activity, toxicity, or BBB penetration), SMILES representation coupled with a chemically-aware tokenizer like Atom Pair Encoding (APE) currently delivers the highest predictive accuracy, as evidenced by its superior ROC-AUC scores. However, for generative tasks such as the AI-driven design of new anti-HIV candidates [37] or de novo molecular optimization, the inherent robustness of SELFIES makes it a more reliable choice, as it guarantees that all generated outputs are valid molecules. Therefore, researchers should align their choice of molecular representation with their specific project goals: SMILES with advanced tokenization for maximum discriminative power, and SELFIES for robust and efficient exploration of chemical space in generative applications.
The choice of molecular representation is a foundational element in the application of artificial intelligence to de novo drug design, directly influencing a model's ability to generate valid, novel, and optimized chemical structures [1] [38]. Simplified Molecular-Input Line-Entry System (SMILES) has long been the standard string-based representation, but its susceptibility to generating invalid outputs has prompted the development of more robust alternatives [5]. Self-Referencing Embedded Strings (SELFIES) was introduced to guarantee 100% molecular validity through a grammar-based approach that ensures every string corresponds to a syntactically and semantically correct molecule [16] [5]. This guide provides an objective comparison of the performance of SMILES, SELFIES, and emerging fragment-based representations across key generative tasks, synthesizing current experimental data to inform researchers and drug development professionals.
The following tables summarize key performance metrics from recent studies evaluating SMILES, SELFIES, and other representations in generative tasks.
Table 1: Performance Metrics in Generative Modeling Tasks
| Representation | Theoretical Validity | Novelty (vs. Training Set) | Uniqueness (in Generated Set) | Diversity (FCD Score) | Key Strengths |
|---|---|---|---|---|---|
| SMILES | ~85-96% [39] | High (>90% in STONED) [16] | Varies by model [40] | Moderate | High interpretability, established use |
| SELFIES | ~100% [16] [5] | High (>90% in STONED) [16] | High with GA [16] | Moderate to High [41] | Guaranteed validity, enables new algorithms |
| t-SMILES (Fragment) | ~100% [38] | Higher than SMILES/SELFIES [38] | High [38] | Higher than SMILES/SELFIES [38] | Multiscale representation, reduces overfitting |
Table 2: Performance in Goal-Directed Optimization (e.g., DRD2, JNK3, GSK3β Activity)
| Representation | Model Type | Diverse Hits (#Circles Metric) | Sample Efficiency | Notable Findings |
|---|---|---|---|---|
| SMILES | LSTM (REINVENT) | High [41] | High [41] | Best overall performance in constrained benchmarks [41] |
| SELFIES | Genetic Algorithm (STONED) | Moderate [41] | Moderate [41] | Robustness allows simple algorithms to perform well [16] [41] |
| Graph | GraphGA | Lower [41] | Lower [41] | - |
| t-SMILES | Transformer | N/A | N/A | Significantly outperforms SOTA baselines in goal-directed tasks on ChEMBL [38] |
Standardized benchmarks like GuacaMol and MOSES are used to evaluate generative models [41] [39]. Key evaluation protocols cover the validity, uniqueness, novelty, and diversity of the generated molecules.
A representative study [40] provides a clear protocol for comparing representations.
The STONED genetic algorithm showcases a methodology that leverages SELFIES' robustness: because every mutated string still decodes to a valid molecule, random point mutations can be applied freely during exploration [16] [41].
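The mutation step at the core of a STONED-style search can be sketched as follows. The tiny alphabet here is illustrative only; the real algorithm mutates over the full semantically robust SELFIES alphabet and decodes candidates with the selfies package.

```python
import random
import re

# Illustrative (not exhaustive) SELFIES token alphabet for mutations.
ALPHABET = ["[C]", "[=C]", "[O]", "[N]", "[Branch1]", "[Ring1]"]

def split_selfies(s):
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", s)

def mutate(selfies_str, rng):
    """Replace one randomly chosen token with a random alphabet token.
    Every result remains a decodable SELFIES string by construction."""
    tokens = split_selfies(selfies_str)
    i = rng.randrange(len(tokens))
    tokens[i] = rng.choice(ALPHABET)
    return "".join(tokens)

rng = random.Random(42)
parent = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"  # benzene in SELFIES
child = mutate(parent, rng)
```

With SMILES, the same blind substitution would frequently produce unparseable strings; the guarantee that any token sequence decodes to a molecule is what lets such simple operators perform well.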
The following diagram illustrates a generalized workflow for de novo molecule design, integrating the choice of molecular representation and key validation steps.
Diagram 1: Generalized workflow for de novo molecule design with evaluation loops.
Table 3: Essential Tools for Experimentation in De Novo Molecular Design
| Tool / Resource | Type | Primary Function | Relevance to Representation Comparison |
|---|---|---|---|
| RDKit | Software Library | Cheminformatics and machine learning. | Calculates molecular properties (QED, SA), validates generated SMILES, and handles standardization [40] [39]. |
| ChEMBL | Database | Curated database of bioactive molecules. | Primary source of high-quality training and benchmarking data for drug discovery projects [40] [42]. |
| SELFIES Python Package | Library | Encoder/decoder for SELFIES. | Converts SMILES to SELFIES and back, enabling direct experimentation with the representation [16]. |
| t-SMILES Framework | Algorithmic Framework | Generates fragment-based molecular strings. | Provides a multi-code system for robust, multiscale molecular representation [38]. |
| GuacaMol / MOSES | Benchmarking Suite | Standardized frameworks for evaluating generative models. | Provides metrics and baselines for fair comparison of models using different representations [41]. |
| DRAGONFLY | Deep Learning Model | Interactome-based de novo design. | Example of a model that uses SMILES and integrates both ligand and 3D protein structure information [42]. |
| STONED Algorithm | Genetic Algorithm | Efficient combinatorial exploration of chemical space. | Showcases the use of SELFIES in a non-deep-learning approach, leveraging its robustness for random mutations [16] [41]. |
The experimental data indicates that no single molecular representation is universally superior across all tasks. SMILES remains a strong candidate, particularly in autoregressive models like LSTMs, due to its established use and high performance in goal-directed optimization [41]. However, its primary drawback is the non-negligible rate of invalid generation, which can waste computational resources [39].
SELFIES' key advantage is its guaranteed 100% validity, which simplifies model architectures and enables powerful new algorithms like high-mutation-rate genetic algorithms (STONED) and denser latent spaces in VAEs [16]. This robustness makes it exceptionally valuable for applications where validity is paramount and for exploring chemical space more aggressively.
Emerging fragment-based representations like t-SMILES show significant promise by offering a multiscale approach that captures higher-order structural information [38]. They demonstrate superior performance in many benchmarks, including the ability to avoid overfitting on low-resource datasets and achieving higher novelty while maintaining property distributions.
In conclusion, the selection of a molecular representation should be guided by the specific task. For maximum performance in goal-directed optimization with deep learning, SMILES-based models are highly competitive. For applications requiring maximum robustness, exploration, or simplicity of implementation, SELFIES is an excellent choice. Fragment-based representations represent the cutting edge, offering a powerful and flexible paradigm for the next generation of generative models in drug discovery.
Generative deep learning has revolutionized de novo drug design, enabling the on-demand generation of molecules with tailored properties. The success of these models heavily relies on the molecular representation used. String-based notations, particularly the Simplified Molecular Input Line Entry System (SMILES), have played a pivotal role due to their ability to encode molecular graphs as text sequences [43]. However, SMILES and other atom-level representations like SELFIES (SELF-referencing Embedded Strings) have inherent limitations. They often distribute information about chemical fragments across the string, can struggle with capturing chirality effectively, and may generate molecules with challenging synthetic accessibility [43] [13].
To address these challenges, the field is undergoing a paradigm shift from atom-level to fragment-based representations, mirroring the evolution in natural language processing from character- to word-level models [43]. This guide focuses on two advanced fragment-based representations: fragSMILES and Group SELFIES. We will objectively compare their performance against traditional methods and each other, providing the experimental data and protocols needed for informed adoption in computational drug discovery.
The fragSMILES algorithm constructs a fragment-based string through a three-phase process of disassembling, graph reduction, and string conversion [43] [44].
Fragment connection points are annotated with indexed tags (e.g., <index>) to track connections, and chirality is specified with suffix tags (e.g., <2R>) [43]. This process creates a "chemical-word"-level representation where each token corresponds to a meaningful chemical building block.
Group SELFIES builds upon the SELFIES framework, which is guaranteed to generate 100% valid molecules by design. Its key innovation is the introduction of group tokens that represent entire functional groups or substructures, thereby embedding chemical inductive biases directly into the representation [45].
While the exact fragmentation methodology is less detailed in the available literature compared to fragSMILES, Group SELFIES maintains the chemical robustness guarantees of its parent SELFIES. Any string generated, even when using group tokens, will always correspond to a chemically valid molecule. This hybrid approach allows the representation to leverage the benefits of both fragment-based semantics and guaranteed validity [45].
In de novo design, the goal is to train a model that can generate novel, valid, and diverse molecules that mirror the chemical and property space of the training data. Experiments comparing representations on the ZINC-250k dataset reveal significant differences in encoding compactness, which impacts model learning.
Table 1: Encoding Compactness on ZINC-250k
| Representation | Average Token Length | Vocabulary Size |
|---|---|---|
| SMILES | 44 | Not Specified |
| SELFIES | 37 | Not Specified |
| Group SELFIES | 30 | Not Specified |
| fragSMILES | 17 | 5,869 |
Data from [43] shows that fragSMILES achieves the most compact encoding, with an average sequence length less than half that of SMILES. This compactness, coupled with a controlled vocabulary size, simplifies the learning task for language models.
Further studies trained Recurrent Neural Networks (RNNs) on datasets from ChEMBL and ZINC, evaluating the quality of the generated molecules. fragSMILES demonstrated a strong ability to explore chemical space and generate molecules with desirable scaffold properties [43].
Chemical reaction prediction is a stringent test for a representation's ability to capture chemically meaningful transformations. A systematic study using the USPTO database (over 1 million reactions) and Transformer models provides clear performance metrics for forward and retro-synthesis [44].
Table 2: Reaction Prediction Accuracy on USPTO Dataset (Top-1, %)
| Representation | Forward Prediction Validity | Forward Prediction Accuracy | Retrosynthesis Validity | Retrosynthesis Accuracy |
|---|---|---|---|---|
| SMILES | 96.3% | 49.9% | 41.7% | Not Specified |
| SELFIES | 96.4% | 21.0% | 79.7% | Not Specified |
| SAFE | 92.8% | 30.2% | 43.6% | Not Specified |
| t-SMILES | 100.0% | 6.1% | 73.3% | Not Specified |
| fragSMILES | 97.3% | 53.4% | 55.8% | Not Specified |
fragSMILES achieved the highest top-1 accuracy for forward reaction prediction, significantly outperforming SMILES and other representations. It also maintained high validity rates, second only to t-SMILES. The study highlighted fragSMILES's superior capacity to handle reactions involving stereocenters, a critical aspect of synthesis planning [44].
A key advertised advantage of fragSMILES is its explicit and consistent annotation of chiral centers, overcoming a known limitation of SMILES strings [43]. In the fragSMILES string, chirality is indicated as a suffix tag on connector atom indices (e.g., <2R>) or as a special suffix for non-connector atoms. This ensures that stereochemical information is preserved unambiguously, regardless of the graph traversal order used to generate the string [43].
Both fragSMILES and Group SELFIES, by virtue of being fragment-based, implicitly bias the generation process towards molecules built from common chemical building blocks. This inherently improves the synthetic accessibility of the generated molecules compared to those from atom-level models, which can produce chemically awkward or unstable structures [43].
The following workflow, derived from the official fragSMILES GitHub repository, outlines the steps for reproducing key experiments [46]:
Use the chemicalgof package or similar functions to disassemble each molecule into its fragSMILES representation; the default cleavage rule targets exocyclic single bonds. A similar protocol is used to benchmark representations for reaction prediction, training Transformer models on the USPTO dataset for forward and retrosynthesis tasks [44].
Table 3: Key Computational Tools and Datasets
| Item Name | Type | Function in Research |
|---|---|---|
| ZINC-250k | Dataset | A curated database of ~250,000 drug-like molecules, commonly used for benchmarking generative models and representation learning [43]. |
| ChEMBL | Dataset | A large-scale database of bioactive molecules with drug-like properties, used for training models in a practical drug discovery context [47] [46]. |
| USPTO | Dataset | Contains over 1 million chemical reaction patents, serving as the standard benchmark for forward and retrosynthesis prediction tasks [44]. |
| Transformer Architecture | Model | The de facto standard neural network architecture for sequence-to-sequence tasks like reaction prediction and molecular generation [44]. |
| Word-RNN | Model | A Recurrent Neural Network variant adapted to process "word-level" (fragment-based) tokens, used for training generative models on fragSMILES [46]. |
| chemicalgof | Software | A Python package used for the molecular decomposition and graph reduction steps required to generate fragSMILES strings [46]. |
| Group SELFIES Library | Software | The official open-source implementation of Group SELFIES, used to create and manipulate molecules with group tokens [45]. |
The experimental data clearly differentiates the strengths of fragSMILES and Group SELFIES. fragSMILES excels in tasks requiring high predictive accuracy and explicit stereochemical control, such as reaction prediction and chiral-aware molecule generation. Its compact, interpretable representation makes it a powerful tool for complex biochemical design tasks [43] [44].
Group SELFIES, on the other hand, provides a compelling option for applications where guaranteed molecular validity is the highest priority. Its foundation on the robust SELFIES syntax ensures that every generated string is chemically valid, reducing the need for post-processing and making it particularly suitable for autonomous molecular discovery pipelines [45].
The choice between these representations is not a matter of which is universally better, but which is more appropriate for the specific research goal. For synthesis-oriented projects where chiral correctness is paramount, fragSMILES holds an edge. For robust, high-throughput generation of valid molecules, Group SELFIES is a strong candidate. As the field progresses, hybrid approaches that leverage the strengths of multiple representations may offer the most powerful path forward for generative drug discovery.
In computational chemistry and drug discovery, the representation of molecules as machine-readable strings is foundational for applying deep learning methodologies. The non-uniqueness problem—whereby a single molecule can be represented by multiple valid string sequences—presents a significant challenge for chemical language models (CLMs). This phenomenon is particularly prevalent in the Simplified Molecular Input Line Entry System (SMILES), where the same molecular structure can yield different string representations depending on the starting atom and the traversal path of the molecular graph [47]. For example, a simple molecule like benzene can be written as "c1ccccc1" or "C1=CC=CC=C1" [1]. This lack of bijective mapping can impair model performance by treating identical molecules as distinct entities, thereby confusing the learning process.
To address this challenge, researchers have developed two complementary approaches: canonicalization and augmentation. Canonicalization provides a deterministic method to generate a single, unique representation for each molecule, thereby enforcing consistency. Conversely, augmentation deliberately leverages non-uniqueness to artificially expand training datasets, exposing models to the varied representations they might encounter in real-world applications. A more recent development, the SELF-referencing Embedded String (SELFIES) representation, offers a fundamentally different approach by guaranteeing that every possible string corresponds to a valid molecule, though it does not fully resolve the non-uniqueness issue [48]. This guide provides a comprehensive comparison of these strategies, supported by experimental data, to inform researchers and drug development professionals in selecting appropriate methodologies for their specific applications.
The core of the non-uniqueness problem lies in the fundamental design of the molecular representations themselves. SMILES and SELFIES take philosophically different approaches to encoding chemical structures, each with distinct implications for machine learning.
SMILES representations, while human-readable and widely adopted, are inherently non-unique and can produce semantically invalid strings when generated by models [1]. A SMILES string is generated by performing a depth-first traversal of the molecular graph, leading to multiple valid representations for the same compound. Furthermore, standard SMILES do not inherently enforce chemical validity rules, meaning that small changes or model errors can produce strings that correspond to impossible molecules, such as a carbon atom with five bonds [48].
SELFIES was designed specifically to address these limitations for machine learning applications. Its core innovation is a formal grammar based on a finite state machine that tracks available valence bonds during the string interpretation process [48]. This design guarantees that every possible SELFIES string, without exception, represents a syntactically and chemically valid molecule. In experimental validations, when SMILES and SELFIES representations of MDMA were subjected to random mutations, SMILES strings rapidly degraded to only 26.6% validity after just one mutation, while SELFIES maintained 100% validity across mutations [48]. However, it is crucial to note that SELFIES does not solve the non-uniqueness problem; a single molecule can still have multiple valid SELFIES representations. Its primary contribution is ensuring robustness against invalid structure generation.
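The fragility of SMILES under mutation can be demonstrated with a toy experiment. The check below is purely syntactic (balanced parentheses, paired ring-closure digits) and is a weak proxy for the full chemical validation via RDKit used in the cited study; it catches only syntactic breakage, not valence errors, so real validity rates would be lower still.

```python
import random

# Characters a random mutation may substitute in; illustrative only.
CHARSET = "CNOcno()=#123"

def mutate_smiles(smiles, rng):
    """Replace one randomly chosen character with a random one."""
    i = rng.randrange(len(smiles))
    return smiles[:i] + rng.choice(CHARSET) + smiles[i + 1:]

def looks_syntactically_ok(smiles):
    """Naive check: balanced parentheses and evenly paired ring digits."""
    depth = 0
    ring_digits = {}
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return depth == 0 and all(c % 2 == 0 for c in ring_digits.values())

rng = random.Random(7)
smiles = "CC(=O)Oc1ccccc1"
survivors = sum(looks_syntactically_ok(mutate_smiles(smiles, rng))
                for _ in range(1000))
print(f"{survivors / 10:.1f}% pass the syntactic check after one mutation")
```

An equivalent experiment on SELFIES needs no check at all: any token substitution decodes to a valid molecule by construction.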
Table 1: Fundamental Comparison of SMILES and SELFIES Representations
| Feature | SMILES | SELFIES |
|---|---|---|
| Uniqueness | Non-unique; multiple representations per molecule | Non-unique; multiple representations per molecule |
| Validity Guarantee | No - models can generate invalid strings | Yes - every string is chemically valid |
| Primary Strength | Human-readable, widespread adoption | Robustness for generative models |
| Primary Limitation | Invalid generation, stereochemistry challenges | More complex tokenization, newer ecosystem |
| Tokenization | Atoms, bonds, branches, rings | Atoms, bonds, and specialized derivation rules (e.g., [Branch1], [Ring1]) |
Canonicalization addresses non-uniqueness by establishing a deterministic algorithm that produces a single, canonical representation for each molecular structure. This approach provides consistency but may sacrifice the representational diversity that can benefit model training.
The canonicalization process typically employs algorithms that assign a unique ordering to atoms within a molecule, often implemented through tools like RDKit. The process begins with hydrogen-suppressed molecular graphs where hydrogens are implicitly represented. The algorithm then assigns canonical labels to each atom based on invariant properties such as atomic number, degree, and aromaticity, breaking ties through iterative refinement until a unique ordering is achieved [49]. The SMILES string is generated by traversing the graph according to this canonical atom ordering, ensuring that the same molecule always produces the identical string representation. For SELFIES, while a formal canonicalization method isn't as established, conversion typically occurs through an intermediate canonical SMILES string to ensure consistency [48].
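The iterative-refinement idea behind canonical atom ordering can be sketched on a toy molecular graph: start from simple atom invariants and repeatedly re-rank each atom by its own rank plus the sorted ranks of its neighbors until the partition stops changing. Real canonicalizers (e.g., RDKit's) add tie-breaking and many more invariants; this shows only the core loop.

```python
def rank_of(keys):
    """Map each key to its rank in the sorted set of distinct keys."""
    order = sorted(set(keys))
    return [order.index(k) for k in keys]

def canonical_ranks(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    # Initial invariant: (element, degree).
    keys = [(atoms[i], len(neighbours[i])) for i in range(len(atoms))]
    ranks = rank_of(keys)
    while True:
        # Refine: combine own rank with the multiset of neighbour ranks.
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbours[i])))
                for i in range(len(atoms))]
        new_ranks = rank_of(keys)
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks

# Propan-1-ol heavy atoms, C-C-C-O: refinement separates the two middle
# carbons that initially share the same (element, degree) invariant.
print(canonical_ranks(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]))  # → [0, 1, 2, 3]
```

Emitting the SMILES string by traversing atoms in this canonical order is what guarantees the same molecule always produces the same string.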
Recent systematic evaluations demonstrate that canonicalization plays a valuable role in training efficiency and model interpretability. Studies have shown that while non-canonical SMILES can provide benefits through inherent augmentation in certain scenarios, canonicalization is recommended when computational resources are limited, as it improves training efficiency without substantially sacrificing performance [49].
In probing experiments that analyzed the latent spaces of chemical language models, configurations using canonical SMILES with atomwise tokenization produced more chemically structured embeddings, suggesting a deeper internalization of chemical context [49]. This organization made the embeddings more interpretable and semantically meaningful, as evidenced by better performance in molecular property prediction tasks. For standard prediction tasks where robustness against diverse string representations is not critical, canonical SMILES input provides a practical and reliable setup.
Augmentation takes a contrasting approach by deliberately leveraging non-uniqueness to create multiple representations of the same molecule, thereby artificially expanding training datasets and encouraging models to learn invariant features.
The most established augmentation technique is SMILES enumeration (or randomization), wherein multiple valid SMILES strings are generated for the same molecule by varying the starting atom and traversal direction [47]. This approach has demonstrated significant benefits for the quality of de novo molecule design, particularly in low-data scenarios where it helps prevent overfitting and improves generalizability [47]. Beyond generative tasks, SMILES enumeration has improved model quality in diverse applications including organic synthesis planning, bioactivity prediction, and supramolecular chemistry [47].
Recent research has introduced more sophisticated augmentation techniques that move beyond simple enumeration, including token deletion, atom masking, bioisosteric substitution, and self-training.
Table 2: Performance of Augmentation Strategies Across Dataset Sizes [47]
| Augmentation Method | Optimal Probability (p) | Validity on Small Datasets (<2500 molecules) | Validity on Large Datasets (>7500 molecules) | Notable Strength |
|---|---|---|---|---|
| SMILES Enumeration (Baseline) | N/A | High | High | Reliable performance across scenarios |
| Token Deletion | 0.05 | Moderate | Declining with size | Creates novel scaffolds |
| Atom Masking | 0.05 | High | High | Learns properties in low-data regimes |
| Bioisosteric Substitution | 0.15 | Moderate | High | Incorporates medicinal chemistry knowledge |
| Self-Training | N/A | High | High | Consistently higher validity than enumeration |
The evaluation of these augmentation strategies follows a systematic methodology to assess their impact on model performance across various conditions. In a typical experimental setup, researchers train chemical language models (often recurrent neural networks with Long Short-Term Memory units) using augmented datasets while varying key parameters [47]: the probability of perturbation (p) at different levels (0.05, 0.15, 0.30), the augmentation fold (1-, 3-, 5-, and 10-fold expansion beyond the original dataset size), and training set size (ranging from 1,000 to 10,000 molecules from sources like ChEMBL). Models are then evaluated on their ability to learn chemical syntax through metrics including validity (percentage of generated SMILES that represent chemically valid molecules), uniqueness (percentage of non-duplicated molecules), and novelty (percentage of generated molecules not present in the training set) [47].
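The three generation metrics reduce to a few lines of set arithmetic. The validity predicate below is a deliberately naive placeholder (a real pipeline would use an RDKit parse check such as `Chem.MolFromSmiles(s) is not None`), and the denominators follow the common convention of uniqueness-among-valid and novelty-among-unique:

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty for a batch of generated strings.
    `is_valid` is a caller-supplied predicate; in practice this would be a
    cheminformatics parse check rather than the toy used below."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }

gen = ["CCO", "CCO", "CCN", "C(("]   # last string is syntactically broken
train = {"CCO"}
# Toy validity check: balanced parentheses only (stand-in for a real parser).
m = generation_metrics(gen, train, is_valid=lambda s: s.count("(") == s.count(")"))
print(m)  # validity 0.75, uniqueness ~0.667, novelty 0.5
```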
The choice between canonicalization and augmentation strategies involves trade-offs that become apparent when evaluating performance across different computational chemistry tasks.
In generative tasks for de novo molecular design, augmentation strategies generally outperform canonicalization approaches. Experimental results demonstrate that most augmentation methods achieve higher validity rates compared to non-augmented baselines, with the beneficial effects being more pronounced at higher augmentation folds and with smaller training set sizes [47]. Notably, self-training augmentation consistently performs better than enumeration across all dataset sizes, while atom masking shows particular promise for learning desirable physicochemical properties in very low-data regimes [47]. Token deletion, while sometimes producing lower validity rates, demonstrates unique strengths in fostering the creation of novel molecular scaffolds.
For molecular property prediction tasks, the comparative landscape is more nuanced. A 2025 study systematically evaluated how design choices—including representation format (SMILES vs. SELFIES) and tokenization strategy—affect performance on MoleculeNet benchmarks [49]. While downstream task performance was often similar across configurations, substantial differences emerged in the structure and interpretability of internal representations. The study found that a RoBERTa-based model with canonical SMILES input and atomwise tokenization provides a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it [49].
Table 3: Performance Comparison on Molecular Property Prediction Tasks (ROC-AUC) [49]
| Model Configuration | BBBP | BACE | HIV | Tox21 | ClinTox |
|---|---|---|---|---|---|
| RoBERTa + Canonical SMILES | 0.901 | 0.832 | 0.779 | 0.801 | 0.913 |
| RoBERTa + SELFIES | 0.892 | 0.829 | 0.768 | 0.794 | 0.902 |
| BART + Canonical SMILES | 0.895 | 0.828 | 0.772 | 0.798 | 0.908 |
| BART + SELFIES | 0.887 | 0.821 | 0.763 | 0.789 | 0.897 |
A critical aspect of addressing non-uniqueness is developing models that recognize chemically equivalent representations as identical. The Augmented Molecular Retrieval (AMORE) framework was specifically designed to evaluate this capability by measuring how stable a model's internal representations are across different SMILES variants of the same molecule [4]. Experiments using AMORE revealed that existing chemical language models often lack robustness, with their embedding spaces significantly altered by SMILES augmentations that should be invariant transformations [4]. This finding underscores that standard NLP metrics are insufficient for chemical tasks and that targeted approaches are needed to ensure models learn true molecular invariance rather than superficial text patterns.
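A simplified proxy for this kind of robustness check is to embed several SMILES variants of one molecule and measure how tightly the embeddings cluster. AMORE's actual evaluation is retrieval-based [4]; the mean pairwise cosine score below is only a minimal sketch of the invariance idea:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def invariance_score(variant_embeddings):
    """Mean pairwise cosine similarity across embeddings of different SMILES
    variants of one molecule; 1.0 means the model is perfectly invariant."""
    n = len(variant_embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(variant_embeddings[i], variant_embeddings[j])
               for i, j in pairs) / len(pairs)

# Identical embeddings for every variant -> perfectly invariant model.
print(invariance_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]))  # 1.0
# Orthogonal embeddings -> the model treats the variants as unrelated.
print(invariance_score([[1.0, 0.0], [0.0, 1.0]]))              # 0.0
```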
Implementing these strategies requires specific computational tools and resources. The following table catalogues essential "research reagents" for working with molecular representations in machine learning applications.
Table 4: Essential Research Reagents for Molecular Representation Research
| Tool/Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| RDKit | Software Library | Cheminformatics and molecule manipulation | Canonicalization, descriptor calculation, format conversion |
| SELFIES Python Library | Software Library | Conversion and manipulation of SELFIES strings | Generating guaranteed-valid molecular representations |
| SwissBioisostere Database | Database | Bioisosteric replacement patterns | Bioisosteric substitution augmentation |
| PubChem | Database | Large-scale molecular structures and properties | Source of training data and benchmarking compounds |
| ChemBERTa | Pre-trained Model | Transformer-based molecular representation | Baseline model for property prediction tasks |
| ChEMBL | Database | Bioactive molecules with drug-like properties | Training data for drug discovery applications |
| MoleculeNet | Benchmark Suite | Standardized tasks for molecular property prediction | Model evaluation and comparison |
| Hugging Face Transformers | Software Library | NLP model implementations and tokenization | Building and training chemical language models |
The non-uniqueness problem in molecular representations presents both a challenge and an opportunity for computational chemistry and drug discovery. Through systematic evaluation of canonicalization and augmentation strategies, several key recommendations emerge for practitioners:
For generative molecular design tasks, particularly in low-data scenarios, augmentation strategies provide significant benefits, with self-training and atom masking showing particular promise for improving validity and property learning [47]. For molecular property prediction tasks, canonical SMILES with atomwise tokenization offers a reliable and computationally efficient approach, producing well-structured latent spaces without sacrificing performance [49]. When model robustness and invariance are priorities, evaluation frameworks like AMORE should be employed to ensure models truly learn chemical equivalence rather than superficial text patterns [4].
The choice between SMILES and SELFIES involves fundamental trade-offs: SMILES offers maturity and widespread adoption, while SELFIES provides guaranteed validity that is particularly valuable for generative applications. Emerging approaches like domain-adaptive pretraining demonstrate that transformers pretrained on SMILES can be effectively adapted to SELFIES without expensive retraining, offering a practical middle ground [7].
As the field advances, the optimal solution to the non-uniqueness problem will likely involve context-aware strategies that combine the deterministic consistency of canonicalization with the robust generalization enabled by thoughtful augmentation, ultimately enabling more accurate and efficient exploration of the vast chemical space for drug discovery and materials science.
In the field of AI-driven drug discovery and materials science, how molecules are represented in a computer fundamentally shapes the effectiveness of generative models. Traditional representation methods often struggle with syntactic validity, where generated molecular strings fail to correspond to chemically plausible structures. This challenge has propelled the development of SELFIES (Self-Referencing Embedded Strings), a representation specifically designed to guarantee 100% syntactic and chemical validity. Unlike its predecessor SMILES (Simplified Molecular-Input Line-Entry System), whose strings can encode invalid structures with incorrect atom valencies or outright syntax errors, SELFIES incorporates formal grammar rules that ensure every possible string, even a randomly generated one, decodes to a molecule with chemically valid bonds and atoms [16]. This robustness makes SELFIES particularly valuable for generative applications, where maintaining structural validity during exploration of chemical space is paramount for discovering novel therapeutic compounds and functional materials.
The fundamental difference between SMILES and SELFIES lies in their underlying architecture and how they handle chemical constraints. SMILES, developed over 30 years ago, represents molecules as linear strings of ASCII characters, using symbols for atoms, bonds, and parentheses for branching. However, it lacks built-in mechanisms to enforce chemical validity, making it prone to generating impossible structures in AI models [16]. SELFIES addresses this limitation through a novel approach based on formal grammar and finite state automata. This design treats each SELFIES string as a small computer program with minimal memory, localizing non-local features like rings and branches and encoding physical constraints directly into the derivation state [16].
Table 1: Fundamental Differences Between SMILES and SELFIES
| Feature | SMILES | SELFIES |
|---|---|---|
| Validity Guarantee | No inherent guarantee | 100% robust - all strings valid [16] |
| Representation Basis | Atom chains with brackets for branches/rings [1] | Formal grammar (Chomsky type-2) [16] |
| Handling of Branches/Rings | Non-local indicators (parentheses, numbers) [16] | Localized length indicators [16] |
| Chemical Constraint Enforcement | None - can violate valency rules [1] | Built-in valency checks during derivation [16] |
| Human Readability | Moderate | Moderate (similar to SMILES) [16] |
The architectural advantages of SELFIES translate directly to superior performance in validity metrics. When subjected to random mutations—a simulation of operations common in generative models and genetic algorithms—SELFIES maintains nearly perfect validity while SMILES deteriorates significantly. In one striking experiment, random mutations applied to the MDMA molecule (ecstasy) demonstrated that SELFIES consistently produced valid molecular structures, whereas SMILES frequently generated invalid strings that failed to correspond to chemically plausible molecules [16]. This inherent robustness stems from SELFIES' ability to dynamically adjust bond types and ignore impossible connections based on available valency at each step in the decoding process, effectively preventing physically impossible configurations like F=O=F where atoms would exceed their natural bonding capacities [16].
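The valency-capping mechanism can be illustrated with a toy decoder. This is not the real SELFIES grammar, only the constraint idea: each requested bond order is reduced to the free valence still available on both atoms, so an input equivalent to F=O=F degrades gracefully to the valid F-O-F instead of an over-bonded structure:

```python
# Toy illustration of SELFIES-style valency enforcement (not the real grammar).
VALENCE = {"F": 1, "O": 2, "N": 3, "C": 4}

def decode(tokens):
    """tokens: list of (atom_symbol, requested_bond_order_to_previous_atom).
    Returns (atom, actual_bond_order) pairs after capping each bond by the
    free valence remaining on both partner atoms."""
    mol = []
    prev_free = 0                       # free valence left on the previous atom
    for atom, requested in tokens:
        free = VALENCE[atom]
        order = min(requested, prev_free, free) if mol else 0
        mol.append((atom, order))
        prev_free = free - order
    return mol

# "F=O=F": the first "=" is capped to a single bond (F has valence 1),
# leaving O with one free valence, so the second "=" is also capped.
print(decode([("F", 0), ("O", 2), ("F", 2)]))
# [('F', 0), ('O', 1), ('F', 1)]  ->  F-O-F, a chemically valid molecule
```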
Rigorous benchmarking across multiple research studies has demonstrated clear performance differences between SMILES and SELFIES across various applications. The following table synthesizes experimental results from validity, generative performance, and specific application contexts:
Table 2: Experimental Performance Comparison of SMILES vs. SELFIES
| Experiment/Context | SMILES Performance | SELFIES Performance | Notes |
|---|---|---|---|
| Random String Validity | Very low validity rate [16] | 100% validity even for random strings [16] | Critical for generative exploration |
| Genetic Algorithm Applications | Requires sophisticated hand-crafted mutation rules [16] | Enables arbitrary random modifications as mutations [16] | Simplifies algorithm design |
| Chemical Image Recognition | Best overall performance in translation accuracy [50] | Guarantees valid chemical structures [50] | SMILES more accurate, SELFIES more robust |
| Large Language Model Generation | Higher BLEU scores but validity challenges [51] | Lower BLEU scores but perfect validity [51] | Trade-off between fluency and correctness |
| Scaffold Hopping | Limited by validity constraints [3] | Enables broader exploration of chemical space [3] | Better for discovering novel scaffolds |
A 2025 study investigated whether a SMILES-pretrained transformer (ChemBERTa-zinc-base-v1) could be adapted to SELFIES using domain-adaptive pretraining without architectural changes [7]. The methodology involved:
The domain-adapted model outperformed the original SMILES baseline and slightly surpassed ChemBERTa-77M-MLM across most targets, despite a 100-fold difference in pretraining scale [7]. This demonstrates that SELFIES adaptation offers a cost-efficient alternative for molecular property prediction, making robust representation accessible to groups with limited computational resources.
The following diagram illustrates a typical experimental workflow for implementing SELFIES in generative molecular design:
Diagram 1: SELFIES Generative Model Workflow
Implementing SELFIES in research workflows begins with basic conversion between molecular representations. The Python selfies library provides straightforward tools for this purpose:
This interoperability allows researchers to maintain existing SMILES-based pipelines while leveraging SELFIES' robustness advantages for generation tasks [16].
A significant extension of the SELFIES framework is Group SELFIES, which introduces tokens representing functional groups or entire substructures while maintaining the original robustness guarantees [52]. This approach bridges the gap between atomic representations and the way human chemists typically conceptualize molecules—in terms of meaningful substructures and functional groups. In Group SELFIES, tokens can represent common chemical motifs like carboxyl groups or phenyl rings, creating a more compact and chemically intuitive representation [52].
Table 3: Group SELFIES Advantages Over Standard SELFIES
| Feature | Standard SELFIES | Group SELFIES |
|---|---|---|
| Representation Level | Atomic [52] | Fragment-based [52] |
| Chemical Interpretability | Moderate | High (human-like thinking) [52] |
| String Length | Longer atomic sequences | Shorter through group tokens [52] |
| Extended Chirality Support | Limited | Yes, through chiral group tokens [52] |
| Distribution Learning | Baseline | Improved [52] |
Experimental results demonstrate that Group SELFIES improves distribution learning of common molecular datasets and enhances the quality of molecules generated through random sampling compared to regular SELFIES strings [52]. This suggests that incorporating chemical prior knowledge at the fragment level creates a more efficient search space for generative models to explore.
Addressing the practical challenge that large language models (LLMs) currently perform worse with SELFIES than SMILES—likely due to SMILES' longer presence in training data—researchers developed SmiSelf, a cross-chemical language framework [51]. This approach leverages the strengths of both representations:
This hybrid approach guarantees 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics, effectively expanding LLMs' practical applications in biomedicine [51].
Successful implementation of SELFIES in research workflows requires specific computational tools and resources:
Table 4: Essential Research Reagents and Tools for SELFIES Experiments
| Tool/Resource | Type | Function | Access |
|---|---|---|---|
| selfies Python Library | Software Package | Core conversion between SMILES and SELFIES [16] | GitHub/PyPI |
| RDKit | Cheminformatics Toolkit | Fundamental cheminformatics operations and validation | Open Source |
| Transformer Models (BERT-based) | Architecture | Chemical language model implementation [1] | Hugging Face |
| PubChem | Database | Source of molecular structures for training [7] | Public Database |
| MoleculeNet Benchmarks | Evaluation Suite | Standardized performance assessment [7] | Public Repository |
| Group SELFIES Extension | Specialized Package | Fragment-based SELFIES implementation [52] | GitHub |
SELFIES represents a fundamental advancement in molecular representation for generative AI applications, directly addressing the critical challenge of syntactic validity that plagues traditional SMILES-based approaches. Through its foundation in formal grammar and embedded chemical constraints, SELFIES guarantees 100% validity of generated molecules—a crucial feature for automated drug discovery and materials design pipelines. While SMILES may retain advantages in specific applications like chemical image recognition and remains more familiar to many existing AI systems, SELFIES' robustness makes it uniquely suited for generative tasks where exploring uncharted chemical space is essential.
The continuing evolution of SELFIES—through extensions like Group SELFIES for fragment-based design and hybrid frameworks like SmiSelf for LLM compatibility—demonstrates the vibrant innovation in this field. As these representations mature and become more integrated into mainstream chemical AI tools, they promise to significantly accelerate the discovery of novel therapeutic compounds and functional materials by providing a more reliable bridge between computational exploration and chemically plausible reality.
In computational chemistry and drug discovery, the ability of machine learning models to correctly identify the same molecule across different textual representations is a fundamental test of their true chemical understanding. A model that produces vastly different embeddings or predictions for the same molecule represented by different valid strings is not robust, which can lead to unreliable outcomes in critical applications like drug property prediction. The Augmented Molecular Retrieval (AMORE) framework has emerged as a novel, zero-shot method to quantitatively assess this robustness by testing whether chemically identical molecular representations are treated similarly in a model's embedding space [4]. This guide provides a comparative analysis of AMORE's application to the two predominant molecular string representations: SMILES (Simplified Molecular Input Line Entry System) and SELFIES (Self-Referencing Embedded Strings).
The choice of molecular representation fundamentally influences how a language model processes and learns from chemical data.
SMILES: A line notation that encodes molecular structures using ASCII strings. Atoms are represented by their chemical symbols, bonds are implied or denoted with specific characters, and branches or rings are indicated using parentheses and numbers. A single molecule can have multiple valid SMILES strings due to factors like different starting atoms or branch ordering [4] [1]. For example, benzene can be represented as c1ccccc1 or C1=CC=CC=C1.
SELFIES: A more robust representation designed specifically for machine learning applications. SELFIES is based on a formal grammar that guarantees 100% syntactic validity, meaning every possible string corresponds to a valid molecule. It simplifies complex spatial features like rings and branches into single symbols with explicitly encoded sizes, making it less prone to the invalid outputs that sometimes plague SMILES-based generative models [1] [2].
The table below summarizes their key characteristics:
Table 1: Fundamental Comparison of SMILES and SELFIES
| Feature | SMILES | SELFIES |
|---|---|---|
| Primary Strength | Human-readable, widespread adoption [1] | Guaranteed syntactic validity [1] [2] |
| Validity Guarantee | No - can generate invalid structures [1] | Yes - every string is valid [1] |
| Representation of Rings/Branches | Complex grammar [2] | Single symbols with explicit length encoding [2] |
| Number of Valid Strings per Molecule | Multiple [4] | Multiple |
Chemical language models (ChemLMs) are often trained on large datasets of molecular strings. A significant challenge arises because these models can overfit to the specific textual patterns in their training data rather than learning the underlying chemical principles. For instance, a model might fail to recognize that two different SMILES strings represent the same molecule, treating them as distinct entities [4] [53]. This lack of robustness—the model's stability against permissible variations in input—undermines its reliability. Standard natural language processing (NLP) metrics like BLEU or ROUGE are insufficient for detecting this problem, as they focus on textual overlap rather than chemical equivalence [4].
The AMORE framework is designed to probe the robustness of a ChemLM's internal representations directly. Its core hypothesis is that different string representations of the same molecule should yield similar embeddings in the model's latent space. If these embeddings are distant, it indicates that the model is sensitive to semantically meaningless syntactic variations [4].
The experimental protocol for AMORE involves several key steps [4]:
Diagram 1: AMORE Framework Workflow. This diagram illustrates the core steps of the AMORE protocol, from molecular input to robustness evaluation.
Experiments using the AMORE framework have revealed that many state-of-the-art ChemLLMs are not robust to different SMILES representations. The embeddings of the same molecule under different SMILES variations can be surprisingly distant, and in many cases, the nearest neighbor to an augmented SMILES might be the embedding of a completely different molecule [4]. This demonstrates that these models have not learned an invariant representation of molecular structure and are often misled by superficial textual differences.
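This failure mode can be quantified as a top-1 retrieval accuracy: for each molecule, embed an augmented SMILES and check whether its nearest neighbor among the canonical embeddings belongs to the same molecule. The plain-Python sketch below captures the idea; AMORE's exact metrics and similarity functions may differ [4]:

```python
import math

def top1_retrieval_accuracy(canonical_embs, augmented_embs):
    """Fraction of augmented-SMILES embeddings whose nearest canonical
    embedding (by cosine similarity) is the same molecule (same index)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    hits = 0
    for i, aug in enumerate(augmented_embs):
        nearest = max(range(len(canonical_embs)),
                      key=lambda j: cos(aug, canonical_embs[j]))
        hits += (nearest == i)
    return hits / len(augmented_embs)

canon = [[1.0, 0.0], [0.0, 1.0]]       # canonical-SMILES embeddings
aug = [[0.9, 0.1], [0.2, 0.8]]         # augmented variants, same molecule order
print(top1_retrieval_accuracy(canon, aug))  # 1.0 -> fully robust on this toy set
```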
While AMORE was initially applied to SMILES-based models, its principles are directly applicable to any string-based representation, including SELFIES. The search for more robust representations has led researchers to compare the performance of models using SMILES and SELFIES, often in conjunction with different tokenization strategies.
The table below summarizes experimental findings from comparative studies:
Table 2: Comparative Performance of SMILES and SELFIES in Model Tasks
| Representation | Tokenization / Model | Task/Dataset | Performance Metric | Result | Notes |
|---|---|---|---|---|---|
| SMILES | Atom Pair Encoding (APE) | Biophysics/Physiology Classification [1] | ROC-AUC | Significant Improvement over BPE | APE preserves chemical integrity [1] |
| SELFIES | (Classical Models) | Molecular Property Prediction (SIDER) [2] | ROC-AUC | Baseline Performance | - |
| Augmented SELFIES | (Classical Models) | Molecular Property Prediction (SIDER) [2] | ROC-AUC | +5.97% Improvement over SMILES | Data augmentation enhances learning [2] |
| Augmented SELFIES | QK-LSTM (Hybrid Quantum-Classical) [2] | Molecular Property Prediction (SIDER) [2] | ROC-AUC | +5.91% Improvement over SMILES | Effective in hybrid quantum-classical models [2] |
The data indicates that the choice of representation and its processing pipeline significantly impacts model performance.
Implementing and evaluating molecular representations and robustness frameworks requires a suite of standardized tools and datasets.
Table 3: Key Research Reagents and Computational Tools
| Item Name | Function / Purpose | Relevance to Research |
|---|---|---|
| AMORE Framework | A zero-shot evaluation framework to assess the robustness of chemical language models [4]. | Core methodology for measuring model invariance to semantically equivalent input variations. |
| SELFIES (v2.0) | A molecular string representation that guarantees 100% syntactic validity [1] [2]. | Provides a robust alternative to SMILES for training and evaluating models. |
| MoleculeNet Benchmark | A standardized benchmark suite containing multiple datasets for molecular property prediction [2]. | Enables fair and consistent evaluation of model performance across diverse chemical tasks. |
| SMILES Augmentation Tools | Software libraries (e.g., in RDKit) to generate valid, alternative SMILES strings for a given molecule [4]. | Essential for creating the augmented datasets required by the AMORE framework and for training data enhancement. |
| Atom Pair Encoding (APE) | A domain-specific tokenization method designed for chemical languages like SMILES and SELFIES [1]. | Improves model performance by creating tokens that maintain the structural meaning of molecules. |
| SIDER Dataset | A dataset containing information on marketed medicines and their recorded adverse drug reactions [2]. | A key benchmark dataset for predicting molecular properties like side effects. |
The rigorous evaluation of model robustness is as critical as the pursuit of high predictive accuracy. The AMORE framework provides a powerful, chemically-grounded method for this purpose, revealing that many modern ChemLMs are not invariant to different SMILES representations. Comparative studies show that SELFIES, particularly when used with augmentation strategies and modern tokenizers like APE, presents a strong path toward building more robust and reliable models. The future of molecular representation lies in developing formats and training paradigms that explicitly encode chemical equivalence, moving beyond superficial string matching to achieve a deeper, more robust understanding of molecular structure.
The accurate computational representation of molecules is a foundational challenge in modern drug discovery and materials science. The evolution from traditional string-based formats like SMILES (Simplified Molecular-Input Line-Entry System) to more robust alternatives like SELFIES (SELF-referencing Embedded Strings) has framed a critical research thesis: how effectively can different molecular representations preserve chemical validity and enable predictive accuracy when integrated with advanced AI paradigms? [3] [1]. This guide objectively compares the performance of modern machine learning techniques—specifically contrastive learning and prompt learning—that are engineered to incorporate chemical prior knowledge, thereby evaluating their success in leveraging SMILES, SELFIES, and graph-based representations for molecular property prediction.
At the core of AI-driven chemistry lies the task of translating molecular structures into a computer-readable format. The choice of representation fundamentally influences a model's ability to learn accurate structure-property relationships [3].
Contrastive learning is a self-supervised paradigm where models learn representations by comparing data points. The core objective is to minimize the distance between embeddings of similar molecules ("positive pairs") while maximizing the distance between dissimilar ones ("negative pairs") [54] [55]. The key innovation in modern methods lies in how they define these pairs using chemical knowledge, moving beyond structure-agnostic augmentation.
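The standard objective for such positive/negative pairs is the NT-Xent loss used by MolFCL and many related frameworks [55]. The single-pair, list-based version below is a didactic sketch of a loss that real implementations compute batched over tensors:

```python
import math

def nt_xent(z_i, z_j, others, tau=0.5):
    """NT-Xent loss for one positive pair (z_i, z_j) against a set of
    negative embeddings `others`, with temperature tau."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    pos = math.exp(cos(z_i, z_j) / tau)
    denom = pos + sum(math.exp(cos(z_i, z_k) / tau) for z_k in others)
    return -math.log(pos / denom)

# A tight positive pair gives a small loss; a confusable one gives a larger loss.
low = nt_xent([1, 0], [1, 0.1], others=[[0, 1], [-1, 0]])
high = nt_xent([1, 0], [0, 1], others=[[1, 0.1], [-1, 0]])
print(low, high)
assert low < high
```

Chemistry-aware methods differ from generic contrastive learning mainly in how the positive pairs are constructed, not in this objective itself.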
**1. MolFCL: Fragment-based Contrastive Learning**
MolFCL introduces chemical knowledge directly into the contrastive learning framework through molecular fragments and their reactions [55].
**2. Knowledge-Guided Graph Augmentation**
Other approaches focus on creating meaningful augmented views for the contrastive learning pipeline by leveraging chemical similarity or substituting bioisosteres (atoms or molecular fragments with similar chemical or physical properties) [55]. This ensures that the augmented view retains the semantic meaning and chemical validity of the original molecule, leading to more informative positive pairs.
The following diagram illustrates the general workflow for incorporating chemical prior knowledge into a contrastive learning framework, as exemplified by models like MolFCL.
Prompt learning adapts large pre-trained models to downstream tasks by incorporating task-specific instructions or "prompts" directly into the input, avoiding the cost of full model fine-tuning. In molecular AI, this involves embedding chemical knowledge—such as functional groups—into the prompt to guide the model's predictions [55] [56].
**1. MolFCL's Functional Group Prompt Tuning**
In its fine-tuning phase, MolFCL employs a prompt learning strategy that integrates knowledge of functional groups, the substructures of atoms that determine a molecule's characteristic chemical reactions [55].
**2. MolFinePrompt: Fine-Grained Multimodal Prompting**
MolFinePrompt is a multimodal model that integrates molecular structures with textual descriptions [56].
Given the complementary strengths of different molecular representations, a prominent trend is to develop models that fuse multiple views.
**MoL-MoE: Multi-view Mixture-of-Experts**
This framework integrates the latent spaces derived from SMILES, SELFIES, and molecular graphs to predict molecular properties [18].
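The fusion step of a multi-view mixture-of-experts can be sketched as a softmax gate over per-view embeddings. The gate logits here are hand-set for illustration; the actual MoL-MoE routing and architecture are not specified in this text, so treat every name below as an assumption:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(view_embeddings, gate_logits):
    """Fuse per-view molecule embeddings (e.g. SMILES-, SELFIES-, and
    graph-derived) as a gate-weighted sum. In a real model a learned router
    would produce the gate logits per molecule."""
    weights = softmax(gate_logits)
    dim = len(view_embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, view_embeddings))
            for d in range(dim)]

views = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # SMILES / SELFIES / graph views
fused = moe_fuse(views, gate_logits=[2.0, 0.0, 0.0])
print(fused)   # dominated by the first (SMILES) view
```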
The following tables summarize experimental data for the discussed approaches, providing a quantitative comparison of their performance on benchmark tasks.
Table 1: Performance Comparison of Models Using Chemical Prior Knowledge
| Model | Core Approach | Key Chemical Prior | Benchmark (Number of Datasets) | Reported Performance vs. SOTAs |
|---|---|---|---|---|
| MolFCL [55] | Fragment-based Contrastive Learning + Prompt Tuning | Molecular fragment reactions; Functional groups | MoleculeNet & TDC (23) | Superior performance on all 23 datasets |
| MolFinePrompt [56] | Multimodal Pre-training + Knowledge-guided Prompts | Molecular substructures; Textual descriptions | Property prediction; Drug interaction | Superior performance on benchmark tasks |
| MoL-MoE [18] | Multi-view Mixture-of-Experts | SMILES, SELFIES, and Molecular Graphs | MoleculeNet (9) | Superior performance across all 9 datasets |
Table 2: Comparative Analysis of Molecular Representation Languages in Generative Models
| Representation | Primary Characteristic | Key Finding in Generative Context (Diffusion Model) [17] |
|---|---|---|
| SMILES | Compact, human-readable; can generate invalid structures | Excels in QEPPI and SA score metrics for generated molecules |
| SELFIES | 100% syntactically valid; robust for generation | High similarity to SMARTS; performs best on QED metric |
| SMARTS | Allows structural patterning | High similarity to SELFIES |
| IUPAC | Human-language-like, verbose | Highest novelty and diversity of generated molecules |
The following table details key computational tools and resources referenced in the featured research, essential for replicating and advancing work in this field.
Table 3: Key Research Reagents and Computational Resources
| Resource Name | Type | Primary Function in Research |
|---|---|---|
| ZINC15 [55] | Database | A large, publicly accessible database of commercially available chemical compounds, used for pre-training models on unlabeled molecular data. |
| MoleculeNet [55] | Benchmark Suite | A standard benchmark collection for molecular machine learning, providing datasets for evaluating property prediction tasks. |
| TDC (Therapeutics Data Commons) [55] | Benchmark Suite | A platform providing numerous datasets and benchmarks across the entire drug discovery pipeline, including ADMET property prediction. |
| BRICS [55] | Algorithm | A method for decomposing molecules into retrosynthetically interesting chemical substructures, used to build fragment-based augmented views. |
| CMPNN [55] | Graph Neural Network | A type of graph encoder (Communicative Message Passing Neural Network) used to learn powerful representations from molecular graphs. |
| NT-Xent Loss [55] | Loss Function | The normalized temperature-scaled cross entropy loss, used as the objective function in many contrastive learning frameworks. |
The integration of chemical prior knowledge through contrastive and prompt learning marks a significant leap beyond treating molecular representations as mere strings or graphs. The experimental data consistently shows that models which explicitly incorporate chemical knowledge—be it fragments, functional groups, or multi-view representations—achieve state-of-the-art performance across diverse molecular prediction tasks [55] [56] [18].
In the broader thesis of evaluating SMILES versus SELFIES, the findings indicate that the "best" representation can be task-dependent. While SELFIES offers unparalleled robustness for generation [1], SMILES can still yield excellent results on specific predictive metrics [17]. Furthermore, models that avoid the choice entirely by fusing multiple views, like MoL-MoE, often achieve the strongest overall results [18]. This suggests that the future of molecular representation may not lie in a single, universal format, but in flexible, knowledge-informed models that can synthesize information from multiple complementary perspectives.
The accurate computational representation of molecules is a foundational challenge in modern drug discovery and materials science. Molecular representations serve as the critical bridge between chemical structures and the prediction of their biological activity, physicochemical properties, and ultimate therapeutic potential. Traditional representation methods, including string-based formats like Simplified Molecular Input Line Entry System (SMILES) and graph-based approaches, have enabled significant advances yet come with inherent limitations. SMILES notations, while compact and human-readable, can generate semantically invalid strings and struggle with consistent representation of complex chemical classes [1]. In response, SELF-referencing Embedded Strings (SELFIES) emerged, guaranteeing 100% validity by ensuring every string corresponds to a syntactically correct molecule [51]. Concurrently, graph-based representations provide a more intuitive depiction of molecular structure by representing atoms as nodes and bonds as edges [3].
Recent innovation has shifted from relying on a single representation type toward hybrid models that integrate multiple views of molecular data. These multi-modal approaches leverage the complementary strengths of various representations—such as the sequential patterns captured by language models and the structural relationships encoded by graph networks—to achieve superior predictive performance and robustness [18] [57]. This guide objectively compares the performance of these emerging combined approaches against traditional single-modality methods, providing researchers with experimentally validated insights for selecting optimal molecular representation strategies.
Rigorous evaluation across standardized benchmarks reveals distinct performance patterns for different molecular representation strategies. The integration of multiple representation types consistently outperforms single-modality approaches across diverse molecular prediction tasks.
Table 1: Performance Comparison of Representation Approaches on MoleculeNet Benchmarks (AUC-ROC Scores)
| Representation Approach | HIV | Tox21 | BBBP | Average Validity | Key Strengths |
|---|---|---|---|---|---|
| SMILES-Only (ChemBERTa) | 0.780 | 0.841 | 0.890 | ~89% | Computational efficiency, established baselines |
| SELFIES-Only | 0.772 | 0.835 | 0.885 | 100% | Guaranteed validity, robustness for generation |
| Graph-Only (GNN) | 0.801 | 0.852 | 0.901 | N/A (inherently structural) | Structural awareness, relational reasoning |
| Multi-View (MoL-MoE) | 0.823 | 0.868 | 0.918 | >95%* | Complementary feature learning, adaptability |
Note: Multi-view validity depends on constituent representations; combined approaches can leverage SELFIES for guaranteed validity when needed [18] [1] [51].
The data demonstrates that while single-modality approaches provide solid baseline performance, integrated multi-view frameworks consistently achieve superior results. The MoL-MoE (Multi-view Mixture-of-Experts) framework, which integrates SMILES, SELFIES, and graph-based representations, achieved state-of-the-art performance across all nine MoleculeNet benchmark datasets evaluated, with particularly strong showings in biophysics and physiology classification tasks [18]. This performance advantage stems from the model's ability to dynamically adjust its reliance on different molecular representations based on task-specific requirements, effectively leveraging the most informative features from each modality [18].
Tokenization strategy significantly influences the performance of language-based molecular representations. Recent comparative studies have evaluated different tokenization approaches for chemical language models.
Table 2: Tokenization Method Performance for Molecular Property Prediction
| Tokenization Method | Representation Format | ROC-AUC (HIV) | ROC-AUC (Tox21) | Structural Integrity | Implementation Complexity |
|---|---|---|---|---|---|
| Byte Pair Encoding (BPE) | SMILES | 0.780 | 0.841 | Moderate | Low |
| Byte Pair Encoding (BPE) | SELFIES | 0.772 | 0.835 | High | Low |
| Atom Pair Encoding (APE) | SMILES | 0.792 | 0.851 | High | Moderate |
| Atom Pair Encoding (APE) | SELFIES | 0.785 | 0.843 | High | Moderate |
The novel Atom Pair Encoding (APE) tokenizer, specifically designed for chemical languages, demonstrates notable performance advantages over traditional Byte Pair Encoding (BPE) by better preserving the contextual relationships and structural integrity of molecular representations [1]. When combined with SMILES representations, APE achieved the highest classification accuracy in molecular property prediction tasks, though it requires more specialized implementation than generic tokenization approaches [1].
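To make the contrast with generic BPE concrete, the sketch below shows the standard regex-based, atom-aware way to segment a SMILES string so that multi-character atoms like `Cl` and bracket atoms like `[C@@H]` stay intact. This is a common community pattern, not the published APE implementation; the regex is a simplified assumption and omits rarer tokens.

```python
import re

# Simplified SMILES token pattern: bracket atoms, two-letter elements,
# chirality marks, two-digit ring closures, single atoms, bonds/branches,
# and ring-closure digits. Not the published APE tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|@@|%\d{2}|[BCNOPSFIbcnops]|[=#\-+\\/():.@]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the original string.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

A generic BPE tokenizer trained on raw characters can split `Cl` into `C` + `l`, silently turning chlorine into a carbon followed by an invalid character; atom-level segmentation like this is what chemistry-aware tokenizers preserve.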
The Multi-view Mixture-of-Experts (MoL-MoE) framework represents a sophisticated approach to integrating multiple molecular representations. Its architecture employs a gating network that selectively activates specialized expert sub-networks for each representation modality [18].
Multi-View Mixture of Experts Architecture
The MoL-MoE framework employs 12 expert networks organized into three modality groups (SMILES, SELFIES, and molecular graphs), with four experts dedicated to each representation type [18]. The gating network learns to route inputs through the most relevant experts based on the specific molecular characteristics and prediction task. During experimentation, two routing activation settings (k=4 and k=6) were evaluated, with the model demonstrating robust performance across both configurations [18]. Analysis of routing patterns revealed that the model dynamically adjusts its use of different molecular representations based on task-specific requirements, preferentially activating graph experts for structural property prediction and language-based experts for sequence-dependent tasks [18].
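The routing behavior described above can be sketched numerically. The toy below implements top-k gating over 12 scalar "experts" in three groups of four, mirroring the MoL-MoE layout; the random logits and scalar outputs are stand-ins for a trained gating network and expert embeddings, so this illustrates only the mechanism, not the published model.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(gate_logits, expert_outputs, k):
    """Combine expert outputs using only the k highest-scoring experts.

    gate_logits: one score per expert (from a learned gating network).
    expert_outputs: one scalar prediction per expert (stand-in for vectors).
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return sum(probs[i] / norm * expert_outputs[i] for i in top)

# 12 experts in three modality groups of four (SMILES, SELFIES, graph);
# logits and outputs here are random stand-ins for learned values.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(12)]
outputs = [random.gauss(0, 1) for _ in range(12)]
print(route_top_k(logits, outputs, k=4))
```

With k=4 only a third of the experts fire per input, which is what lets the gate shift weight toward graph experts or string experts depending on the task.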
For drug-target interaction (DTI) prediction, the Hetero-KGraphDTI framework combines graph representation learning with knowledge-based regularization to achieve state-of-the-art performance [57].
Knowledge-Enhanced Graph Learning Workflow
This framework constructs a heterogeneous graph that integrates multiple data types, including chemical structures, protein sequences, and interaction networks [57]. The graph convolutional encoder employs a multi-layer message passing scheme that aggregates information from different edge and node types, while an attention mechanism learns to assign importance weights to edges based on their prediction relevance [57]. The knowledge integration component incorporates biological knowledge from Gene Ontology (GO) and DrugBank as regularization constraints, encouraging learned embeddings to maintain biological plausibility and improving model interpretability through attention weight visualization that identifies salient molecular substructures and protein motifs [57].
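The message-passing step at the heart of such graph encoders reduces to a simple neighborhood aggregation. The sketch below is a deliberately minimal scalar version on an adjacency list; the real Hetero-KGraphDTI uses typed edges, learned weight matrices, and attention, none of which are modeled here.

```python
def message_passing_round(adj, features):
    """One round of sum-aggregation message passing on an undirected graph.

    adj: adjacency list {node: [neighbors]}; features: {node: feature value}.
    Each node's new feature is its own value plus the sum of its neighbors'.
    """
    return {
        node: features[node] + sum(features[nbr] for nbr in adj[node])
        for node in adj
    }

# Toy molecular graph: a three-atom chain 0-1-2 with scalar atom features.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: 1.0, 1: 2.0, 2: 3.0}
print(message_passing_round(adj, feats))  # {0: 3.0, 1: 6.0, 2: 5.0}
```

Stacking several such rounds lets information propagate across the molecular graph, which is the structural awareness that string representations lack.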
Table 3: Essential Research Tools for Multi-View Molecular Representation
| Tool/Category | Specific Examples | Function | Implementation Considerations |
|---|---|---|---|
| Molecular Representation Converters | RDKit [27], OpenBabel | Generates multiple representations from chemical structures | RDKit provides robust image generation for vision-based approaches |
| Foundation Models | CLIP [27], RoBERTa [58], GPT-series [51] | Pretrained backbones for transfer learning | Vision foundation models (CLIP) enable few-shot molecular image learning |
| Benchmark Datasets | MoleculeNet [18] [27], ChEMBL [27], DrugBank [57] | Standardized evaluation and pretraining | ChEMBL-25 provides 1.9M bioactive molecules for pretraining |
| Graph Neural Networks | GCNs [57], GATs [57], Message Passing Networks [57] | Structural relationship learning | Graph attention mechanisms improve interpretability |
| Tokenization Tools | Atom Pair Encoding [1], Byte Pair Encoding [1] | Text segmentation for language models | APE preserves chemical context better than BPE |
| Validation Frameworks | SMILES/SELFIES validators [51], Chemical Checker | Ensures molecular validity | SELFIES guarantees 100% validity for generated molecules |
The SmiSelf framework addresses a critical challenge in molecular generation: ensuring 100% validity of outputs while maintaining strong performance on other metrics [51].
SmiSelf Invalid SMILES Correction Workflow
SmiSelf operates as a cross-chemical language framework that converts invalid SMILES generated by large language models into SELFIES using grammatical rules, then transforms them back into SMILES, leveraging SELFIES' inherent robustness to ensure 100% validity [51]. Experimental results demonstrate that SmiSelf not only guarantees complete validity but also preserves molecular characteristics and maintains or even enhances performance on other key metrics, including structural similarity and property prediction accuracy [51]. This approach is particularly valuable for few-shot and zero-shot learning scenarios where LLMs struggle with the strict syntactic rules of SMILES notation [51].
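To give a feel for grammar-based correction, the toy below repairs two of the most common syntactic failure modes of LLM-generated SMILES: unbalanced parentheses and unpaired ring-closure digits. This is emphatically not the SmiSelf algorithm, which routes invalid strings through the SELFIES grammar; it is a stdlib-only illustration of the general idea that syntax can be repaired mechanically.

```python
def repair_smiles_syntax(smiles: str) -> str:
    """Toy syntax repair: balance parentheses and drop unpaired ring digits.

    Illustrative only; the SmiSelf framework instead converts the string
    through SELFIES, whose grammar guarantees a valid decode.
    """
    counts = {d: smiles.count(d) for d in "123456789"}
    seen = {d: 0 for d in counts}
    out, depth = [], 0
    for ch in smiles:
        if ch in counts:
            seen[ch] += 1
            # Keep only fully paired occurrences of each ring digit.
            if seen[ch] > counts[ch] - counts[ch] % 2:
                continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue  # unmatched close: skip it
            depth -= 1
        out.append(ch)
    out.extend(")" * depth)  # close any branches left open
    return "".join(out)

print(repair_smiles_syntax("CC(C1CC"))  # → "CC(CCC)"
```

A production pipeline would validate the repaired string with a cheminformatics parser (e.g. RDKit) before use, since syntactic repair alone cannot fix chemically impossible valences.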
The field of molecular representation continues to evolve rapidly, with several promising research directions emerging. Studies evaluating large language models have revealed statistically significant zero- and few-shot preferences for certain molecular representations, with InChI and IUPAC names surprisingly outperforming SMILES in some contexts, potentially due to their granularity, favorable tokenization, and prevalence in pretraining corpora [59]. This finding contradicts previous assumptions that SMILES should be the default representation for molecular property prediction tasks.
Another significant trend involves leveraging foundation models from computer vision and natural language processing as backbones for molecular representation learning. The MoleCLIP framework demonstrates that initializing molecular image encoders with weights from OpenAI's CLIP model significantly reduces the volume of molecular pretraining data required to match state-of-the-art performance [27]. This approach also enhances robustness to distribution shifts, enabling effective adaptation to specialized domains like homogeneous catalysis with limited task-specific data [27].
The experimental evidence consistently demonstrates that combining graph and language-based molecular representations achieves superior performance across diverse chemical informatics tasks. While SMILES remains the most suitable representation for molecule generation using current large language models [51], and SELFIES provides unparalleled validity guarantees, graph-based approaches offer indispensable structural awareness. The most robust solutions strategically integrate multiple representation types, often through mixture-of-experts architectures or knowledge-enhanced graph learning frameworks.
For researchers and drug development professionals, the optimal molecular representation strategy depends on specific task requirements, data availability, and validity constraints. For property prediction tasks, multi-view approaches consistently deliver state-of-the-art accuracy. For generative applications where validity is paramount, SELFIES-based approaches or correction frameworks like SmiSelf provide essential safeguards. As molecular representation research continues to advance, the strategic combination of complementary approaches will remain essential for extracting maximum predictive power from molecular data and accelerating drug discovery pipelines.
In the field of computational chemistry and drug discovery, the representation of molecules is a foundational element that directly influences the success of machine learning models. Simplified Molecular Input Line Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) are two prominent string-based representations that enable researchers to treat molecules as sequences, similar to words in a sentence, for processing by natural language models [1]. The evaluation of these representations hinges on three core metrics: validity (does the string correspond to a chemically plausible molecule?), accuracy (how well does the representation predict molecular properties?), and robustness (does the model recognize different string representations of the same molecule?) [4] [60]. This guide provides an objective comparison of SMILES and SELFIES using recently published experimental data, offering researchers a framework for selecting appropriate representations for their specific applications.
Validity ensures that a molecular string corresponds to a syntactically correct and chemically plausible structure. This is particularly crucial for generative models in de novo molecular design.
A direct method for testing validity involves generating random strings from each representation's alphabet and measuring the success rate of decoding them into valid molecules [60].
SELFIES expresses branches and rings with dedicated tokens such as [Branch1] and [Ring1], which is central to this guarantee.
Table 1: Validity comparison between SELFIES and SMILES based on random string generation [60]
| Representation | Alphabet Size | Valid / Tested | Validity Rate | Key Failure Modes |
|---|---|---|---|---|
| SELFIES | 69 symbols | 5/5 | 100% | None; guaranteed by formal grammar |
| SMILES | ~19 common characters | 0/20 | 0% | Unbalanced parentheses, incorrect ring closures, invalid valences |
The experimental data demonstrates a fundamental advantage of SELFIES: its underlying grammar guarantees that every possible string decodes to a valid molecule [60]. This 100% validity is inherent to its design, which uses single symbols to represent structural features like rings and branches, explicitly encoding their size. In contrast, SMILES requires strict syntactic correctness across multiple dimensions. Random SMILES strings fail due to unbalanced parentheses for branches, improper ring closure numbering, or chemically impossible atom valences [60]. This makes SELFIES distinctly superior for generative tasks, as it eliminates the problem of invalid molecule output.
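The random-string experiment is easy to reproduce in spirit without any cheminformatics dependency. The sketch below draws random strings from a small alphabet of common SMILES characters and applies a cheap syntax check that captures only necessary conditions for validity (balanced parentheses, paired ring digits, no empty branches); the alphabet and length are assumptions, and a real evaluation would decode each string with RDKit instead.

```python
import random

# A small alphabet of common SMILES characters (an assumption for the sketch).
SMILES_ALPHABET = list("CNOSPFcnos()=#-123456789")

def looks_syntactically_valid(s: str) -> bool:
    """Cheap necessary (not sufficient) conditions for SMILES validity:
    balanced parentheses, every ring digit paired, no empty branches."""
    depth, prev = 0, ""
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0 or prev == "(":
                return False
        prev = ch
    if depth != 0:
        return False
    return all(s.count(d) % 2 == 0 for d in "123456789")

random.seed(1)
trials = 1000
valid = sum(
    looks_syntactically_valid("".join(random.choices(SMILES_ALPHABET, k=12)))
    for _ in range(trials)
)
print(f"{valid}/{trials} random strings pass even this weak check")
```

Even this permissive check rejects most random strings, and the survivors still face valence rules a full parser would enforce; by construction, every random SELFIES token sequence decodes to some valid molecule.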
Accuracy measures a representation's effectiveness in predicting molecular properties in downstream tasks, a key requirement for accelerating drug discovery.
A standard protocol for assessing accuracy involves training transformer models (e.g., BERT architectures) on different molecular representations and evaluating their performance on benchmark datasets [1] [7].
Table 2: Accuracy comparison (ROC-AUC) of SMILES and SELFIES on classification tasks [1]
| Representation | Tokenization | HIV Dataset | Toxicity Dataset | BBBP Dataset |
|---|---|---|---|---|
| SMILES | BPE | 0.782 | 0.856 | 0.901 |
| SMILES | APE | 0.816 | 0.882 | 0.932 |
| SELFIES | BPE | 0.791 | 0.861 | 0.914 |
| SELFIES | APE | 0.802 | 0.874 | 0.925 |
Table 3: Accuracy comparison (RMSE) on regression tasks [7] [2]
| Representation | Model | ESOL | FreeSolv | Lipophilicity |
|---|---|---|---|---|
| SMILES | ChemBERTa-77M-MLM | 0.688 | 1.890 | 0.650 |
| SELFIES | SELFormer | 0.612 | 1.750 | 0.598 |
| SELFIES | Domain-Adapted ChemBERTa | 0.944 | 2.511 | 0.746 |
Accuracy is highly dependent on both the representation and the tokenization strategy. The novel Atom Pair Encoding (APE) tokenizer, designed for chemical languages, consistently outperforms the more general BPE across both SMILES and SELFIES representations [1]. When the representations are compared directly, their performance is often comparable, with each winning in different scenarios. SELFIES-based models like SELFormer can achieve state-of-the-art results, sometimes surpassing larger models trained on SMILES, despite using a fraction of the pre-training data [7]. Furthermore, research shows that augmenting SELFIES (creating multiple valid representations of each training molecule) yields statistically significant accuracy improvements over SMILES of 5.97% in classical models and 5.91% in hybrid quantum-classical models [2]. This suggests that the robustness of SELFIES can be leveraged to enhance predictive performance.
Robustness assesses whether a model recognizes that different string representations encode the same underlying molecule. This is a key indicator of true chemical understanding.
The Augmented Molecular Retrieval (AMORE) framework provides a method for evaluating robustness in a zero-shot manner without the need for expensive labeled data [4].
AMORE Framework Workflow for Evaluating Robustness
Experiments using AMORE have revealed that many state-of-the-art chemical language models are not robust to different SMILES representations of the same molecule [4]. The embedding similarities between a molecule and its augmented variants are often low, indicating that the models overfit to specific string patterns rather than learning the underlying chemical identity. This lack of robustness is a significant limitation of models trained primarily on SMILES. SELFIES' stricter, self-referencing grammar may mitigate this issue, though its performance under this framework remains an open research question.
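The robustness metric itself is straightforward to sketch: embed the canonical string and each augmented variant, then average the pairwise cosine similarities. The encoder below is a character-frequency stub standing in for a chemical language model's hidden states; the function names and the stub are assumptions, not the AMORE implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def robustness_score(embed, canonical: str, augmented: list[str]) -> float:
    """Mean embedding similarity between a molecule's canonical string and
    augmented (chemically equivalent) strings. A robust model scores near 1."""
    ref = embed(canonical)
    return sum(cosine(ref, embed(s)) for s in augmented) / len(augmented)

def stub_embed(s: str):
    # Character-frequency vector as a stand-in for a pretrained encoder.
    alphabet = "CNOcnos()=#123456789"
    return [s.count(ch) + 1e-6 for ch in alphabet]

# Benzene written two equivalent ways: kekulized vs. aromatic form.
score = robustness_score(stub_embed, "c1ccccc1", ["C1=CC=CC=C1"])
print(round(score, 3))
```

Plugging a real model's encoder in place of `stub_embed` and SMILES enumeration in place of the hand-written variant reproduces the kind of zero-shot check AMORE performs.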
Table 4: Key software and data resources for molecular representation research
| Resource Name | Type | Primary Function | Relevance to SMILES/SELFIES |
|---|---|---|---|
| RDKit | Cheminformatics Library | Molecule manipulation, descriptor calculation, and validation [60] | The primary tool for converting SMILES/SELFIES to molecular objects and checking validity. |
| selfies (Python library) | Specialized Library | Encodes and decodes SELFIES strings [60] | Essential for working with the SELFIES representation and its alphabet. |
| Hugging Face Transformers | NLP Library | Provides state-of-the-art transformer architectures like BERT [1] | Used to build and fine-tune chemical language models. |
| MoleculeNet | Benchmark Dataset Collection | Curated datasets for molecular machine learning [1] [2] | Standard benchmark for evaluating prediction accuracy (e.g., ESOL, BBBP, HIV). |
| PubChem | Chemical Database | Large repository of molecules and their properties [1] [7] | Source of millions of SMILES strings for large-scale pre-training of models. |
The choice between SMILES and SELFIES involves a strategic trade-off between convenience and robustness, which should be guided by the specific application.
In conclusion, while SMILES remains a powerful and widely used representation, SELFIES addresses several of its key weaknesses. The future of molecular representation may not be a choice between the two, but rather their intelligent integration, as seen in multi-view models that leverage the complementary strengths of both to achieve superior predictive performance [18].
In the field of computer-aided synthesis planning, deep learning has transformed how chemists approach organic synthesis. These models significantly reduce the time and resources required compared to traditional trial-and-error approaches [61]. A fundamental aspect of these models is how they represent molecules, as the choice of representation directly influences the accuracy of predicting reaction products (forward synthesis) or precursor molecules (retrosynthesis) [3]. Among the various options, string-based representations that leverage natural language processing techniques have gained significant traction [61]. This guide objectively compares the performance of major molecular string representations—SMILES, SELFIES, SAFE, t-SMILES, and the emerging fragSMILES—in forward and retrosynthesis prediction tasks, giving researchers experimental data to inform their selection.
To ensure fair comparison across different molecular representations, researchers have established standardized evaluation protocols. The following workflow outlines the typical experimental setup for benchmarking performance in synthesis prediction tasks, based on established methodologies in the field [61] [62]:
Core Experimental Components
Table 1: Key computational tools and resources for synthesis prediction research
| Tool/Resource | Type | Primary Function | Application in Research |
|---|---|---|---|
| USPTO Dataset | Chemical Reaction Database | Provides curated reaction data for training and benchmarking | Standardized evaluation of model performance [61] [62] |
| Transformer Architecture | Deep Learning Model | Sequence-to-sequence translation of molecular representations | Core prediction engine for forward and retrosynthesis tasks [61] [64] |
| RDKit | Cheminformatics Toolkit | Cheminformatics operations, molecule validation, and fingerprint generation | Molecular processing, standardization, and metric calculation [62] |
| BRICS Algorithm | Fragment-Based Method | Retrosynthetic fragmentation of molecules into building blocks | Construction of fragment-based representations and training data augmentation [64] |
| fragSMILES Generator | Molecular Representation | Converts molecules to fragment-based chiral-aware strings | Creating input for fragment-based prediction models [61] |
Forward synthesis prediction involves predicting the products of a given set of reactants. The following table compares the performance of different molecular representations on this task, with validity ensuring chemically plausible outputs and accuracy measuring exact matches to expected products [61]:
Table 2: Forward synthesis prediction performance on USPTO test set (n=50,234 reactions)
| Representation | Validity (Top-1) | Validity (Top-5) | Accuracy (Top-1) | Accuracy (Top-5) |
|---|---|---|---|---|
| SMILES | 96.3% | ~99.5% | ~20.9% | ~35.6% |
| SELFIES | 96.4% | 98.2% | 21.0% | 33.0% |
| SAFE | 92.8% | 97.6% | 30.2% | 44.1% |
| t-SMILES | 100.0% | 100.0% | 6.1% | 12.0% |
| fragSMILES | ~96.8% | 99.5% | 53.4% | 67.1% |
Retrosynthesis prediction aims to identify potential reactants and reagents needed to synthesize a target molecule. The performance across representations varies significantly, with notable trade-offs between validity and accuracy [61]:
Table 3: Retrosynthesis prediction performance on USPTO test set (n=50,234 reactions)
| Representation | Validity (Top-1) | Validity (Top-5) | Accuracy (Top-1) | Accuracy (Top-5) |
|---|---|---|---|---|
| SMILES | 41.7% | 81.1% | ~12.5% | ~28.5% |
| SELFIES | 79.7% | 97.5% | 0.0% | 0.1% |
| SAFE | 43.6% | 77.7% | 7.4% | 13.9% |
| t-SMILES | ~99.9% | ~100.0% | 0.0% | 0.0% |
| fragSMILES | 55.8% | 88.3% | 8.4% | 20.1% |
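The validity and accuracy columns in Tables 2 and 3 reduce to simple per-reaction checks over a model's ranked beam of candidates. The sketch below computes both for one reaction; the stubbed validity lambda is an assumption for self-containment, and a real benchmark would parse and canonicalize each prediction with RDKit before comparison, then average over the test set.

```python
def top_k_metrics(ranked_predictions, reference, is_valid, k):
    """Top-k validity and exact-match accuracy for one test reaction.

    ranked_predictions: model outputs, best first.
    reference: the recorded (canonicalized) ground-truth string.
    is_valid: callable deciding chemical validity (e.g. an RDKit parse).
    """
    top = ranked_predictions[:k]
    validity = any(is_valid(p) for p in top)
    accuracy = reference in top
    return validity, accuracy

# Toy scoring run with a stubbed validity check.
preds = ["CCO", "C(C)O", "CC(C"]
valid_stub = lambda s: s.count("(") == s.count(")")
print(top_k_metrics(preds, "CCO", valid_stub, k=1))  # (True, True)
print(top_k_metrics(preds, "OCC", valid_stub, k=3))  # (True, False)
```

The second call shows why canonicalization matters: "OCC" and "CCO" denote the same molecule, so string equality without canonicalization undercounts accuracy.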
Chirality recognition presents particular challenges in synthesis prediction. The following diagram illustrates how different representations handle stereochemical complexity, a key differentiator in practical applications [61].
When evaluated on a chiral-enriched subset of reactions (n=8,587), fragSMILES demonstrated superior performance in accurately predicting stereochemical outcomes with 97.7% validity and 48.9% top-1 accuracy, significantly outperforming other representations in this challenging aspect of reaction prediction [61].
Traditional binary accuracy metrics provide limited insight into model performance. The Retro-Synth Score addresses this limitation through a multi-faceted evaluation approach [62].
This framework helps researchers identify "better mistakes" and provides a more realistic assessment of model performance for practical applications [62].
Different molecular representations exhibit distinct error profiles that impact their practical utility.
The comparative analysis reveals that molecular representation choice significantly impacts prediction performance, with distinct trade-offs between validity, accuracy, and stereochemical capability.
For forward synthesis prediction, fragSMILES demonstrates superior performance with 53.4% top-1 accuracy while maintaining high validity, making it the preferred choice for most applications. Its fragment-based approach and chirality awareness provide significant advantages in real-world synthesis planning where stereochemistry is crucial [61].
For retrosynthesis prediction, the choice is more nuanced. While fragSMILES achieves the highest accuracy (8.4% top-1), SELFIES provides exceptional validity (79.7% top-1), making it suitable for applications where chemical plausibility is prioritized over exact matches [61].
Researchers should consider the Retro-Synth Score framework for more comprehensive model evaluation, as it provides nuanced insights beyond traditional accuracy metrics [62]. Additionally, the remarkable validity of t-SMILES (at or near 100% across both tasks) warrants further investigation, as improving its accuracy could yield breakthrough performance [61].
Future research directions include developing hybrid representations that combine the strengths of multiple approaches, enhancing chirality prediction across all representation types, and creating more comprehensive evaluation metrics that better reflect real-world synthetic utility.
Optical Chemical Structure Recognition (OCSR) is the computational process of converting chemical structure depictions from images into machine-readable representations [65]. The accuracy of OCSR systems is fundamentally influenced by the choice of molecular string representation, which serves as the target output format for these recognition tasks. This guide provides a comparative analysis of the performance of predominant molecular representations—SMILES, SELFIES, and DeepSMILES—within OCSR pipelines, synthesizing recent experimental data to inform researchers and drug development professionals.
Molecular string representations encode the graph structure of a molecule into a linear string format, enabling storage, search, and machine learning applications. The performance of these representations varies significantly when used in deep learning-based OCSR models.
SMILES (Simplified Molecular-Input Line-Entry System): A line notation using ASCII strings to represent molecular graphs [65]. While intuitive, it allows for multiple valid strings for the same molecule and predicted strings can be syntactically invalid due to issues like unbalanced parentheses or mismatched ring closure symbols [66] [65].
SELFIES (SELF-referencing Embedded Strings): A representation designed for 100% robustness, guaranteeing that every string decodes to a valid molecule [5]. It uses a grammar based on a formal language that ensures semantic and syntactic validity, making it particularly suitable for generative models [66] [67].
DeepSMILES: A machine learning-oriented syntax that simplifies SMILES by using closing parentheses only for branches and single symbols at ring-closure locations to mitigate common syntactic issues [66] [65]. It addresses some SMILES limitations but does not guarantee molecular validity [65].
The core task in OCSR is accurately translating a 2D bitmap image of a chemical structure into its correct string representation. A 2022 study directly compared SMILES, DeepSMILES, and SELFIES using transformer models on datasets with and without stereochemistry [66].
Table 1: Performance of String Representations on Chemical Image Translation Tasks (ChEMBL Dataset)
| Representation | Accuracy (Without Stereochemistry) | Accuracy (With Stereochemistry) | Key Characteristics |
|---|---|---|---|
| SMILES | Best overall performance [66] | Best overall performance [66] | Susceptible to syntax errors (invalid outputs) [66] |
| DeepSMILES | Intermediate performance [66] | Intermediate performance [66] | Reduced syntactic issues vs. SMILES [66] [65] |
| SELFIES | Lower accuracy than SMILES [66] | Lower accuracy than SMILES [66] | Guarantees 100% valid chemical structures [66] [5] |
| InChI | Not appropriate for the learning task [66] | Not appropriate for the learning task [66] | Low token count, very long maximum string length [66] |
The study concluded that while SMILES exhibits the best overall translation performance, its primary drawback is the production of invalid outputs. Conversely, SELFIES guarantees valid chemical structures, a significant advantage for automated pipelines, albeit at the cost of lower prediction accuracy in the image translation step [66]. DeepSMILES offers a middle ground, performing between SMILES and SELFIES [66].
The introduction of the "in-the-wild" WildMol benchmark in 2025, comprising 20,000 human-annotated samples from real PDFs, provides insights into the performance of modern OCSR tools that typically use SMILES or its variants [68] [69].
Table 2: OCSR Tool Accuracy on the WildMol-10K Benchmark [68]
| OCSR Tool | Underlying Model Type | Reported Accuracy on WildMol-10K |
|---|---|---|
| MolParser | End-to-end (Extended SMILES) | 76.9% [68] |
| MolScribe | Not Specified | 66.4% [68] |
| DECIMER | Transformer-based (SMILES/SELFIES) | 56.0% [68] |
| MolGrapher | Graph-based | 45.5% [68] |
| OSRA 2.1 | Rule-based | 26.3% [68] |
| MolVec 0.9.7 | Rule-based | 26.4% [68] |
| Imago 2.0 | Rule-based | 6.9% [68] |
MolParser's state-of-the-art performance utilizes an "Extended SMILES" (E-SMILES) representation, designed to overcome standard SMILES limitations in representing complex entities like Markush structures, connection points, and polymers found in patents and literature [69]. This demonstrates that enhancements to the SMILES syntax can yield significant accuracy improvements on challenging real-world data.
The comparative study from Digital Discovery (2022) trained transformer models to translate 2D structure depictions into each target string representation, evaluated on datasets with and without stereochemistry [66].
The 2025 MolParser study established a new benchmark for robust OCSR using a similar image-to-string evaluation protocol [69].
The following diagram illustrates the general workflow for experimentally evaluating molecular string representations in an OCSR task, as implemented in recent studies [66] [69].
OCSR Representation Evaluation Workflow
Table 3: Essential Tools and Platforms for OCSR Research
| Tool/Platform | Type | Primary Function in OCSR |
|---|---|---|
| RDKit | Cheminformatics Library | Molecular depiction generation, SMILES manipulation, and fingerprint calculation [66] [69] [67]. |
| Chemistry Development Kit (CDK) | Cheminformatics Library | Open-source library for generating 2D structure depictions and manipulating chemical data [66] [67]. |
| DECIMER Platform | Deep Learning OCSR Suite | Open-source platform for end-to-end chemical structure image segmentation, classification, and recognition [70]. |
| MolParser | Deep Learning OCSR Model | State-of-the-art end-to-end model for recognizing chemical structures, including complex Markush structures [68] [69]. |
| MolScribe | Deep Learning OCSR Model | A high-performing OCSR tool that achieved 66.4% accuracy on the WildMol benchmark [68]. |
| SELFIES Python Package | Library | Enables conversion between SELFIES and SMILES representations, ensuring 100% molecular validity [5] [67]. |
| WildMol & MolParser-7M | Benchmark Dataset | Large-scale, publicly available datasets for training and evaluating OCSR models on real-world data [68] [69]. |
| MARCUS | Integrated Curation Platform | Web-based platform combining multiple OCSR engines and text annotation for extracting molecular data from natural product literature [71]. |
The evaluation of molecular string representations reveals a critical trade-off in OCSR performance. SMILES and its extended variants currently deliver the highest prediction accuracy on real-world benchmarks, as evidenced by MolParser's 76.9% accuracy on WildMol [68] [69]. However, SELFIES provides a crucial guarantee of molecular validity, which can significantly streamline automated data curation pipelines by eliminating syntax errors [66] [5]. The choice between representations depends on the specific application: SMILES-based systems are preferable for maximum raw accuracy where invalid outputs can be tolerated or filtered, while SELFIES is advantageous for fully automated workflows requiring robust and immediately usable results. Future developments will likely focus on hybrid approaches and enhanced representations that combine the high accuracy of SMILES with the robustness of SELFIES.
In the field of artificial intelligence (AI)-driven drug discovery and materials science, the translation of molecular structures into a computer-readable format serves as the foundational step that significantly influences the success of all subsequent modeling tasks [3]. Molecular representation bridges the gap between chemical structures and their biological, chemical, or physical properties [3]. Among the various representation methods, Simplified Molecular-Input Line-Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) have emerged as prominent string-based notations, each with distinct strengths and weaknesses in handling complex molecular features, particularly stereochemistry [1] [72]. Stereochemistry—the three-dimensional spatial arrangement of atoms—profoundly impacts a molecule's properties and biological activity [72]. For instance, the enantiomers of methadone demonstrate drastically different pharmacological effects: while one isomer provides pain relief, the other can cause severe cardiac side effects [72]. This review provides a comprehensive comparison of SMILES and SELFIES, evaluating their accuracy in representing stereochemistry and complex molecules through experimental data and benchmarking studies, offering researchers evidence-based guidance for selecting appropriate representation methods.
SMILES (Simplified Molecular-Input Line-Entry System), introduced in 1988, provides a compact string representation of molecular structures using ASCII characters [26] [3]. Atoms are represented by their elemental symbols, bonds are denoted with specific characters (-, =, # for single, double, and triple bonds respectively), and branches are indicated using parentheses [2]. Stereochemistry is encoded using specialized symbols: "@" and "@@" denote clockwise and counter-clockwise chirality at tetrahedral centers, while "/" and "\" indicate stereochemistry around double bonds [72] [2]. Despite its widespread adoption, SMILES has inherent limitations in robustness, as randomly generated or mutated SMILES strings often produce semantically invalid molecular representations [1] [5].
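The notational elements just described can be split apart mechanically. The sketch below is an illustrative atom-level tokenizer (not the APE tokenizer from the cited study), similar in spirit to the regex tokenizers commonly used in chemical language models; two-letter elements such as Cl and Br are matched before single letters so each chemical unit stays one token.

```python
import re

# Illustrative atom-level SMILES tokenizer (a simplified sketch).
# Alternation order matters: bracket atoms and two-letter elements
# must be tried before single-letter elements.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"         # bracketed atoms, e.g. [C@@H], [nH]
    r"|Br|Cl"             # two-letter organic-subset elements
    r"|[BCNOSPFIbcnosp]"  # one-letter elements (lowercase = aromatic)
    r"|[-=#/\\]"          # bond symbols, including stereo bonds / and \
    r"|[().]"             # branches and disconnected fragments
    r"|%\d{2}|\d"         # ring-closure labels
    r"|[@+]"              # stray chirality/charge marks outside brackets
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)
```

For chlorobenzene, `tokenize("Clc1ccccc1")` keeps "Cl" as a single token rather than splitting it into "C" and "l", which is exactly the kind of chemical context a character- or byte-level tokenizer can lose.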
SELFIES (Self-Referencing Embedded Strings), developed more recently, addresses SMILES' validity issues through a novel approach based on a formal grammar [1] [72]. Each symbol in a SELFIES string derives its meaning from the previous symbols, ensuring that every possible string corresponds to a valid molecular graph [72]. This robustness is achieved through "overloaded tokens" and local definitions of rings and branches, eliminating the problem of invalid structures that plagues SMILES-based generative models [72]. SELFIES maintains the ability to represent stereochemical information using conventions similar to SMILES, with specialized tokens for chiral centers and E/Z isomerism [72].
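Because a SELFIES string is a flat sequence of bracketed tokens, splitting one requires no grammar at all; only decoding does. The following stdlib-only sketch splits and checks well-formedness at the token level (the official `selfies` Python package provides `split_selfies()` plus `encoder()`/`decoder()` for genuine SMILES-to-SELFIES round-trips).

```python
import re

def split_selfies(s: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens.

    A stdlib sketch; semantic decoding to a molecule is the job of the
    SELFIES grammar (e.g. selfies.decoder in the official package).
    """
    tokens = re.findall(r"\[[^\]]*\]", s)
    # Every character must belong to some [token]; anything left over
    # means the string is not a well-formed token sequence.
    if "".join(tokens) != s:
        raise ValueError(f"not a well-formed SELFIES string: {s!r}")
    return tokens
```

For example, `split_selfies("[C][=C][Branch1][C][F]")` yields five tokens; it is the decoder, not the splitter, that guarantees every such token sequence maps to a valid molecular graph.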
Stereochemical representation remains a critical challenge for molecular string representations. Both SMILES and SELFIES natively encode key forms of stereoisomerism, including E/Z geometric diastereomers (from restricted rotation around double bonds) and R/S enantiomers/diastereomers (from chiral centers) [72]. However, important nuances exist in their implementations:
SMILES represents chirality at tetrahedral centers using "@" and "@@" tokens, and E/Z configuration with the "/" and "\" bond characters placed adjacent to the double bond (for example, F/C=C/F denotes trans-1,2-difluoroethene) [72]. While descriptive, this representation can become ambiguous in complex scenarios and is susceptible to invalidity when strings are modified [1].
SELFIES utilizes similar characters for stereochemical notation but embeds them within its robust grammatical framework [72]. This ensures that stereochemical information remains consistent and valid even after string manipulation. GroupSELFIES, an extension of SELFIES, further enhances stereochemical representation by defining chirality through unique tokens for each chiral center with specified attachment points [72].
Both representations have limitations in capturing more complex stereochemical phenomena such as axial chirality (atropisomers) and nontetrahedral isomerism, which are particularly relevant for transition metal complexes [72].
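The stereo descriptors discussed above can be located mechanically. A toy helper (illustrative only; it reports markers, it does not assign R/S or E/Z) shows how thinly the stereochemical information is spread through a SMILES string:

```python
import re

# Order matters: "@@" must be tried before "@" so the two-character
# chirality token is not split in half.
STEREO = re.compile(r"@@|@|/|\\")

def stereo_markers(smiles: str) -> list[str]:
    """Return the chirality (@, @@) and double-bond (/, \\) markers in order."""
    return STEREO.findall(smiles)
```

For one enantiomer of alanine, `stereo_markers("C[C@@H](N)C(=O)O")` returns a single "@@"; deleting or flipping that one token silently yields the other enantiomer or an unspecified center, which is why string manipulation so easily corrupts stereochemistry.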
Table 1: Comparison of SMILES and SELFIES Representation Capabilities
| Feature | SMILES | SELFIES |
|---|---|---|
| Representation Basis | Line notation with ASCII characters [26] | Grammar-based with self-referencing tokens [72] |
| Guaranteed Validity | No [1] | Yes [72] |
| Stereochemistry Encoding | "@", "@@", "/", "\" symbols [72] | Similar symbols within robust grammar [72] |
| Human Readability | Moderate [3] | Lower [72] |
| Generative Model Performance | Higher invalid generation rates [1] | Higher validity rates in generative tasks [72] |
| Handling of Complex Molecules | Struggles with organometallics, complex biologics [1] | More robust but still limited for complex stereochemistry [72] |
Rigorous benchmarking studies provide empirical evidence for comparing SMILES and SELFIES across various computational chemistry tasks. Performance varies significantly based on the specific application, tokenization methods, and model architectures employed.
Table 2: Performance Comparison of SMILES vs. SELFIES Across Different Tasks
| Task Type | Dataset | Metric | SMILES Performance | SELFIES Performance | Notes |
|---|---|---|---|---|---|
| Biophysics/Physiology Classification [1] | HIV, Toxicology, BBB Penetration | ROC-AUC | Superior with APE tokenization [1] | Inferior to SMILES+APE [1] | BPE tokenization underperformed for both |
| Generative Model Validity [72] | Custom benchmark | % Valid Molecules | Lower validity rates [72] | Near-perfect validity [72] | Particularly evident in mutation operations |
| Molecular Property Prediction [2] | SIDER | ROC-AUC | Baseline performance [2] | +5.97% improvement with augmentation [2] | Classical LSTM models |
| Quantum-Classical Hybrid Models [2] | SIDER | ROC-AUC | Baseline performance [2] | +5.91% improvement with augmentation [2] | QK-LSTM models |
| Stereochemistry-Aware Generation [72] | Circular Dichroism | Fitness Score | Competitive but lower validity [72] | Equal or superior performance [72] | Dependent on task sensitivity to stereochemistry |
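ROC-AUC, the metric used throughout the classification rows above, equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python implementation of that pairwise definition is shown below (a sketch with O(n_pos x n_neg) cost; production code would use an optimized routine such as scikit-learn's roc_auc_score):

```python
def roc_auc(labels, scores):
    """ROC-AUC as P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating model scores 1.0 and a random one 0.5, which is the scale against which differences such as 0.805 vs. 0.824 in the tables should be read.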
A critical study comparing tokenization methods revealed that the novel Atom Pair Encoding (APE) tokenizer, particularly when combined with SMILES, significantly outperformed traditional Byte Pair Encoding (BPE) with both SMILES and SELFIES representations in biophysics and physiology classification tasks [1] [26]. This advantage was consistent across three benchmark datasets: HIV, toxicology, and blood-brain barrier penetration, with evaluation based on ROC-AUC metrics [1].
For generative tasks, SELFIES demonstrates a clear advantage in validity. Experiments show that SELFIES consistently produces valid molecules with random mutations, whereas SMILES often generates invalid strings under similar conditions [72]. The latent space of SELFIES-based variational autoencoders is denser by two orders of magnitude, enabling more comprehensive exploration of chemical space during optimization procedures [1].
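The fragility of SMILES under mutation is easy to demonstrate even with a deliberately weak checker. The sketch below tests only that branch parentheses balance, which is far weaker than full chemical validity (valence, ring closures, and aromaticity require a real parser such as RDKit's MolFromSmiles), yet single-character deletions already break it; by contrast, any sequence of SELFIES tokens decodes to some valid molecule by construction.

```python
def parens_balanced(s: str) -> bool:
    """Toy syntactic check: branch parentheses must balance.

    Real SMILES validity is much stricter; this only illustrates how
    little perturbation is needed to break even the surface syntax.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

smiles = "CC(=O)O"  # acetic acid
# Enumerate all single-character deletions.
broken = [smiles[:i] + smiles[i + 1:] for i in range(len(smiles))]
invalid = [s for s in broken if not parens_balanced(s)]
# Deleting either parenthesis fails even this weak check; several of the
# "survivors" would still be rejected by a real chemical parser.
```

Here 2 of the 7 deletions fail the parenthesis check alone, and the true invalid fraction under a full parser would be higher still.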
The incorporation of stereochemical information presents both challenges and opportunities for molecular representations. Research comparing stereochemistry-aware and stereochemistry-unaware string-based generative models indicates that the value of explicit stereochemical encoding depends on how sensitive the target property is to stereochemistry, and that the enlarged stereochemistry-aware search space carries additional computational cost [72].
Experimental Workflow for Molecular Representation Benchmarking
Successful implementation of molecular representation studies requires specific computational tools and benchmark resources. The following table details essential "research reagents" for conducting rigorous comparisons between representation methods.
Table 3: Essential Research Reagents for Molecular Representation Studies
| Resource Name | Type | Primary Function | Relevance to Representation Comparison |
|---|---|---|---|
| MoleculeNet [73] | Benchmark Suite | Standardized evaluation for molecular ML | Provides curated datasets (HIV, Tox21, etc.) and metrics for fair comparison |
| ZINC15 [72] | Molecular Database | Source of drug-like molecules | Supplies stereochemically defined compounds for training and testing |
| RDKit [72] | Cheminformatics Toolkit | Molecule manipulation and descriptor calculation | Handles conversion between representations and stereochemical assignment |
| DeepChem [73] | Deep Learning Library | Implementation of molecular ML models | Provides built-in support for both SMILES and SELFIES processing |
| Transformers Library [1] | NLP Framework | Implementation of BERT and other architectures | Enables language model approaches to molecular representation |
| APE Tokenizer [1] | Specialized Algorithm | Atom-aware tokenization for chemical strings | Enhances SMILES performance by preserving chemical context |
The comparative analysis of SMILES and SELFIES reveals a nuanced landscape where neither representation universally dominates across all applications. The choice between them should be guided by specific research priorities and task requirements:
For generative molecular design tasks, particularly those employing variational autoencoders, reinforcement learning, or genetic algorithms, SELFIES is generally preferred due to its guaranteed validity and more explorable latent space [72]. Its robustness against invalid structure generation significantly accelerates the discovery process.
For classification and regression tasks involving established molecular datasets, SMILES combined with advanced tokenization methods like APE may yield superior performance [1] [26]. The preservation of contextual relationships between chemical elements enhances predictive accuracy in these applications.
For stereochemistry-sensitive applications, the decision is more complex. While both representations can encode stereochemical information, SELFIES provides more reliable conservation of this information during string manipulation and generation [72]. However, the increased complexity of stereochemistry-aware search spaces may not always justify the computational cost for tasks where stereochemistry is secondary [72].
Emerging approaches including GroupSELFIES [72], augmented SELFIES [2], and multimodal representations [3] point toward future developments where hybrid approaches or entirely new paradigms may overcome current limitations.
Researchers should carefully consider their specific objectives, validity requirements, and stereochemical emphasis when selecting between SMILES and SELFIES representations. As molecular machine learning continues to evolve, the optimal choice may increasingly depend on the integration of representation with specialized tokenization methods and model architectures tailored to specific chemical intelligence tasks.
The selection of an appropriate molecular representation is a foundational decision in AI-assisted drug discovery and materials science [17]. The accuracy of downstream tasks, from property prediction to de novo molecular generation, is intrinsically linked to how a molecule is represented for a computer model [1]. This guide provides a comparative evaluation of the dominant molecular string representations—SMILES, SELFIES, SMARTS, and IUPAC—framed within the broader thesis that no single representation is universally superior; rather, the optimal choice is dictated by the specific use case and the model employed [17] [1]. We summarize critical performance data and detail experimental methodologies to equip researchers with the information needed to make informed decisions.
The performance of molecular representations varies significantly across different tasks and model architectures. The following tables synthesize quantitative findings from recent comparative studies.
Table 1: Performance Summary in Molecular Generation Tasks (Diffusion Models)
| Representation | Validity Rate | Novelty & Diversity | QED Score | SA Score | QEPPI Score |
|---|---|---|---|---|---|
| SMILES | Moderate [66] | Substantial differences [17] | Moderate [17] | Best [17] | Best [17] |
| SELFIES | 100% [74] [16] | High similarity to SMARTS [17] | Best [17] | Good [17] | Good [17] |
| SMARTS | High [17] | High similarity to SELFIES [17] | Best [17] | Good [17] | Good [17] |
| IUPAC | High [17] | Best [17] | Moderate [17] | Moderate [17] | Moderate [17] |
Table 2: Performance Summary in Predictive Modeling Tasks
| Representation | Tokenization | HIV (ROC-AUC) | Toxicity (ROC-AUC) | BBB Penetration (ROC-AUC) | SIDER (ROC-AUC) |
|---|---|---|---|---|---|
| SMILES | BPE [1] | 0.805 [1] | 0.855 [1] | 0.954 [1] | - |
| SMILES | APE [1] | 0.824 [1] | 0.881 [1] | 0.965 [1] | - |
| SELFIES | BPE [1] | 0.781 [1] | 0.842 [1] | 0.943 [1] | - |
| SELFIES | APE [1] | 0.794 [1] | 0.861 [1] | 0.951 [1] | - |
| SELFIES (Augmented) | QK-LSTM [2] | - | - | - | 0.759 [2] |
| SMILES (Augmented) | QK-LSTM [2] | - | - | - | 0.717 [2] |
Table 3: Suitability for Different Use Cases
| Use Case | Recommended Representation | Key Rationale |
|---|---|---|
| Deep Generative Models (e.g., VAEs, GAs) | SELFIES | 100% robustness ensures all generated strings are valid molecules, simplifying model architecture [74] [16]. |
| High-Accuracy Property Prediction | SMILES (with APE tokenization) | Demonstrates superior performance in classification tasks like toxicology and BBB penetration [1]. |
| Optical Chemical Structure Recognition (OCSR) | SMILES | Showed the best overall performance in translating chemical images to structures [66]. |
| Exploring Novel Chemical Space | IUPAC | Excels in generating novel and diverse molecules in diffusion models [17]. |
| Reaction Prediction & Rule-based | SMARTS | Designed for representing molecular patterns and reaction rules [17]. |
To critically assess the comparative data, an understanding of the underlying experimental methodologies is essential.
A 2025 study provided a direct comparison of SMILES, SELFIES, SMARTS, and IUPAC representations using a consistent diffusion model framework [17].
This research investigated how tokenization methods affect chemical language model performance on benchmark classification tasks [1] [26].
This study compared string representations for the specific task of translating 2D chemical images into computer-readable structures [66].
The following diagram illustrates the standard workflow for empirically comparing different molecular representations, as employed in the cited studies.
This diagram contrasts the two primary tokenization methods, BPE and APE, used to process molecular strings for transformer-based models.
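The BPE side of that contrast can be sketched in a few lines: BPE greedily merges the most frequent adjacent symbol pair in a corpus, with no notion of atoms, so the merges it learns need not align with chemical units. The toy single-merge step below (an illustrative sketch, not a full BPE trainer) shows that after merging the dominant pair, the "l" of chlorine can still be stranded as its own symbol.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all tokenized strings."""
    pairs = Counter()
    for symbols in corpus:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# BPE starts from single characters, so 'C' and 'l' in chlorine begin
# as separate symbols in this toy corpus.
corpus = [list("CCO"), list("CCN"), list("CCCl")]
pair, count = most_frequent_pair(corpus)
```

On this corpus the first learned merge is ("C", "C"), not ("C", "l"): `merge_pair(list("CCCl"), pair)` gives `["CC", "C", "l"]`, leaving a chemically meaningless "l" token, whereas an atom-aware tokenizer like APE would never split the chlorine atom in the first place.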
Table 4: Essential Tools for Molecular Representation Research
| Tool / Resource | Function | Relevance in Research |
|---|---|---|
| RDKit | An open-source cheminformatics toolkit [66]. | Used for molecule manipulation, visualization, descriptor calculation, and converting between representation formats. Critical for data preprocessing and validation. |
| SELFIES (Python Package) | A library for encoding and decoding SELFIES strings [74] [16]. | Installed via pip install selfies. Essential for any research workflow involving the generation or analysis of SELFIES representations. |
| MoleculeNet Benchmark | A standardized benchmark suite for molecular machine learning [1] [2]. | Provides curated datasets (e.g., HIV, Tox21, SIDER) and evaluation protocols, ensuring fair and consistent comparison of model performance across studies. |
| Transformer Models (e.g., BERT) | Neural network architectures for sequence processing [1]. | The standard model architecture for evaluating representation and tokenization schemes in predictive classification tasks. |
| Diffusion Models | A class of generative models [17]. | Increasingly the state-of-the-art for evaluating molecular representation in generative tasks like de novo molecule design. |
| Atom Pair Encoding (APE) | A specialized tokenization method for chemical strings [1]. | A key reagent for enhancing model performance by ensuring tokens map to chemically meaningful units like atoms and bonds in context. |
The evaluation of SMILES and SELFIES reveals a nuanced landscape where the optimal molecular representation is highly task-dependent. SMILES often demonstrates strong performance in classification and prediction tasks, particularly when paired with advanced tokenizers like Atom Pair Encoding (APE). In contrast, SELFIES provides critical robustness for generative models by guaranteeing molecular validity, though it may produce longer token sequences. Emerging fragment-based representations like fragSMILES show great promise, especially in handling chirality and synthetic planning. Future directions point toward hybrid models that combine the strengths of string-based and graph-based representations, alongside increased focus on data quality and model interpretability. These advancements will significantly accelerate AI-driven drug discovery by enabling more accurate, efficient, and reliable exploration of chemical space.