This article provides a critical evaluation of the accuracy of SMILES and SELFIES molecular representations for AI-driven drug discovery. It covers the foundational principles of these string-based formats, explores their methodological application in machine learning models like transformers, and addresses key troubleshooting and optimization strategies to overcome inherent limitations. A central focus is the comparative validation of their performance in real-world tasks such as property prediction and synthesis planning, synthesizing recent research to offer practical guidance for researchers and scientists on selecting and implementing the most effective representation for their specific objectives.
In the field of AI-driven drug discovery and materials science, representing complex molecular structures in a format that computers can understand is a foundational challenge. Molecular string representations serve as this crucial bridge, translating the intricate topology of atoms and bonds into linear sequences that machine learning models, particularly language models, can process. The two predominant languages in this domain are the Simplified Molecular Input Line Entry System (SMILES) and the more recent Self-Referencing Embedded Strings (SELFIES). The choice between them significantly impacts the performance, robustness, and applicability of AI models in molecular property prediction, virtual screening, and de novo molecular design. This guide provides a comparative analysis of SMILES and SELFIES, grounded in recent experimental data, to inform researchers and developers in selecting the optimal representation for their specific applications.
Introduced in 1988, SMILES is a widely adopted notation that encodes a molecular graph into a linear string of ASCII characters [1]. It uses atomic symbols (e.g., C, O, N), bond symbols (e.g., -, =, # for single, double, and triple bonds), and parentheses to represent branches. Ring structures are indicated by matching numbers placed after the involved atoms [2]. For example, benzene is represented as c1ccccc1 [2]. Its key advantage is simplicity and human-readability, making it a staple in chemical databases and a natural fit for early NLP-based AI approaches [3].
However, SMILES has critical limitations. A single molecule can have multiple valid SMILES strings depending on the starting atom and traversal order, leading to ambiguity [1] [4]. More critically, its grammar does not inherently enforce chemical validity. When used in generative AI models, this often results in a high percentage of syntactically or semantically invalid strings that represent impossible molecules, hampering automated design workflows [1] [5].
SELFIES was developed specifically to address the robustness issues of SMILES. Its fundamental innovation is a formal grammar that guarantees 100% syntactic and semantic validity [2] [5]. Every possible SELFIES string corresponds to a valid molecule. It achieves this by representing complex spatial features like rings and branches with single, dedicated symbols (e.g., [Ring1], [Branch1]), with the length of the feature explicitly encoded [2]. This makes SELFIES particularly powerful for generative tasks, as random mutations or model outputs always produce viable molecules, leading to a denser and more explorable chemical latent space [1].
The theoretical advantages of SELFIES translate into measurable differences in model performance across various tasks. The following tables consolidate key experimental findings from recent literature.
Table 1: Performance Comparison on MoleculeNet Benchmark Tasks (ROC-AUC)
| Model / Representation | HIV | Tox21 | SIDER | BBBP | Notes | Source |
|---|---|---|---|---|---|---|
| SMILES (ChemBERTa-2) | 0.780 | 0.839 | 0.835 | Info Missing | Reported as competitive | [6] |
| SELFIES (SELFormer) | Info Missing | Info Missing | 0.810 | 0.950 | Outperformed GNNs & SMILES | [7] |
| SELFIES (Augmented) | 0.844 (Avg.) | Info Missing | 0.844 (Avg.) | Info Missing | 5.97% improvement over SMILES in classical models | [2] |
| SMILES (Domain-Adapted to SELFIES) | Info Missing | Info Missing | Info Missing | Info Missing | Matched/exceeded SMILES baseline | [7] |
Table 2: Performance on Quantum Chemistry and Physicochemical Property Prediction
| Model / Representation | ESOL (RMSE↓) | FreeSolv (RMSE↓) | Lipophilicity (RMSE↓) | QM9 (MAE↓) | Notes | Source |
|---|---|---|---|---|---|---|
| SMILES (ChemBERTa-77M-MLM) | ~1.10 (est.) | ~2.70 (est.) | ~0.80 (est.) | Baseline | Trained on 77M molecules | [7] |
| SELFIES (SELFormer) | 0.53 | Info Missing | Info Missing | Info Missing | >15% improvement over GEM GNN | [7] |
| SELFIES (Domain Adapted) | 0.944 | 2.511 | 0.746 | Competitive | Adapted with only 700K molecules | [7] |
Key Insights from Data:
A critical differentiator in modern chemical language models is the tokenization philosophy.
A significant challenge in evaluating chemical language models (ChemLMs) is determining whether they have learned underlying chemical principles or are merely memorizing textual patterns. The Augmented Molecular Retrieval (AMORE) framework was developed to address this [4].
Hypothesis: A robust ChemLM should recognize that different, valid SMILES or SELFIES strings representing the same molecule are semantically equivalent. Its internal embeddings for these synonymous strings should be very similar.
Methodology:
Experiments using AMORE have revealed that many state-of-the-art ChemLMs are not robust to different SMILES representations, highlighting a key area for future improvement [4]. The following diagram illustrates the AMORE workflow.
Training a transformer model on SELFIES from scratch is computationally expensive. A resource-efficient alternative is Domain-Adaptive Pre-Training (DAPT). A recent study demonstrated that a SMILES-pretrained model (ChemBERTa) can be successfully adapted to SELFIES without changing its tokenizer or architecture [7].
Experimental Protocol:
Results: The domain-adapted model matched or exceeded the performance of the original SMILES-based model and even the larger ChemBERTa-77M-MLM on most targets, despite a 100x smaller pre-training corpus. This provides a practical, low-cost pathway for researchers to leverage SELFIES [7]. The workflow is summarized below.
Table 3: Key Software and Datasets for Molecular Representation Research
| Tool / Resource | Type | Primary Function | Relevance to SMILES/SELFIES |
|---|---|---|---|
| RDKit | Software Library | Cheminformatics and molecule manipulation | Industry-standard for converting between molecular formats (e.g., SMILES to 2D image), calculating descriptors, and processing SELFIES [8]. |
| SELFIES Python Library | Software Library | SELFIES encoding/decoding | Essential for converting SMILES to SELFIES and vice-versa, enabling their use in machine learning pipelines [7]. |
| MoleculeNet | Benchmark Dataset | Curated collection for molecular ML | Standard benchmark for fair comparison of model performance across tasks like HIV, Tox21, ESOL, etc. [2] [7] |
| PubChem | Chemical Database | Public repository of chemical molecules | Primary source for large-scale, diverse molecular structures for pre-training chemical language models [8] [7]. |
| ZINC | Chemical Database | Commercial compounds for virtual screening | Another major source of millions of purchasable compounds, often used for pre-training models like ChemBERTa [4]. |
| Hugging Face Transformers | Software Library | NLP model architecture & pre-trained models | Provides the core transformer implementation (BERT, RoBERTa) and framework for building models like ChemBERTa and MolBERT [1] [6]. |
The choice between SMILES and SELFIES is not a simple verdict of one being universally superior. Instead, it is a strategic decision based on the specific application, available resources, and desired model properties.
The future of molecular representations lies in hybrid and adaptive approaches. As demonstrated by domain adaptation, it is possible to leverage the vast existing investment in SMILES-based models while gaining the benefits of SELFIES. Emerging tokenization methods like APE show that customizing the entire NLP pipeline for chemical language can yield significant performance gains. As AI continues to reshape drug discovery and materials science, the evolution of molecular representations will undoubtedly remain a vibrant and critical area of research.
The Simplified Molecular Input Line Entry System (SMILES) is a cornerstone of computational chemistry, providing a compact, human-readable ASCII string for representing molecular structures [9]. Developed in the 1980s by David Weininger, SMILES functions as a linearized, serialized representation of molecular graphs, encoding atoms, bonds, branching, ring closures, and stereochemistry in a one-dimensional format [10] [9]. This representation has become indispensable for chemical databases, machine learning applications, and drug discovery pipelines, enabling efficient storage, retrieval, and computational analysis of chemical information [1] [11] [9].
Despite its widespread adoption, SMILES exhibits inherent limitations that pose significant challenges for computational applications. The notation permits multiple valid string representations for the same molecule, creating redundancy that can impair machine learning model performance [12]. Furthermore, SMILES strings can be syntactically valid yet chemically impossible, leading to semantic errors that complicate generative modeling [1] [13]. These shortcomings have motivated the development of alternative representations including SELFIES, DeepSMILES, and t-SMILES, each designed to address specific limitations while maintaining the sequence-based paradigm favorable for natural language processing techniques [1] [14].
This guide provides a comprehensive technical analysis of SMILES syntax, encoding rules, and common pitfalls, framed within an empirical comparison of contemporary molecular representations. By examining experimental data across key performance metrics including syntactic validity, semantic correctness, distribution learning, and goal-directed optimization, we establish a rigorous framework for evaluating representation efficacy in cheminformatics applications.
SMILES encodes molecular structures through a systematic grammar that translates molecular graphs into linear strings. The basic syntax elements include:
Atomic Representation: Standard atomic symbols (C, N, O, P, S, F, Cl, Br, I) constitute the "organic subset" and can be written without brackets, with implied hydrogen atoms added to satisfy valence requirements [9]. Elements outside this subset must be enclosed in square brackets (e.g., [Na+] for sodium cation), as must atoms with specified isotopes, charges, or unusual valence (e.g., [13C], [Fe+2]) [11] [9].
Bond Notation: Bonds are represented with specific symbols: single bonds as - (often omitted), double bonds as =, triple bonds as #, and aromatic bonds as : (typically implied in aromatic systems using lowercase atomic symbols) [9]. For example, ethene is written as C=C, while ethyne is C#C.
Branching Representation: Branched structures are denoted using parentheses, with the branch attached to the atom immediately preceding the parentheses [11]. For example, isopropanol is represented as CC(O)C, where the hydroxyl group is branched from the central carbon. Multiple or nested branches are permitted, such as CC(O)(Cl)C for a central carbon bearing both hydroxy and chloro substituents [10].
Ring Closures: Cyclic structures are formed by breaking one bond in the ring and assigning matching numerical labels to the atoms involved in the ring closure [9]. For example, cyclohexane is represented as C1CCCCC1, where the terminal '1' indicates closure back to the first carbon. When more than nine ring-closure labels are needed simultaneously, two-digit labels preceded by % are used (e.g., %10) [11].
Aromaticity Indication: Aromatic systems are represented using lowercase atomic symbols (e.g., c, n, o), with aromatic bonds typically implied rather than explicitly stated [9]. Benzene is canonically represented as c1ccccc1, where the alternating single and double bonds of Kekulé structures are replaced by delocalized aromatic bonds.
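The syntax elements above can be recognized with a small regular-expression tokenizer. The sketch below covers only the constructs discussed here (bracket atoms, the two-letter halogens, one-letter organic-subset atoms, bonds, branches, dots, and ring-closure labels); a production tokenizer such as RDKit's handles many more cases.

```python
import re

# Minimal SMILES tokenizer (sketch): bracket atoms first, then the
# two-letter atoms Cl/Br, then one-letter atoms, bonds/branches, and
# ring-closure labels (single digit or %NN).
TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [Na+], [13C]
    r"|Cl|Br"                # two-letter organic-subset atoms
    r"|[BCNOPSFI]|[bcnops]"  # one-letter atoms (lowercase = aromatic)
    r"|[-=#:/\\.()]"         # bonds, branches, dot disconnection
    r"|%\d{2}|\d"            # ring-closure labels
)

def tokenize(smiles):
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Note the ordering of the alternatives: Cl and Br must be tried before the one-letter atoms, otherwise Cl would tokenize as a carbon followed by an unrecognized character.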
SMILES supports precise specification of molecular stereochemistry through specialized notation:
Tetrahedral Chirality: Chiral centers are indicated using @ and @@ symbols to specify anticlockwise and clockwise ordering of substituents, respectively [9]. For example, the two enantiomers of alanine are distinguished as N[C@H](C)C(=O)O and N[C@@H](C)C(=O)O.
Double Bond Stereochemistry: The configuration around double bonds is specified using / and \ symbols to indicate directional bonds [9]. For example, C/C=C/C represents the E-isomer (trans configuration) of 2-butene, while C/C=C\C represents the Z-isomer (cis configuration).
Advanced Stereocenters: SMILES supports more complex stereochemistry through specialized notation for allene-type (@AL1, @AL2), square-planar (@SP1-3), trigonal bipyramidal (@TB1-20), and octahedral (@OH1-30) stereocenters [9].
Despite its systematic grammar, SMILES presents several common pitfalls that can lead to invalid representations:
Ambiguous Ring Labels: Opening a ring-closure digit that is never matched creates an ambiguous, unresolvable connection, as in CC1CC1C1, where the final '1' opens a ring bond that is never closed [11].
Unmatched Parentheses: Failure to properly close branches with parentheses results in invalid syntax, such as C(C(C which lacks necessary closing parentheses [11].
Invalid Bonding Patterns: Strings may specify chemically impossible bonding, such as CO=CC where the oxygen atom would need to form three bonds—exceeding the typical valence for neutral oxygen [14].
Aromaticity Mismatches: Incorrect specification of aromatic systems can lead to valence violations, particularly when mixing explicit bond symbols with lowercase aromatic atoms [10].
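All but the chemical pitfalls above can be caught by a lightweight syntactic pre-filter. The sketch below checks only parenthesis balance and ring-label pairing (%NN labels are ignored); it deliberately accepts chemically impossible strings such as CO=CC, since valence checking requires a real cheminformatics toolkit.

```python
# Syntactic pre-filter (sketch): parenthesis balance + ring-label pairing.
# Content inside bracket atoms is skipped so that digits in [13C] or
# [Fe+2] are not mistaken for ring-closure labels. No chemistry checked.
def syntax_ok(smiles):
    depth, in_bracket = 0, False
    open_rings = set()
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:          # ')' with no matching '('
                return False
            depth -= 1
        elif ch.isdigit():
            open_rings ^= {ch}      # first sighting opens a ring bond, second closes it
    return depth == 0 and not in_bracket and not open_rings

assert syntax_ok("C1CCCCC1")      # cyclohexane
assert not syntax_ok("C(C(C")     # unmatched parentheses
assert not syntax_ok("CC1CC1C1")  # dangling ring label
assert syntax_ok("CO=CC")         # syntactically fine, chemically impossible
```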
The following diagram illustrates the SMILES generation process and its relationship to common pitfalls:
Figure 1: SMILES Generation Workflow and Common Pitfalls. The process transforms a molecular structure into a valid SMILES string through sequential steps, with potential failure points leading to common syntax errors.
To objectively compare molecular representations, researchers employ standardized evaluation metrics and experimental protocols:
Syntactic Validity: Percentage of generated strings that conform to the representation's grammatical rules, regardless of chemical feasibility [14] [13]. For SMILES, this includes proper parentheses matching and ring closure numbering.
Semantic Validity: Percentage of syntactically valid strings that correspond to chemically feasible molecules with proper valences and bonding patterns [14]. This metric specifically addresses chemical plausibility beyond mere syntax.
Novelty: Percentage of generated molecules not present in the training dataset, measuring the model's ability to explore new chemical space rather than memorizing training examples [13].
Uniqueness: Percentage of distinct (non-duplicate) molecules within generated outputs, indicating diversity of sampling [14] [13].
Fréchet ChemNet Distance (FCD): Measures similarity between the distributions of generated molecules and training set molecules in a learned chemical space, with lower values indicating better distribution learning [13].
Goal-Directed Performance: Success rate in generating molecules satisfying specific property objectives, typically evaluated in benchmark tasks like penalized logP optimization or drug-likeness (QED) improvement [14].
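The first four metrics above compose naturally from set operations. The sketch below computes validity, uniqueness, and novelty for a batch of generated strings; `is_valid` is a hypothetical stand-in for a real checker (e.g., RDKit parsing), and the example strings are toy data. FCD is omitted, as it requires a trained ChemNet model.

```python
# Distribution-learning metrics (sketch). `is_valid` is injected so the
# same code works with any representation's validity checker.
def evaluate(generated, training_set, is_valid):
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}

stats = evaluate(
    generated=["CCO", "CCO", "C1CC1", "C(("],        # toy batch
    training_set={"CCO"},                            # toy training set
    is_valid=lambda s: s.count("(") == s.count(")"), # toy checker
)
# validity 3/4, uniqueness 2/3, novelty 1/2 on this toy batch
```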
Comparative studies typically utilize established chemical databases to ensure reproducible evaluation:
ChEMBL: A manually curated database of bioactive molecules with drug-like properties, containing approximately 2 million compounds [14] [13].
ZINC: A commercially available database of over 230 million purchasable compounds for virtual screening, frequently used for pre-training molecular generative models [14].
GDB-13: An enumerated database of nearly 1 billion small organic molecules containing up to 13 atoms of C, N, O, S, and Cl, following simple chemical stability and synthetic feasibility rules [13].
QM9: A comprehensive dataset of 134k stable small organic molecules with up to 9 heavy atoms (C, O, N, F) optimized at DFT level, providing quantum chemical properties [14].
Standardized experimental protocols enable fair comparison across molecular representations:
Data Preprocessing: All datasets are standardized by removing duplicates, invalid structures, and inorganic compounds. For SMILES, canonicalization may be applied to ensure consistent representation [11].
Model Architecture Consistency: Identical neural network architectures (e.g., LSTM, Transformer) are used across representations, with only the tokenization and embedding layers modified to accommodate different syntax [1] [13].
Training-Testing Splits: Standardized data splits (typically 80-10-10 for training-validation-testing) are used with random sampling to ensure comparable evaluation conditions [14].
Hyperparameter Optimization: Grid or random search is performed for each representation to identify optimal training parameters, acknowledging that different representations may require distinct hyperparameter configurations [1].
Statistical Significance Testing: Multiple training runs with different random seeds are conducted, with performance metrics reported as mean ± standard deviation to account for training stochasticity [13].
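A reproducible 80-10-10 split as described above takes only a few lines; seeding the shuffle ensures that repeated runs, and every representation under comparison, see identical partitions. The function name and default fractions here are illustrative.

```python
import random

def split_dataset(molecules, seed=0, frac=(0.8, 0.1, 0.1)):
    """Shuffle deterministically, then cut into train/validation/test."""
    mols = list(molecules)
    random.Random(seed).shuffle(mols)  # seeded: reproducible across runs
    n_train = int(frac[0] * len(mols))
    n_valid = int(frac[1] * len(mols))
    return (mols[:n_train],
            mols[n_train:n_train + n_valid],
            mols[n_train + n_valid:])

train, valid, test = split_dataset([f"mol_{i}" for i in range(1000)])
```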
The following table summarizes key experimental parameters used in comparative studies of molecular representations:
Table 1: Standard Experimental Parameters for Molecular Representation Comparison
| Experimental Parameter | Typical Configuration | Variants | Purpose |
|---|---|---|---|
| Model Architecture | LSTM with 3 layers, 512 hidden units | Transformer, GRU, VAE | Ensure architecture consistency across representations |
| Training Dataset Size | 50k-250k molecules | 1M+ for pre-training | Balance computational cost and statistical power |
| Tokenization Method | Byte Pair Encoding (BPE) | Atom Pair Encoding (APE), Regex | Optimize sequence segmentation for each representation |
| Evaluation Metrics | Validity, Novelty, Uniqueness, FCD | Scaffold Similarity, Property Statistics | Comprehensive performance assessment |
| Goal-Directed Tasks | Penalized logP, QED, DRD2 | Multi-property optimization | Measure utility for practical molecular design |
The limitations of standard SMILES have motivated diverse approaches to improve molecular representation:
Canonical SMILES: Implements a deterministic algorithm to generate a unique SMILES string for each molecule, ensuring consistent representation across databases and software platforms [9]. While eliminating redundancy, canonicalization imposes an arbitrary traversal order that may not reflect chemical similarity.
DeepSMILES: Simplifies SMILES syntax by using postfixed ring closures and branch endings to reduce parentheses nesting, addressing common grammatical errors in generated strings [12] [14]. However, it still permits semantically invalid structures with incorrect atom valences [14].
SELFIES (Self-Referencing Embedded Strings): A robust representation where every possible string corresponds to a valid molecular graph through formal grammar that enforces chemical constraints [1] [13]. SELFIES achieves nearly 100% semantic validity by design but may produce less compact representations [13].
t-SMILES (tree-based SMILES): A fragment-based representation that describes molecules using SMILES-type strings obtained through breadth-first traversal of full binary trees derived from fragmented molecular graphs [14]. This approach introduces only two additional symbols (& and ^) while enabling multi-scale molecular description.
Recent systematic evaluations provide empirical data on the comparative performance of molecular representations:
Table 2: Performance Comparison of Molecular Representations on Standard Benchmarks
| Representation | Syntactic Validity (%) | Semantic Validity (%) | Novelty (%) | Uniqueness (%) | FCD (↓) | Goal-Directed Performance |
|---|---|---|---|---|---|---|
| SMILES | 96.8 ± 2.1 | 90.2 ± 3.5 | 99.3 ± 0.5 | 94.7 ± 2.3 | 1.24 ± 0.31 | Baseline |
| DeepSMILES | 98.5 ± 1.3 | 91.7 ± 2.8 | 98.9 ± 0.7 | 95.2 ± 1.9 | 1.31 ± 0.28 | -5% to +3% vs SMILES |
| SELFIES | 100 ± 0.0 | 100 ± 0.0 | 99.1 ± 0.6 | 93.8 ± 2.5 | 1.89 ± 0.42 | -12% to -27% vs SMILES |
| t-SMILES | 99.3 ± 0.7 | 98.5 ± 1.2 | 98.5 ± 0.9 | 97.4 ± 1.4 | 0.87 ± 0.19 | +15% to +38% vs SMILES |
Data compiled from comparative studies [14] [13] using ChEMBL and ZINC datasets evaluated under identical model architectures (LSTM with 512 hidden units) and training regimes.
The empirical data reveals fundamental trade-offs in molecular representation design:
Validity vs. Exploration: Representations guaranteeing 100% validity (SELFIES) demonstrate impaired distribution learning, as measured by higher FCD values, suggesting that the ability to generate invalid structures provides a self-corrective mechanism that filters low-likelihood samples [13]. Invalid SMILES are sampled with significantly lower likelihoods than valid ones, effectively serving as a built-in quality filter.
Fragmentation vs. Expressivity: Fragment-based approaches like t-SMILES reduce the search space and improve performance in goal-directed tasks but require predefined fragmentation rules that may introduce biases [14]. The optimal fragmentation strategy varies across chemical domains and optimization objectives.
Syntax Complexity vs. Learning Efficiency: Simplified grammars (DeepSMILES) reduce syntactic errors but may obscure meaningful chemical patterns, potentially explaining their inconsistent performance across tasks [14].
The following diagram illustrates the conceptual trade-offs between major molecular representations:
Figure 2: Trade-offs in Molecular Representation Design. Different representations balance chemical constraints, exploration capacity, and performance objectives differently, leading to distinct strength and weakness profiles.
SMILES alignment guidance addresses the non-uniqueness problem through techniques that standardize or leverage multiple SMILES representations:
SMILES Enumeration: Generating multiple non-canonical SMILES strings for each molecule through randomized atom ordering, effectively augmenting training datasets and improving model robustness [15]. Studies demonstrate this approach can yield 130-fold dataset expansion with corresponding improvements in property prediction accuracy (R² increase from 0.56 to 0.66) [15].
Root-Aligned SMILES (R-SMILES): Designating a common root atom for reactant and product SMILES in reaction prediction tasks, reducing edit distance by over 50% and concentrating model attention on the actual reaction center [15].
Latent Space Alignment: Encoding multiple SMILES representations per molecule using parallel RNNs followed by atom-level pooling to create nearly bijective latent representations, enabling more effective property optimization [15].
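The enumeration technique in the first bullet can be sketched with RDKit (an assumption: RDKit is installed; doRandom is an option of Chem.MolToSmiles that starts the graph traversal at a random atom). Every enumerated variant canonicalizes back to the same string, which is what makes this augmentation label-preserving.

```python
# SMILES enumeration sketch (requires RDKit; guarded so the snippet
# degrades gracefully when the library is absent).
try:
    from rdkit import Chem

    def enumerate_smiles(smiles, n=10):
        mol = Chem.MolFromSmiles(smiles)
        # doRandom=True randomizes the traversal start, yielding
        # different but equivalent strings for the same molecule
        variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)}
        # all variants canonicalize back to one string
        canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(v)) for v in variants}
        return variants, canonical
except ImportError:
    enumerate_smiles = None  # RDKit not installed
```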
SMILES Arbitrary Target Specification (SMARTS) extends SMILES for substructural pattern matching rather than complete molecular representation [10]. Key features include:
Atomic Primitives: Enhanced atomic specification using symbols like [C] for aliphatic carbon, [#6] for any carbon by atomic number, and [c] for aromatic carbon [10].
Logical Operators: Boolean logic for atomic properties, including ! (negation), & (conjunction), and , (disjunction) to build complex queries [10].
Recursive SMARTS: Self-referential patterns using [$(*C)] to identify atoms connected to specific substructures, enabling sophisticated chemical environment queries [10].
TokenSMILES introduces a grammatical framework that standardizes SMILES into structured sentences composed of context-free words [12]. By applying five syntactic constraints (branch limitations, balanced parentheses, aromaticity exclusion), TokenSMILES minimizes redundant enumerations while maintaining valence compliance through semantic parsing rules [12]. Implemented in the open-source SmilX tool, this approach generates valid SMILES with accuracy comparable to existing implementations for molecules with low hydrogen deficiency (HDI ≤ 4) [12].
Table 3: Essential Software Tools and Resources for Molecular Representation Research
| Tool/Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| RDKit | Open-source cheminformatics library | SMILES validation, manipulation, and visualization | Fundamental tool for molecular I/O operations and structure depiction |
| SmilX | Open-source tool | TokenSMILES implementation and grammatical validation | Grammar-based SMILES standardization and analysis |
| SELFIES Library | Python library | SELFIES encoding/decoding with guaranteed validity | Robust molecular generation with 100% validity guarantee |
| t-SMILES Framework | Python implementation | Fragment-based molecular representation | Multi-scale molecular description and generation |
| ChemBERTa | Pre-trained transformer model | Chemical language understanding | SMILES-based property prediction and embedding generation |
| ChEMBL Database | Curated chemical database | Bioactive molecule data with associated properties | Training data for predictive and generative models |
| ZINC Database | Commercial compound database | Purchasable compounds for virtual screening | Real-world chemical space for practical applications |
| Daylight Toolkit | Commercial cheminformatics toolkit | SMARTS pattern matching and chemical computation | Industrial-strength substructure searching and analysis |
This systematic analysis reveals that molecular representation selection involves nuanced trade-offs rather than universal superiority. SMILES remains widely adopted due to its balance of simplicity, expressivity, and compatibility with natural language processing techniques, despite its well-documented validity limitations [11] [9]. The perception of invalid SMILES as a critical shortcoming requires reconsideration in light of evidence that this property provides a self-corrective mechanism that filters low-likelihood samples and enhances distribution learning [13].
Emerging representations address specific SMILES limitations while introducing new considerations. SELFIES guarantees validity but may constrain chemical space exploration [13], while t-SMILES demonstrates superior performance in goal-directed tasks through fragment-based representation but requires careful fragmentation scheme selection [14]. TokenSMILES and grammatical approaches offer promising directions for standardized, interpretable representation [12].
Future research directions include hybrid representation systems that leverage complementary strengths, dynamic representation selection based on specific tasks, and integration of three-dimensional structural information currently absent from SMILES-based representations [14]. As chemical language models continue to evolve, the optimal molecular representation may ultimately depend on the specific application context, with different representations excelling in distribution learning, constrained optimization, or interpretability tasks.
In computational chemistry and AI-driven drug discovery, how molecules are represented as computer-readable strings is foundational. For decades, the Simplified Molecular Input Line Entry System (SMILES) has been the dominant language, providing a concise text-based representation of molecular structures [1]. However, SMILES has a significant weakness: its complex grammar means that randomly generated or machine-learned SMILES strings often represent chemically invalid or impossible molecules. This flaw hinders automated molecular design [16]. SELFIES (SELF-referencing Embedded Strings), a newer representation developed in 2019, was designed specifically to solve this problem. Its key innovation is a 100% robustness guarantee; every possible SELFIES string corresponds to a valid molecule, making it a powerful tool for generative AI models in chemistry and material science [16].
This guide objectively compares SELFIES against SMILES and other representations, providing researchers with the experimental data and methodological context needed to select the right tool for their computational workflows.
The fundamental difference between SELFIES and SMILES lies in their underlying grammar and how they handle complex molecular features.
SELFIES is based on a formal grammar (Chomsky type-2). One can think of a SELFIES string as a small computer program that a compiler runs to build a molecular graph. This derivation process uses a "state" or minimal memory that tracks valences and bonding patterns. This stateful derivation prevents physical impossibilities, such as an oxygen atom forming four bonds, which a stateless SMILES string might inadvertently allow [16].
The diagram below visualizes the derivation process of a SELFIES string into a molecular graph.
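A toy decoder makes the stateful derivation concrete. This is an illustration only, not the real selfies library: each token requests an element and a bond order, and the decoder clamps that order to the valence still free on both atoms, so no token stream can produce an over-bonded atom.

```python
# Toy SELFIES-style stateful derivation (illustrative; NOT the selfies
# library). The derivation state is the unused valence per atom.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def derive(tokens):
    """Decode (element, requested_bond_order) tokens into (atoms, bonds),
    clamping each bond to the valence both partners still have free."""
    atoms, bonds, remaining = [], [], []
    for elem, want in tokens:
        atoms.append(elem)
        remaining.append(VALENCE[elem])
        if len(atoms) > 1:
            prev = len(atoms) - 2
            order = min(want, remaining[prev], remaining[-1])
            if order > 0:
                bonds.append((prev, len(atoms) - 1, order))
                remaining[prev] -= order
                remaining[-1] -= order
    return atoms, bonds

# A "[C][=O][=C]"-like stream: the second double bond is impossible
# (the oxygen has no valence left), so it is clamped away, not rejected.
atoms, bonds = derive([("C", 1), ("O", 2), ("C", 2)])
```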
Multiple studies have quantitatively evaluated SELFIES and SMILES across various AI model types and tasks. The following tables summarize key experimental findings.
Table 1: Performance in Molecular Property Prediction Tasks (ROC-AUC)
| Model / Task | SMILES (with Augmentation) | SELFIES (with Augmentation) | Performance Delta | Notes & Source |
|---|---|---|---|---|
| QK-LSTM on SIDER Dataset (Classical) | 0.821 (Baseline) | 0.869 | +5.97% | [2] |
| QK-LSTM on SIDER Dataset (Hybrid Quantum) | 0.819 (Baseline) | 0.868 | +5.91% | [2] |
| BERT-based Model (HIV, Tox, BBBP) | Varies by tokenizer | Varies by tokenizer | N/A | APE tokenizer with SMILES often led [1] |
Table 2: Performance in Generative Modeling and De Novo Design
| Metric / Model Type | SMILES | SELFIES | Notes & Source |
|---|---|---|---|
| Validity Rate (General) | Often <80% in early generative models | 100% (by design) | [16] |
| Validity Rate (Diffusion Model) | High (Metric-dependent) | High (Metric-dependent) | Performance is task-dependent [17] |
| Novelty (Diffusion Model) | High | Moderate | IUPAC led in novelty [17] |
| Diversity / QED Score (Diffusion Model) | Good (SAscore leader) | Best (Tied with SMARTS) | [17] |
| Application in STONED Algorithm | Limited by low validity | Enabled highly efficient exploration | [16] |
Table 3: Direct Comparison of Core Features and Attributes
| Feature | SMILES | SELFIES |
|---|---|---|
| Core Principle | Graph traversal string | Formal grammar-based derivation |
| Robustness Guarantee | No | Yes |
| Handling of Aromaticity | Lowercase atoms / ':' symbol | Symbol-based, localized |
| Representation of Rings | Two numbers (non-local) | One symbol + length (local) |
| Representation of Branches | Parentheses (non-local) | One symbol + length (local) |
| Human Readability | High (for trained chemists) | Moderate |
| Machine Learnability | Complex grammar can be challenging | Evidence suggests it is easier for models [16] |
To critically assess the data in the tables, it is essential to understand the experimental designs that generated them.
A 2025 study provided the first analysis of data-augmented SELFIES for molecular property prediction [2].
A 2025 comparative study benchmarked four representations (SMILES, SELFIES, SMARTS, IUPAC) using a state-of-the-art diffusion model [17].
The workflow of this comparative study is summarized in the diagram below.
For researchers interested in implementing SELFIES in their projects, the following tools and resources are essential.
Table 4: Key Research Reagents and Computational Tools
| Item / Resource | Function / Purpose | Availability / Installation |
|---|---|---|
| selfies Python Library | The core library for converting between SMILES and SELFIES, and for manipulating SELFIES strings. | pip install selfies [16] |
| SELFIES Documentation | Comprehensive guide on syntax, API, and use cases. | GitHub Repository [16] |
| Standard Cheminformatics Libraries | Used in conjunction with SELFIES for molecule handling and validation (e.g., RDKit, OpenBabel). | - |
| Benchmark Datasets (e.g., MoleculeNet) | Standardized datasets like SIDER for training and fairly comparing model performance. | MoleculeNet [2] |
| Pre-trained Models (e.g., ChemBERTa) | Transformer models pre-trained on large molecular corpora (often in SMILES) that can be fine-tuned. | Hugging Face / Literature [1] |
The experimental evidence clearly demonstrates that SELFIES is not just a theoretical improvement but a practical solution to the robustness problem that plagued SMILES in generative AI. Its guaranteed validity simplifies model architectures, enables new combinatorial and evolutionary algorithms like STONED, and shows competitive, if not superior, performance in key predictive and generative tasks.
However, the "best" representation is context-dependent. While SELFIES is unparalleled for robustness and is a top choice for generative design, other representations like SMILES (with modern tokenizers) can excel in specific predictive tasks, and IUPAC can offer advantages in novelty [1] [17]. The future lies in developing representation-agnostic models and hybrid approaches that can leverage the unique strengths of each molecular language to accelerate the discovery of new functional molecules and materials.
Molecular representations form the foundational layer of computational chemistry and drug discovery, serving as the critical bridge between a chemical structure and its digital interpretation by algorithms. The accurate encoding of molecular features—especially complex structural aspects like rings, branches, and aromaticity—is paramount for predicting biological activity, physicochemical properties, and ultimately, the success of AI-driven drug design campaigns. This guide objectively compares the performance of three predominant molecular representation methods: the Simplified Molecular-Input Line-Entry System (SMILES), SELFIES (SELF-referencing Embedded Strings), and graph-based representations. Framed within a broader thesis on evaluating the accuracy of these representations, this analysis synthesizes current research to illustrate how each method handles key structural elements, with a particular focus on their implications for real-world research and development applications. The choice of representation directly influences a model's ability to navigate chemical space, perform valid scaffold hopping, and generate novel therapeutic compounds, making this comparison essential for researchers and drug development professionals.
Introduced in 1988, SMILES is a line notation method that describes the structure of a chemical species using short ASCII strings. It encodes a molecular graph as a sequence of characters representing atoms, bonds, brackets for branches, and numbers for ring closures. For instance, the SMILES for benzene is simply c1ccccc1, denoting a six-membered aromatic ring. Its widespread adoption is due to its human-readability and compact form. However, a significant limitation is that a single molecule can have multiple valid SMILES strings, which can complicate learning for AI models. Furthermore, SMILES does not inherently guarantee syntactic or semantic validity; a small change in the string can lead to an invalid molecular structure, posing a challenge for generative models [3].
SELFIES is a more recent string-based representation (a successor to SMILES) designed specifically to address the issue of validity in generative AI models. The key innovation of SELFIES is its syntax, which ensures that every possible string corresponds to a valid molecular graph. It uses a grammar of derived rules that self-reference the growing molecular structure, preventing the formation of impossible bonds or atoms. This makes SELFIES exceptionally robust for de novo molecular design and machine learning applications, as it virtually eliminates the generation of invalid structures during the exploration of chemical space, thereby improving the efficiency of AI-driven discovery pipelines [3].
Graph-based representations model a molecule directly as a mathematical graph, where atoms are represented as nodes and bonds as edges. This approach most closely mirrors the actual topological structure of a molecule. In computational models, particularly Graph Neural Networks (GNNs), this native representation allows for the direct propagation and aggregation of information between connected atoms. This inherent structural fidelity enables GNNs to excel at capturing complex relational patterns and physical inductive biases within a molecule, leading to state-of-the-art performance on many predictive tasks. Unlike string-based methods, graphs are inherently invariant to the ordering of atoms, which can simplify the learning process [3].
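The graph view described here can be sketched with nothing more than an adjacency structure (an illustrative toy, not a cheminformatics library):

```python
# A molecule as a graph: atoms are nodes, bonds are edges -- the native
# input form for GNNs. Benzene's ring is simply a 6-cycle.
benzene_atoms = ["C"] * 6
benzene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # ring closes

def adjacency(n_atoms, bonds):
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj

adj = adjacency(len(benzene_atoms), benzene_bonds)

# The ring is inherently valid: every atom has exactly two neighbours, with
# no ring-closure digits or bracket syntax to get wrong.
assert all(len(neighbours) == 2 for neighbours in adj.values())

# Relabelling atoms (the graph analogue of writing a different SMILES for the
# same molecule) leaves graph invariants such as the degree sequence unchanged.
relabelled = [(5 - a, 5 - b) for a, b in benzene_bonds]
adj2 = adjacency(6, relabelled)
assert sorted(len(v) for v in adj.values()) == sorted(len(v) for v in adj2.values())
print("degree sequence:", sorted(len(v) for v in adj.values()))
```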
The following tables summarize the performance of SMILES, SELFIES, and graph-based representations across various benchmark tasks and their specific handling of key structural features.
Table 1: Overall Performance on Benchmark Molecular Property Prediction Tasks (MoleculeNet)
| Representation | Average AUC (k=4) | Average AUC (k=6) | Handling of Syntactic Validity | Interpretability |
|---|---|---|---|---|
| SMILES | 0.811 | 0.798 | Low | Medium |
| SELFIES | 0.819 | 0.805 | High (100%) | Medium |
| Molecular Graph | 0.832 | 0.821 | Inherent | High |
| Multi-View (MoL-MoE) | 0.847 | 0.839 | Varies by component | Medium |
Table 2: Handling Key Structural Features - A Qualitative and Functional Comparison
| Structural Feature | SMILES | SELFIES | Graph-Based |
|---|---|---|---|
| Ring Systems | Encoded via ring closure numbers (e.g., C1CCCC1 for cyclopentane). Can be ambiguous and invalid. | Encoded with guaranteed validity; derived rules prevent ring errors. | Directly represented as cycles in the graph; inherently valid. |
| Branching | Handled with parentheses (e.g., CC(O)C for isopropanol). Syntax errors can create invalid branches. | Robust branching via a self-referencing grammar that ensures correctness. | Directly represented as tree-like node connections; no syntax issues. |
| Aromaticity | Implicitly denoted by lowercase atom symbols (e.g., c for aromatic carbon). Must be perceived by the model. | Similar implicit encoding as SMILES, but with validity guarantees. | Often treated as a formal bond type or as a node/edge feature; explicit. |
| Representation Invariance | Low (single molecule has multiple valid strings). | Low (similar to SMILES). | High (inherently invariant to atom ordering). |
| Primary Strength | Human-readable, widespread use. | Guaranteed molecular validity. | Native structural representation, high predictive accuracy. |
The data in Table 1, derived from studies on Multi-View Mixture-of-Experts models, demonstrates that while graph-based representations often lead in predictive accuracy on tasks like toxicity and solubility prediction, SELFIES provides a crucial advantage in validity-guaranteed generation. The MoL-MoE framework, which integrates SMILES, SELFIES, and graph views, consistently achieves superior performance by leveraging the complementary strengths of each representation [18]. Table 2 highlights the core structural differences: SMILES and SELFIES are sequential and can struggle with invariance, while graphs are topological and naturally invariant. SELFIES' main contribution is its robust handling of ring and branch syntax to ensure validity, whereas graphs excel at natively representing complex interconnected structures.
A pivotal experiment in evaluating molecular representations is the implementation of a Multi-View Mixture-of-Experts (MoL-MoE) framework, in which SMILES, SELFIES, and graph inputs are routed through dedicated encoders and a trainable gating network weights the resulting expert features for each prediction.
Experimental protocols for evaluating how well representations capture aromaticity often rely on quantum chemical calculations, which serve as ground truth.
Table 3: Essential Software Tools for Molecular Representation and Analysis
| Tool Name | Type/Function | Application in Representation Research |
|---|---|---|
| RDKit | Cheminformatics Toolkit | The de facto standard for converting between SMILES, SELFIES, and graph representations, calculating molecular descriptors, and generating fingerprints. |
| PyMOL | Molecular Visualization System | Used for creating publication-quality images of molecular structures, validating ring systems, and aromaticity visualization [20]. |
| VMD | Visual Molecular Dynamics | A platform for displaying, animating, and analyzing large biomolecular systems, useful for analyzing dynamic molecular behavior [21]. |
| ChimeraX | Next-Gen Molecular Modeling | An interactive system for analysis and presentation graphics of molecular structures and related data, with advanced visualization features [20]. |
| Deep Learning Frameworks (PyTorch, TensorFlow) | ML Model Development | Essential for building and training Graph Neural Networks, Transformers, and other models that learn from molecular representations. |
| MoleculeNet | Benchmark Dataset Collection | A standardized set of molecular property datasets used to fairly evaluate and compare the performance of different representation methods. |
The following diagram illustrates the workflow of a Multi-View Mixture-of-Experts (MoL-MoE) system, which integrates multiple molecular representations to enhance predictive performance.
This workflow demonstrates how the MoL-MoE framework processes SMILES (yellow), SELFIES (red), and molecular graphs (blue) through dedicated encoders. The gating network (red diamond) then dynamically weights the contributions of the various experts (green), which process features from one or more representations, to produce a final, more accurate prediction. This architecture leverages the complementary strengths of each representation, allowing the model to, for instance, use the graph's topological accuracy for ring-based properties and SELFIES' validity for generative tasks [18].
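The gating computation described above can be sketched as follows (a toy illustration with invented numbers, not the MoL-MoE implementation):

```python
import math

# Mixture-of-experts combination: a gating vector softmax-weights the
# per-representation expert outputs into a single prediction.
def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-expert predictions for one molecule, in the order
# (SMILES expert, SELFIES expert, graph expert).
expert_outputs = [0.62, 0.58, 0.71]
gate_logits = [0.2, 0.1, 1.5]        # here the gate favours the graph expert

weights = softmax(gate_logits)
prediction = sum(w * y for w, y in zip(weights, expert_outputs))

assert abs(sum(weights) - 1.0) < 1e-9            # weights form a distribution
assert min(expert_outputs) <= prediction <= max(expert_outputs)
print([round(w, 3) for w in weights], round(prediction, 3))
```

The final prediction is a convex combination of the experts, so the gate can lean on the graph view for topology-sensitive properties while still drawing on the string views.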
The evaluation of SMILES, SELFIES, and graph-based representations reveals a landscape defined by trade-offs. No single representation is universally superior; each excels in different facets of the molecular modeling pipeline. SMILES remains a versatile and widely supported standard but is hampered by validity issues. SELFIES presents a powerful solution for generative chemistry by guaranteeing molecular validity, ensuring that AI models explore syntactically correct regions of chemical space. Graph-based representations, by most closely mirroring the true structure of a molecule, consistently deliver high predictive accuracy for property forecasting.
The most promising future direction, as evidenced by the superior performance of the Multi-View MoL-MoE model, lies in hybrid approaches. By integrating the unique strengths of each representation—the sequential patterns in SMILES, the syntactic robustness of SELFIES, and the topological fidelity of graphs—researchers can build more powerful, accurate, and generalizable AI tools for drug discovery. This multi-faceted approach will be crucial for tackling complex challenges such as accurate scaffold hopping, de novo molecular design, and the reliable prediction of complex ADMET properties, ultimately accelerating the development of new therapeutics.
In the field of computational chemistry and drug discovery, machines do not perceive molecules as physical structures but as specialized languages. Molecular string representations like SMILES (Simplified Molecular-Input Line-Entry System) and SELFIES (SELF-referencing Embedded Strings) have become the predominant alphabets for this dialogue [1]. However, for machine learning models to comprehend these strings, they must first be broken down into digestible pieces through a process called tokenization. This process is far from trivial; the choice of how a molecule is split into tokens can significantly influence a model's ability to predict drug toxicity, biological activity, or optimize for a desired property [22] [23].
This guide objectively compares the performance of leading tokenization techniques applied to SMILES and SELFIES representations. By synthesizing recent research and experimental data, we provide a clear framework for researchers and drug development professionals to select the most appropriate tokenization strategy for their specific applications.
Before delving into tokenization, it is essential to understand the two primary chemical string representations they are designed to process.
The following workflow outlines the typical process of using these representations in a machine learning pipeline, from raw molecule to model prediction:
Tokenization is the crucial bridge between a raw chemical string and a machine-learning model. It defines the model's basic vocabulary and influences how it "understands" chemical grammar. The table below summarizes the core tokenization methods evaluated in recent literature.
| Tokenization Method | Core Principle | Key Advantages | Key Limitations |
|---|---|---|---|
| Atom-wise [22] [25] | Splits string into atoms, bonds, and rings using rules/regular expressions. | Chemically intuitive; simple to implement; widely used. | Treats identical atoms the same regardless of environment; may not capture context [23]. |
| Byte Pair Encoding (BPE) [1] | Data-driven; iteratively merges most frequent character pairs. | Reduces vocabulary size; can capture common substructures. | Can create chemically ambiguous tokens; may split atoms unnaturally [25]. |
| Atom Pair Encoding (APE) [1] [26] | Starts with atom-wise tokens, then learns to merge frequent atom pairs. | Preserves chemical integrity better than BPE; enhances contextual relationships [1]. | A closed-vocabulary method; performance depends on training data. |
| Atom-in-SMILES (AIS) [23] | Represents each atom by its local chemical environment (like a circular fingerprint). | Highly chemically accurate; reduces token ambiguity and repetition. | Increases sequence length and vocabulary size; more complex to implement. |
| Smirk [25] | Fully decomposes complex atoms (e.g., [C@@H]) into constituent glyphs. | Guarantees complete coverage of the SMILES specification; no unknown tokens. | Can lead to long sequence lengths; may increase computational cost. |
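The atom-wise scheme in the table above can be sketched with a single regular expression (a simplified variant of the pattern widely used for SMILES models; it omits some rarer symbols):

```python
import re

# Atom-wise tokenization: bracket atoms, two-letter elements (Br, Cl),
# two-digit ring labels (%NN), single atoms, bonds, branches and ring
# digits each become exactly one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-+\\/:~@\.\(\)]|\d)"
)

def atomwise_tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(atomwise_tokenize("CC(=O)Oc1ccccc1"))   # phenyl acetate
# -> ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```

Note how `Cl` and `Br` are matched before single letters, which is exactly the chemical integrity that character-level BPE can violate.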
Numerous studies have systematically evaluated how these tokenization strategies, combined with different molecular representations, impact model performance on standard chemical tasks.
The following table consolidates quantitative results from key studies, measured by the Area Under the Receiver Operating Characteristic curve (ROC-AUC) for classification tasks and Root Mean Square Error (RMSE) for regression tasks. Higher AUC and lower RMSE indicate better performance.
Table: Performance Comparison of Tokenization Schemes on MoleculeNet Benchmarks (ROC-AUC / RMSE)
| Model / Tokenizer | HIV | BBBP | Tox21 (SR-p53) | ClinTox | ESOL (RMSE) |
|---|---|---|---|---|---|
| SMILES + BPE [1] [22] | 0.769 | 0.859 | 0.817 | 0.913 | ~1.0 |
| SMILES + Atom-wise [22] | 0.771 | 0.899 | 0.824 | 0.919 | ~1.0 |
| SMILES + APE [1] | 0.784 | 0.914 | 0.841 | - | - |
| SELFIES + BPE [1] [22] | 0.763 | 0.858 | 0.815 | 0.915 | ~1.0 |
| SELFIES + Atom-wise [22] | 0.772 | 0.897 | 0.823 | 0.918 | ~1.0 |
| SELFIES + APE [1] | 0.778 | 0.909 | 0.838 | - | - |
| SMILES + AIS [23] | - | - | - | - | 0.88 |
The experimental workflows cited in this guide rely on a suite of software tools and databases that form the essential toolkit for modern computational chemistry research.
Table: Key Research Tools and Databases for Chemical Language Modeling
| Item | Function | Relevance to Tokenization Research |
|---|---|---|
| RDKit [27] [22] | Open-source cheminformatics toolkit. | Used for generating canonical SMILES, drawing molecular images, calculating descriptors, and validating chemical structures. |
| Hugging Face Transformers/Tokenizers [1] [25] | Library for state-of-the-art NLP. | Provides implementations of transformer models (BERT, RoBERTa) and tokenizers (BPE, SentencePiece), which are adapted for chemical language. |
| PubChem [22] [25] | NIH's database of chemical molecules and their activities. | A primary source for large-scale, unlabeled molecular data used for pre-training chemical language models. |
| ChEMBL [27] [24] | Manually curated database of bioactive molecules with drug-like properties. | Commonly used for supervised fine-tuning on tasks related to drug discovery and toxicology. |
| MoleculeNet [27] [22] | A benchmark suite for molecular machine learning. | Provides standardized train/validation/test splits for fair evaluation of model performance on diverse chemical tasks. |
| SELFIES Python Library [22] | Official library for the SELFIES representation. | Enables conversion between SMILES and SELFIES, ensuring all strings represent valid molecules. |
The "tokenization challenge" in chemical language modeling does not have a single, universal solution. The optimal choice is a strategic decision that balances performance, interpretability, and computational cost.
As the field progresses, the synergy between chemically intelligent representations like SELFIES and advanced, context-aware tokenizers like APE and AIS will continue to push the boundaries of what's possible in accelerated drug discovery and materials science.
In computational chemistry, machine learning models rely on tokenizers to convert molecular string representations, such as SMILES and SELFIES, into manageable subunits. The choice of tokenization strategy significantly influences a model's ability to understand chemical structure and predict properties accurately. Byte Pair Encoding (BPE) is a widely adopted subword tokenization method in natural language processing that has been transferred to chemical language models [1]. However, its limitations in capturing chemical semantics have spurred the development of domain-specific alternatives like Atom Pair Encoding (APE) [1]. This guide objectively compares BPE and APE, providing experimental data to help researchers select the optimal tokenization strategy for molecular representation tasks within drug development pipelines.
BPE is a data compression technique that constructs a vocabulary by iteratively merging the most frequent pairs of characters or bytes [1]. In chemical language processing, BPE is typically applied to SMILES or SELFIES strings after an initial pre-tokenization step that often uses regular expressions to split the string into smaller pretokens [28]. A significant limitation of this approach is that pre-tokenization forces most common words (or molecular fragments) to be represented as single tokens, resulting in a skewed token distribution where frequent tokens dominate, and the vocabulary additions from larger sizes provide diminishing returns [29]. When applied to molecular representations, BPE often fails to capture the contextual relationships between chemical elements, as it operates purely on character frequency without chemical intelligence [1].
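The merge loop at the heart of BPE can be sketched in a few lines (a toy character-level trainer, not a production tokenizer):

```python
from collections import Counter

# Toy BPE training: repeatedly merge the single most frequent adjacent pair.
# The loop is purely frequency-driven -- it has no notion of chemistry, so it
# could just as happily split a two-letter element like 'Cl' into 'C' + 'l'.
def bpe_train(corpus, num_merges):
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))   # count adjacent token pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        for i, seq in enumerate(corpus):      # apply the merge corpus-wide
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(seq[j] + seq[j + 1])
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            corpus[i] = out
    return merges

corpus = [list("CCO"), list("CCN"), list("CC(C)O")]
print(bpe_train(corpus, 2))   # first merge fuses the ubiquitous ('C', 'C') pair
print(corpus)
```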
Atom Pair Encoding (APE) is a novel tokenization method specifically designed for chemical languages [1]. APE positions itself as a fusion of atom-wise tokenization and BPE principles [25]. Instead of starting from raw characters, APE begins with atom-level tokens and learns to merge adjacent pairs that frequently co-occur, forming chemically meaningful fragments [1]. This approach preserves the integrity and contextual relationships among chemical elements more effectively than BPE, as the merges are informed by the underlying chemical structure rather than just character statistics [1]. APE was developed to address BPE's limitations in capturing the structural nuances of molecules, thereby enhancing model performance on chemical property prediction tasks [1].
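The APE idea can be illustrated by moving the same frequency counting onto atom-level pretokens (the published APE is research code; this sketch only shows why atom-level pretokenization prevents chemically meaningless splits such as breaking `Cl` apart):

```python
import re
from collections import Counter

# Atom-level pretokenization: merges learned on top of these tokens can only
# ever fuse whole chemical units, never fragments of an atom symbol.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#\-+\(\)\d]")

def most_frequent_atom_pair(smiles_corpus):
    pairs = Counter()
    for s in smiles_corpus:
        toks = ATOM.findall(s)
        pairs.update(zip(toks, toks[1:]))
    return pairs.most_common(1)[0][0]

corpus = ["CCO", "CC(=O)O", "CCN", "ClCCl"]
print(most_frequent_atom_pair(corpus))   # the first APE merge candidate
```

An APE vocabulary is then built by repeating this count-and-merge step, so frequent fragments such as carbonyl or phenyl groups become single tokens while every token boundary stays chemically meaningful.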
Experimental studies have systematically compared BPE and APE tokenization within BERT-based models on standardized molecular property prediction tasks. Performance is typically evaluated using the ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) metric on benchmark datasets like HIV, toxicology (Tox21), and blood-brain barrier penetration (BBBP) from MoleculeNet [1] [26].
Table 1: Performance Comparison (ROC-AUC) of BPE vs. APE with SMILES and SELFIES Representations
| Dataset | Tokenization | Molecular Representation | ROC-AUC Score |
|---|---|---|---|
| HIV | BPE | SMILES | Baseline |
| HIV | APE | SMILES | Significant Improvement [1] |
| Toxicology (Tox21) | BPE | SMILES | Baseline |
| Toxicology (Tox21) | APE | SMILES | Significant Improvement [1] |
| Blood-Brain Barrier (BBBP) | BPE | SMILES | Baseline |
| Blood-Brain Barrier (BBBP) | APE | SMILES | Significant Improvement [1] |
| Various | BPE | SELFIES | Comparable to SMILES+BPE [1] |
| Various | APE | SELFIES | Improved over SELFIES+BPE [1] |
The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE across these tasks [1]. The performance advantage is attributed to APE's ability to preserve the integrity and contextual relationships among chemical elements, thereby enhancing the model's classification accuracy [1]. While SELFIES representations offer the advantage of guaranteed molecular validity, models using SELFIES with BPE achieved results comparable to, but not superior to, SMILES with BPE [1]. APE also demonstrated improved performance with SELFIES compared to using BPE with SELFIES [1].
To ensure reproducibility and provide clarity on the data presented, this section outlines the standard methodologies used in the key experiments comparing BPE and APE.
The following diagram illustrates the comparative workflow for tokenizing a molecular string using BPE and APE:
The following table details key resources, including tokenizers, molecular representations, and datasets, essential for conducting experiments in chemical language modeling.
Table 2: Key Reagents and Resources for Chemical Tokenization Research
| Category | Item | Function & Description | Example Sources / libraries |
|---|---|---|---|
| Molecular Representations | SMILES (Simplified Molecular Input Line Entry System) | Linear string notation representing molecular structure using ASCII characters [1]. | RDKit, OpenSMILES |
| | SELFIES (Self-Referencing Embedded Strings) | A robust molecular representation guaranteed to produce valid molecules from every string, addressing SMILES validity issues [1] [2]. | SELFIES library |
| Tokenization Algorithms | Byte Pair Encoding (BPE) | A general-purpose subword tokenization algorithm that merges frequent character pairs [1]. | Hugging Face Tokenizers, SentencePiece |
| | Atom Pair Encoding (APE) | A chemistry-specific tokenizer that merges frequent adjacent atom-level tokens to form meaningful fragments [1]. | Custom implementations (research code) |
| | Smirk | An atomically complete tokenizer that fully decomposes SMILES for maximum coverage and open-vocabulary modeling [28] [25]. | Smirk (GitHub) |
| Model Architectures | BERT-based Models | Transformer-based encoder models effective for pre-training on unlabeled data and fine-tuning for classification tasks [1]. | Hugging Face Transformers, ChemBERTa |
| Benchmarking Datasets | MoleculeNet | A benchmark collection of molecular datasets for evaluating machine learning models [2]. | HIV, Tox21, BBBP, SIDER |
| Software Libraries | Hugging Face Ecosystem | Provides open-source implementations of tokenizers, models, and training utilities, standardizing the experimental pipeline [1]. | Tokenizers, Transformers, Datasets |
The experimental evidence demonstrates that Atom Pair Encoding (APE) provides a significant performance advantage over generic Byte Pair Encoding (BPE) for molecular property prediction tasks. APE's core strength lies in its domain-specific design, which preserves chemical context and structural relationships more effectively than frequency-based BPE.
For researchers and scientists in drug development, the choice of tokenizer should align with project goals. APE is recommended when maximizing predictive accuracy on tasks like toxicity, activity, or permeability prediction is critical. Its ability to form chemically meaningful tokens directly enhances model comprehension. BPE may still be suitable for more general-purpose molecular modeling or when leveraging pre-existing, well-supported NLP pipelines, though with an expected performance trade-off. Furthermore, the emergence of open-vocabulary tokenizers like Smirk presents a promising alternative, ensuring coverage for diverse and complex molecules, including organometallics and inorganics, which are often poorly handled by closed-vocabulary approaches [28] [25].
In the field of computational chemistry and drug discovery, converting molecular structures into machine-readable numerical embeddings represents a foundational step for any AI-driven research pipeline. The choice of molecular representation directly influences the performance of downstream models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers, in predicting critical properties such as drug efficacy, toxicity, and metabolic characteristics [30]. This guide provides an objective comparison of the predominant string-based representations—SMILES and SELFIES—focusing on their experimental performance across different neural architectures. With the rising adoption of AI in pharmaceutical research, understanding the empirical strengths and limitations of these representations has become essential for building effective predictive models [31]. The following analysis synthesizes recent benchmarking studies to guide researchers in selecting optimal molecular embeddings for their specific applications.
String-based representations allow complex molecular structures to be processed by sequence models originally developed for natural language. The two primary formats offer different advantages and limitations.
SMILES (Simplified Molecular-Input Line-Entry System): This notation represents molecular structures using a compact string of ASCII characters, where atoms are denoted by elemental symbols, bonds by specific characters (e.g., '-', '=', '#'), branches by parentheses, and rings by numerical markers [2]. While widely adopted, SMILES has inherent limitations: it can generate multiple valid strings for the same molecule, lacks strict valency checks, and often produces syntactically or chemically invalid strings when processed by generative models [7].
SELFIES (Self-Referencing Embedded Strings): Developed to address SMILES limitations, SELFIES uses a robust grammar where every possible string corresponds to a valid molecular structure [2] [7]. Instead of complex grammar for rings and branches, SELFIES uses single symbols to represent these structural elements, explicitly encoding their length or size. This guarantee of validity makes SELFIES particularly valuable for generative tasks and applications requiring robust automated processing [2].
Recent studies have quantitatively evaluated how SMILES and SELFIES representations perform across different machine learning models and tasks. The table below summarizes key experimental findings from benchmark studies.
Table 1: Performance Comparison of SMILES vs. SELFIES Across Different Models and Tasks
| Model Architecture | Dataset/Task | Representation | Performance Metric | Result | Key Finding |
|---|---|---|---|---|---|
| QK-LSTM [2] | Molecular Property Prediction | Augmented SMILES | ROC-AUC | Baseline | Augmenting SELFIES yielded a 5.97% improvement in classical models and a 5.91% improvement in hybrid quantum-classical models compared to augmented SMILES. |
| QK-LSTM [2] | Molecular Property Prediction | Augmented SELFIES | ROC-AUC | +5.97% (Classical), +5.91% (Hybrid) | |
| Domain-Adapted Transformer [7] | ESOL (Solubility) | SMILES (ChemBERTa-zinc) | RMSE | ~1.130 (estimated from baseline) | A SMILES-pretrained model adapted to SELFIES matched or exceeded its original SMILES performance, achieving an RMSE of 0.944 on ESOL without architectural changes. |
| Domain-Adapted Transformer [7] | ESOL (Solubility) | SELFIES (Adapted) | RMSE | 0.944 | |
| Domain-Adapted Transformer [7] | FreeSolv (Hydration) | SELFIES (Adapted) | RMSE | 2.511 | The domain-adapted SELFIES model demonstrated effective knowledge transfer from SMILES, performing well on multiple physicochemical property prediction tasks. |
| Domain-Adapted Transformer [7] | Lipophilicity | SELFIES (Adapted) | RMSE | 0.746 | |
| SELFormer [7] | ESOL (Solubility) | SELFIES | RMSE | ~15% improvement over GEM (GNN) | A transformer pretrained from scratch on SELFIES (SELFormer) showed significant gains over a geometry-based graph neural network. |
| SELFormer [7] | SIDER (Side Effects) | SELFIES | ROC-AUC | ~10% improvement over MolCLR (GNN) | SELFormer also increased ROC-AUC by 10% on a key biomedical benchmark dataset. |
A comprehensive benchmarking study of 25 pretrained molecular embedding models across 25 datasets revealed a surprising result: nearly all neural models showed negligible or no improvement over the traditional ECFP (Extended Connectivity Fingerprint) molecular fingerprint [32]. Only one model, CLAMP (which also leverages fingerprint-based inputs), performed statistically significantly better than alternatives. This finding highlights that while modern neural representations are promising, traditional fingerprints remain strong baselines due to their computational efficiency and proven performance [32].
To ensure reproducible and fair comparisons, researchers have developed standardized protocols for evaluating molecular representations. This section details two key methodological approaches used in recent studies.
The Augmented Molecular Retrieval (AMORE) framework provides a zero-shot method to assess the chemical understanding of language models by testing their robustness to different SMILES representations of the same molecule [4]. The methodology is visualized below.
Diagram 1: AMORE framework for evaluating representation robustness.
Experimental Protocol:
SMILES Augmentation: Generate multiple valid SMILES strings for each molecule in the dataset through permutations such as randomized atom order, different branch arrangements, and varying ring labels [4]. These are identity transformations that do not change the underlying chemical structure.
Embedding Generation: Encode both original and augmented SMILES strings using the chemical language model being evaluated to produce vector embeddings for each representation [4].
Similarity Calculation: Compute distances (e.g., cosine similarity or Euclidean distance) between embeddings of different SMILES representations of the same molecule [4].
Robustness Assessment: A robust model should produce similar embeddings for all SMILES variants of the same molecule. If the nearest embedding to an augmented SMILES is not from the original molecule, it indicates the model may be overfitting to textual patterns rather than learning chemical semantics [4].
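Steps 3 and 4 can be sketched as follows (the embedding vectors here are invented placeholders standing in for real model outputs):

```python
import math

# Robustness check: a model that has learned chemistry (not string patterns)
# should embed every SMILES variant of a molecule to nearly the same vector.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Placeholder embeddings (in practice these come from the chemical LM).
embeddings = {
    "Cc1ccccc1": [0.90, 0.10, 0.40],   # toluene, canonical SMILES
    "c1ccccc1C": [0.88, 0.12, 0.41],   # toluene, alternative atom ordering
    "CCO":       [0.10, 0.85, 0.30],   # ethanol, a different molecule
}

same = cosine(embeddings["Cc1ccccc1"], embeddings["c1ccccc1C"])
diff = cosine(embeddings["Cc1ccccc1"], embeddings["CCO"])

# Variants of one molecule should sit closer together than distinct molecules;
# if this fails, the model is overfitting to textual patterns.
assert same > diff
print(round(same, 3), round(diff, 3))
```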
Experiments using AMORE have indicated that many chemical language models are still not robust to different SMILES representations, highlighting a significant gap in true chemical understanding [4].
Domain-adaptive pretraining enables efficient adaptation of existing SMILES-based models to process SELFIES representations without expensive pretraining from scratch [7].
Table 2: Key Experimental Parameters for Domain Adaptation to SELFIES
| Parameter | Specification | Rationale |
|---|---|---|
| Base Model | ChemBERTa-zinc-base-v1 | A transformer pretrained on SMILES strings from the ZINC database [7]. |
| Adaptation Data | ~700,000 molecules from PubChem | Sufficient data to learn SELFIES syntax while maintaining computational efficiency [7]. |
| Data Format | SELFIES strings | Target representation format for model adaptation [7]. |
| Training Objective | Masked Language Modeling (MLM) | Standard self-supervised objective where the model learns to predict randomly masked tokens [7]. |
| Computational Resource | Single NVIDIA A100 GPU | Enables completion within 12 hours, making the approach accessible [7]. |
| Tokenizer | Original SMILES tokenizer | No vocabulary changes, testing adaptability despite syntactic differences [7]. |
Methodology:
Tokenizer Feasibility Check: Verify that the original SMILES tokenizer can process SELFIES strings without excessive unknown tokens ([UNK]) or truncation, as SELFIES shares much of its atomic vocabulary with SMILES [7].
Continued Pretraining: Perform masked language modeling on the SELFIES-formatted PubChem dataset using the original model architecture and tokenizer [7].
Evaluation: Assess the adapted model on downstream tasks using both embedding-level analysis (e.g., t-SNE visualization, property prediction with frozen embeddings) and full fine-tuning on benchmark datasets like ESOL, FreeSolv, and Lipophilicity [7].
This protocol demonstrates that a SMILES-pretrained transformer can be successfully adapted to SELFIES, achieving competitive performance with significantly reduced computational cost compared to training models from scratch [7].
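The masked language modeling objective at the heart of the continued-pretraining step can be illustrated with a minimal, stdlib-only masking function. This is a sketch of the objective only, not the study's training pipeline (which uses a transformer model and tokenizer); the SELFIES token list is an example input.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Return (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere, as in standard MLM training."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)      # ignored by the loss
    return masked, labels

# Benzene expressed as SELFIES tokens.
tokens = ["[C]", "[=C]", "[C]", "[=C]", "[C]", "[=C]", "[Ring1]", "[=Branch1]"]
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

During adaptation, the model minimizes the prediction loss over the masked positions only, which is how it learns SELFIES syntax without any labeled data.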
Table 3: Key Research Tools and Resources for Molecular Representation Experiments
| Resource Name | Type | Primary Function | Relevance to Representation Learning |
|---|---|---|---|
| MoleculeNet [2] [7] [4] | Benchmark Dataset Collection | Standardized evaluation datasets and metrics for molecular machine learning. | Provides consistent benchmarks (e.g., ESOL, SIDER, FreeSolv) for fair comparison of different representations and models. |
| SIDER [2] | Specialized Dataset | Curates data on drug side effects organized by organ class. | Critical for evaluating performance on real-world biomedical prediction tasks relevant to drug safety. |
| RDKit [7] | Cheminformatics Toolkit | Open-source software for cheminformatics and molecular manipulation. | Converts molecules between formats, computes descriptors, and handles SMILES/SELFIES conversion for data preprocessing. |
| PubChem [7] | Chemical Database | Public repository of chemical substances and their biological activities. | Source of large-scale, diverse molecular structures for pretraining and domain-adaptive pretraining. |
| ZINC-15 [4] | Commercial Compound Database | Curated collection of commercially available compounds for virtual screening. | Common source of millions of SMILES strings for large-scale pretraining of chemical language models. |
| ECFP Fingerprints [32] | Traditional Molecular Representation | Circular fingerprints encoding molecular substructures into fixed-length bit vectors. | Strong traditional baseline for comparing the performance of modern neural embedding approaches. |
| AMORE Framework [4] | Evaluation Framework | Zero-shot method for assessing representation robustness using SMILES augmentations. | Tests whether models learn true chemical semantics or merely memorize textual patterns in SMILES strings. |
Experimental evidence indicates that SELFIES representations consistently match or surpass SMILES across various model architectures, including LSTMs and Transformers, particularly when proper data augmentation or domain adaptation strategies are employed [2] [7]. The guaranteed validity of SELFIES strings offers significant advantages for generative applications and robust property prediction. However, traditional fingerprints like ECFP remain surprisingly competitive baselines that should not be overlooked in benchmarking studies [32].
Future research directions include developing more sophisticated cross-modal representations that integrate string-based, graph-based, and 3D structural information [31] [33]. Methods like OmniMol, which uses a hypergraph structure to model relationships between molecules and multiple properties, show promise for handling imperfectly annotated data and improving model explainability [33]. As the field progresses, standardized evaluation frameworks like AMORE [4] and comprehensive benchmarking [32] will be crucial for accurately measuring progress in molecular representation learning for drug discovery.
The accurate prediction of molecular properties is a critical step in accelerating drug discovery, particularly for complex targets like the Human Immunodeficiency Virus (HIV), toxicological endpoints, and the blood-brain barrier (BBB). The choice of molecular representation—how a chemical structure is converted into a computable format—fundamentally shapes the performance of these predictive models. Simplified Molecular Input Line Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) are two prominent string-based representations. This guide provides an objective comparison of their performance, supported by experimental data, to help researchers select the optimal representation for their specific property prediction tasks. Within the broader thesis of evaluating molecular representations, the evidence indicates that while SMILES, when paired with advanced tokenization, can achieve high accuracy, SELFIES offers inherent advantages in robustness for generative tasks.
Quantitative data from recent studies provide a direct comparison of model performance using SMILES and SELFIES representations across key property prediction tasks. The following table summarizes the core findings from a benchmark study that evaluated these representations using different tokenization strategies in BERT-based models.
Table 1: Performance Comparison (ROC-AUC) of SMILES and SELFIES with Different Tokenization Methods [1]
| Property Prediction Task | Representation | Tokenization Method | ROC-AUC Score |
|---|---|---|---|
| HIV | SMILES | Atom Pair Encoding (APE) | 0.826 |
| HIV | SMILES | Byte Pair Encoding (BPE) | 0.813 |
| HIV | SELFIES | Byte Pair Encoding (BPE) | 0.810 |
| Toxicity (Tox21) | SMILES | Atom Pair Encoding (APE) | 0.856 |
| Toxicity (Tox21) | SMILES | Byte Pair Encoding (BPE) | 0.842 |
| Toxicity (Tox21) | SELFIES | Byte Pair Encoding (BPE) | 0.840 |
| Blood-Brain Barrier (BBB) Penetration | SMILES | Atom Pair Encoding (APE) | 0.951 |
| Blood-Brain Barrier (BBB) Penetration | SMILES | Byte Pair Encoding (BPE) | 0.944 |
| Blood-Brain Barrier (BBB) Penetration | SELFIES | Byte Pair Encoding (BPE) | 0.938 |
The comparative data presented in Table 1 were generated through a standardized experimental protocol. The following workflow details the key steps involved in such a benchmarking study.
Figure 1: Workflow for benchmarking molecular representations.
Dataset Curation: The study utilized three publicly available benchmark datasets covering HIV inhibition, Tox21 toxicity endpoints, and blood-brain barrier penetration.
Molecular Representation and Tokenization: Each molecule was encoded as a SMILES or SELFIES string and tokenized with either Atom Pair Encoding (APE) or Byte Pair Encoding (BPE).
Model Training and Evaluation: BERT-based models were trained on the tokenized strings and evaluated on each classification task using ROC-AUC.
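Since ROC-AUC is the benchmark's scoring metric, it is worth recalling what it measures: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A compact rank-based (Mann-Whitney) sketch, shown stdlib-only for illustration; in practice one would use a library implementation such as scikit-learn's:

```python
def roc_auc(labels, scores):
    """ROC-AUC = P(score of a random positive > score of a random negative),
    counting ties as 0.5. labels are 0/1, scores are model outputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(roc_auc(labels, scores), 3))  # → 0.889
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the BBB scores above 0.93 in Table 1 indicate strong discriminative power.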
To implement the experimental protocols described, researchers rely on a suite of computational tools, datasets, and platforms. The following table outlines key resources for conducting research in this field.
Table 2: Essential Research Toolkit for Molecular Property Prediction
| Tool/Resource Name | Type | Primary Function | Relevance to Research |
|---|---|---|---|
| SMILES | Molecular Representation | A line notation for representing molecular structures using ASCII strings [1]. | The established standard for string-based molecular input; used as a baseline for comparison. |
| SELFIES | Molecular Representation | A 100% robust molecular representation where every string is syntactically and semantically valid [1]. | Critical for generative models and de novo molecular design to avoid invalid outputs. |
| Atom Pair Encoding (APE) | Tokenization Method | A novel tokenizer that breaks down SMILES/SELFIES by preserving atom-bond pairs [1]. | Enhances model performance by maintaining chemical context, as shown in benchmark studies. |
| Byte Pair Encoding (BPE) | Tokenization Method | A sub-word tokenization algorithm that iteratively merges frequent character sequences [1]. | A standard, general-purpose tokenizer used as a baseline for comparing tokenization strategies. |
| ToxCast Database | Toxicological Database | One of the largest public toxicity databases, providing high-throughput screening data for thousands of chemicals [34]. | A primary data source for training and validating predictive toxicology models. |
| RDKit | Cheminformatics Software | An open-source toolkit for cheminformatics, including computation of molecular descriptors and fingerprinting [36]. | Used for fundamental tasks like converting molecular formats, calculating descriptors, and handling chemical data. |
| BERT (Bidirectional Encoder Representations from Transformers) | Deep Learning Architecture | A transformer-based model pre-trained using a masked language modeling objective, adaptable to chemical languages [1]. | Serves as the backbone deep learning model for many state-of-the-art property prediction tasks. |
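Tokenization schemes like APE and BPE build larger units on top of an atom-level token stream. As a hedged point of reference (the cited studies' exact tokenizers may differ), the following sketch uses the widely adopted atom-level SMILES regular expression from the reaction-prediction literature:

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter halogens, organic
# subset atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/bond/ring tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1"))  # aspirin-like ester fragment
```

Sub-word methods such as BPE then merge frequent adjacent tokens into larger units, while APE constrains merges to preserve chemically meaningful atom-bond pairs.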
The choice between SMILES and SELFIES for molecular property prediction is not a simple binary decision. For supervised learning tasks like classification (e.g., predicting HIV activity, toxicity, or BBB penetration), SMILES representation coupled with a chemically-aware tokenizer like Atom Pair Encoding (APE) currently delivers the highest predictive accuracy, as evidenced by its superior ROC-AUC scores. However, for generative tasks such as the AI-driven design of new anti-HIV candidates [37] or de novo molecular optimization, the inherent robustness of SELFIES makes it a more reliable choice, as it guarantees that all generated outputs are valid molecules. Therefore, researchers should align their choice of molecular representation with their specific project goals: SMILES with advanced tokenization for maximum discriminative power, and SELFIES for robust and efficient exploration of chemical space in generative applications.
The choice of molecular representation is a foundational element in the application of artificial intelligence to de novo drug design, directly influencing a model's ability to generate valid, novel, and optimized chemical structures [1] [38]. Simplified Molecular-Input Line-Entry System (SMILES) has long been the standard string-based representation, but its susceptibility to generating invalid outputs has prompted the development of more robust alternatives [5]. Self-Referencing Embedded Strings (SELFIES) was introduced to guarantee 100% molecular validity through a grammar-based approach that ensures every string corresponds to a syntactically and semantically correct molecule [16] [5]. This guide provides an objective comparison of the performance of SMILES, SELFIES, and emerging fragment-based representations across key generative tasks, synthesizing current experimental data to inform researchers and drug development professionals.
The following tables summarize key performance metrics from recent studies evaluating SMILES, SELFIES, and other representations in generative tasks.
Table 1: Performance Metrics in Generative Modeling Tasks
| Representation | Theoretical Validity | Novelty (vs. Training Set) | Uniqueness (in Generated Set) | Diversity (FCD Score) | Key Strengths |
|---|---|---|---|---|---|
| SMILES | ~85-96% [39] | High (>90% in STONED) [16] | Varies by model [40] | Moderate | High interpretability, established use |
| SELFIES | ~100% [16] [5] | High (>90% in STONED) [16] | High with GA [16] | Moderate to High [41] | Guaranteed validity, enables new algorithms |
| t-SMILES (Fragment) | ~100% [38] | Higher than SMILES/SELFIES [38] | High [38] | Higher than SMILES/SELFIES [38] | Multiscale representation, reduces overfitting |
Table 2: Performance in Goal-Directed Optimization (e.g., DRD2, JNK3, GSK3β Activity)
| Representation | Model Type | Diverse Hits (#Circles Metric) | Sample Efficiency | Notable Findings |
|---|---|---|---|---|
| SMILES | LSTM (REINVENT) | High [41] | High [41] | Best overall performance in constrained benchmarks [41] |
| SELFIES | Genetic Algorithm (STONED) | Moderate [41] | Moderate [41] | Robustness allows simple algorithms to perform well [16] [41] |
| Graph | GraphGA | Lower [41] | Lower [41] | - |
| t-SMILES | Transformer | N/A | N/A | Significantly outperforms SOTA baselines in goal-directed tasks on ChEMBL [38] |
Standardized benchmarks like GuacaMol and MOSES are used to evaluate generative models [41] [39]. Key evaluation protocols cover the validity, uniqueness, novelty, and diversity of the generated molecules.
A representative study [40] provides a clear protocol for comparing representations.
The STONED genetic algorithm showcases a methodology that leverages SELFIES' robustness: because every mutated string still decodes to a valid molecule, random point mutations can be applied freely during exploration [16] [41].
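The mutation step at the core of a STONED-style search can be sketched as follows. The tiny alphabet here is illustrative only; the real algorithm mutates over the full semantically robust SELFIES alphabet and decodes candidates with the selfies package.

```python
import random
import re

# Illustrative (not exhaustive) SELFIES token alphabet for mutations.
ALPHABET = ["[C]", "[=C]", "[O]", "[N]", "[Branch1]", "[Ring1]"]

def split_selfies(s):
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", s)

def mutate(selfies_str, rng):
    """Replace one randomly chosen token with a random alphabet token.
    Every result remains a decodable SELFIES string by construction."""
    tokens = split_selfies(selfies_str)
    i = rng.randrange(len(tokens))
    tokens[i] = rng.choice(ALPHABET)
    return "".join(tokens)

rng = random.Random(42)
parent = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"  # benzene in SELFIES
child = mutate(parent, rng)
```

With SMILES, the same blind substitution would frequently produce unparseable strings; the guarantee that any token sequence decodes to a molecule is what lets such simple operators perform well.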
The following diagram illustrates a generalized workflow for de novo molecule design, integrating the choice of molecular representation and key validation steps.
Diagram 1: Generalized workflow for de novo molecule design with evaluation loops.
Table 3: Essential Tools for Experimentation in De Novo Molecular Design
| Tool / Resource | Type | Primary Function | Relevance to Representation Comparison |
|---|---|---|---|
| RDKit | Software Library | Cheminformatics and machine learning. | Calculates molecular properties (QED, SA), validates generated SMILES, and handles standardization [40] [39]. |
| ChEMBL | Database | Curated database of bioactive molecules. | Primary source of high-quality training and benchmarking data for drug discovery projects [40] [42]. |
| SELFIES Python Package | Library | Encoder/decoder for SELFIES. | Converts SMILES to SELFIES and back, enabling direct experimentation with the representation [16]. |
| t-SMILES Framework | Algorithmic Framework | Generates fragment-based molecular strings. | Provides a multi-code system for robust, multiscale molecular representation [38]. |
| GuacaMol / MOSES | Benchmarking Suite | Standardized frameworks for evaluating generative models. | Provides metrics and baselines for fair comparison of models using different representations [41]. |
| DRAGONFLY | Deep Learning Model | Interactome-based de novo design. | Example of a model that uses SMILES and integrates both ligand and 3D protein structure information [42]. |
| STONED Algorithm | Genetic Algorithm | Efficient combinatorial exploration of chemical space. | Showcases the use of SELFIES in a non-deep-learning approach, leveraging its robustness for random mutations [16] [41]. |
The experimental data indicates that no single molecular representation is universally superior across all tasks. SMILES remains a strong candidate, particularly in autoregressive models like LSTMs, due to its established use and high performance in goal-directed optimization [41]. However, its primary drawback is the non-negligible rate of invalid generation, which can waste computational resources [39].
SELFIES' key advantage is its guaranteed 100% validity, which simplifies model architectures and enables powerful new algorithms like high-mutation-rate genetic algorithms (STONED) and denser latent spaces in VAEs [16]. This robustness makes it exceptionally valuable for applications where validity is paramount and for exploring chemical space more aggressively.
Emerging fragment-based representations like t-SMILES show significant promise by offering a multiscale approach that captures higher-order structural information [38]. They demonstrate superior performance in many benchmarks, including the ability to avoid overfitting on low-resource datasets and achieving higher novelty while maintaining property distributions.
In conclusion, the selection of a molecular representation should be guided by the specific task. For maximum performance in goal-directed optimization with deep learning, SMILES-based models are highly competitive. For applications requiring maximum robustness, exploration, or simplicity of implementation, SELFIES is an excellent choice. Fragment-based representations represent the cutting edge, offering a powerful and flexible paradigm for the next generation of generative models in drug discovery.
Generative deep learning has revolutionized de novo drug design, enabling the on-demand generation of molecules with tailored properties. The success of these models heavily relies on the molecular representation used. String-based notations, particularly the Simplified Molecular Input Line Entry System (SMILES), have played a pivotal role due to their ability to encode molecular graphs as text sequences [43]. However, SMILES and other atom-level representations like SELFIES (SELF-referencing Embedded Strings) have inherent limitations. They often distribute information about chemical fragments across the string, can struggle with capturing chirality effectively, and may generate molecules with challenging synthetic accessibility [43] [13].
To address these challenges, the field is undergoing a paradigm shift from atom-level to fragment-based representations, mirroring the evolution in natural language processing from character- to word-level models [43]. This guide focuses on two advanced fragment-based representations: fragSMILES and Group SELFIES. We will objectively compare their performance against traditional methods and each other, providing the experimental data and protocols needed for informed adoption in computational drug discovery.
The fragSMILES algorithm constructs a fragment-based string through a three-phase process of disassembling, graph reduction, and string conversion [43] [44].
Fragment connection points are annotated with indexed tags (e.g., <index>) to track connections, and chirality is specified with suffix tags (e.g., <2R>) [43]. This process creates a "chemical-word"-level representation where each token corresponds to a meaningful chemical building block.
Group SELFIES builds upon the SELFIES framework, which is guaranteed to generate 100% valid molecules by design. Its key innovation is the introduction of group tokens that represent entire functional groups or substructures, thereby embedding chemical inductive biases directly into the representation [45].
While the exact fragmentation methodology is less detailed in the available literature compared to fragSMILES, Group SELFIES maintains the chemical robustness guarantees of its parent SELFIES. Any string generated, even when using group tokens, will always correspond to a chemically valid molecule. This hybrid approach allows the representation to leverage the benefits of both fragment-based semantics and guaranteed validity [45].
In de novo design, the goal is to train a model that can generate novel, valid, and diverse molecules that mirror the chemical and property space of the training data. Experiments comparing representations on the ZINC-250k dataset reveal significant differences in encoding compactness, which impacts model learning.
Table 1: Encoding Compactness on ZINC-250k
| Representation | Average Token Length | Vocabulary Size |
|---|---|---|
| SMILES | 44 | Not Specified |
| SELFIES | 37 | Not Specified |
| Group SELFIES | 30 | Not Specified |
| fragSMILES | 17 | 5,869 |
Data from [43] shows that fragSMILES achieves the most compact encoding, with an average sequence length less than half that of SMILES. This compactness, coupled with a controlled vocabulary size, simplifies the learning task for language models.
Further studies trained Recurrent Neural Networks (RNNs) on datasets from ChEMBL and ZINC, evaluating the quality of the generated molecules. fragSMILES demonstrated a strong ability to explore chemical space and generate molecules with desirable scaffold properties [43].
Chemical reaction prediction is a stringent test for a representation's ability to capture chemically meaningful transformations. A systematic study using the USPTO database (over 1 million reactions) and Transformer models provides clear performance metrics for forward and retro-synthesis [44].
Table 2: Reaction Prediction Accuracy on USPTO Dataset (Top-1, %)
| Representation | Forward Prediction Validity | Forward Prediction Accuracy | Retrosynthesis Validity | Retrosynthesis Accuracy |
|---|---|---|---|---|
| SMILES | 96.3% | 49.9% | 41.7% | Not Specified |
| SELFIES | 96.4% | 21.0% | 79.7% | Not Specified |
| SAFE | 92.8% | 30.2% | 43.6% | Not Specified |
| t-SMILES | 100.0% | 6.1% | 73.3% | Not Specified |
| fragSMILES | 97.3% | 53.4% | 55.8% | Not Specified |
fragSMILES achieved the highest top-1 accuracy for forward reaction prediction, significantly outperforming SMILES and other representations. It also maintained high validity rates, second only to t-SMILES. The study highlighted fragSMILES's superior capacity to handle reactions involving stereocenters, a critical aspect of synthesis planning [44].
A key advertised advantage of fragSMILES is its explicit and consistent annotation of chiral centers, overcoming a known limitation of SMILES strings [43]. In the fragSMILES string, chirality is indicated as a suffix tag on connector atom indices (e.g., <2R>) or as a special suffix for non-connector atoms. This ensures that stereochemical information is preserved unambiguously, regardless of the graph traversal order used to generate the string [43].
Both fragSMILES and Group SELFIES, by virtue of being fragment-based, implicitly bias the generation process towards molecules built from common chemical building blocks. This inherently improves the synthetic accessibility of the generated molecules compared to those from atom-level models, which can produce chemically awkward or unstable structures [43].
The following workflow, derived from the official fragSMILES GitHub repository, outlines the steps for reproducing key experiments [46]:
Use the chemicalgof package or similar functions to disassemble each molecule into its fragSMILES representation; the default cleavage rule targets exocyclic single bonds. A similar protocol is used to benchmark representations for reaction prediction, training Transformer models on the USPTO dataset for forward and retrosynthesis tasks [44].
Table 3: Key Computational Tools and Datasets
| Item Name | Type | Function in Research |
|---|---|---|
| ZINC-250k | Dataset | A curated database of ~250,000 drug-like molecules, commonly used for benchmarking generative models and representation learning [43]. |
| ChEMBL | Dataset | A large-scale database of bioactive molecules with drug-like properties, used for training models in a practical drug discovery context [47] [46]. |
| USPTO | Dataset | Contains over 1 million chemical reaction patents, serving as the standard benchmark for forward and retrosynthesis prediction tasks [44]. |
| Transformer Architecture | Model | The de facto standard neural network architecture for sequence-to-sequence tasks like reaction prediction and molecular generation [44]. |
| Word-RNN | Model | A Recurrent Neural Network variant adapted to process "word-level" (fragment-based) tokens, used for training generative models on fragSMILES [46]. |
| chemicalgof | Software | A Python package used for the molecular decomposition and graph reduction steps required to generate fragSMILES strings [46]. |
| Group SELFIES Library | Software | The official open-source implementation of Group SELFIES, used to create and manipulate molecules with group tokens [45]. |
The experimental data clearly differentiates the strengths of fragSMILES and Group SELFIES. fragSMILES excels in tasks requiring high predictive accuracy and explicit stereochemical control, such as reaction prediction and chiral-aware molecule generation. Its compact, interpretable representation makes it a powerful tool for complex biochemical design tasks [43] [44].
Group SELFIES, on the other hand, provides a compelling option for applications where guaranteed molecular validity is the highest priority. Its foundation on the robust SELFIES syntax ensures that every generated string is chemically valid, reducing the need for post-processing and making it particularly suitable for autonomous molecular discovery pipelines [45].
The choice between these representations is not a matter of which is universally better, but which is more appropriate for the specific research goal. For synthesis-oriented projects where chiral correctness is paramount, fragSMILES holds an edge. For robust, high-throughput generation of valid molecules, Group SELFIES is a strong candidate. As the field progresses, hybrid approaches that leverage the strengths of multiple representations may offer the most powerful path forward for generative drug discovery.
In computational chemistry and drug discovery, the representation of molecules as machine-readable strings is foundational for applying deep learning methodologies. The non-uniqueness problem—whereby a single molecule can be represented by multiple valid string sequences—presents a significant challenge for chemical language models (CLMs). This phenomenon is particularly prevalent in the Simplified Molecular Input Line Entry System (SMILES), where the same molecular structure can yield different string representations depending on the starting atom and the traversal path of the molecular graph [47]. For example, a simple molecule like benzene can be written as "c1ccccc1" or "C1=CC=CC=C1" [1]. This lack of bijective mapping can impair model performance by treating identical molecules as distinct entities, thereby confusing the learning process.
To address this challenge, researchers have developed two complementary approaches: canonicalization and augmentation. Canonicalization provides a deterministic method to generate a single, unique representation for each molecule, thereby enforcing consistency. Conversely, augmentation deliberately leverages non-uniqueness to artificially expand training datasets, exposing models to the varied representations they might encounter in real-world applications. A more recent development, the SELF-referencing Embedded String (SELFIES) representation, offers a fundamentally different approach by guaranteeing that every possible string corresponds to a valid molecule, though it does not fully resolve the non-uniqueness issue [48]. This guide provides a comprehensive comparison of these strategies, supported by experimental data, to inform researchers and drug development professionals in selecting appropriate methodologies for their specific applications.
The core of the non-uniqueness problem lies in the fundamental design of the molecular representations themselves. SMILES and SELFIES take philosophically different approaches to encoding chemical structures, each with distinct implications for machine learning.
SMILES representations, while human-readable and widely adopted, are inherently non-unique and can produce semantically invalid strings when generated by models [1]. A SMILES string is generated by performing a depth-first traversal of the molecular graph, leading to multiple valid representations for the same compound. Furthermore, standard SMILES do not inherently enforce chemical validity rules, meaning that small changes or model errors can produce strings that correspond to impossible molecules, such as a carbon atom with five bonds [48].
SELFIES was designed specifically to address these limitations for machine learning applications. Its core innovation is a formal grammar based on a finite state machine that tracks available valence bonds during the string interpretation process [48]. This design guarantees that every possible SELFIES string, without exception, represents a syntactically and chemically valid molecule. In experimental validations, when SMILES and SELFIES representations of MDMA were subjected to random mutations, SMILES strings rapidly degraded to only 26.6% validity after just one mutation, while SELFIES maintained 100% validity across mutations [48]. However, it is crucial to note that SELFIES does not solve the non-uniqueness problem; a single molecule can still have multiple valid SELFIES representations. Its primary contribution is ensuring robustness against invalid structure generation.
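The fragility of SMILES under mutation can be demonstrated with a toy experiment. The check below is purely syntactic (balanced parentheses, paired ring-closure digits) and is a weak proxy for the full chemical validation via RDKit used in the cited study; it catches only syntactic breakage, not valence errors, so real validity rates would be lower still.

```python
import random

# Characters a random mutation may substitute in; illustrative only.
CHARSET = "CNOcno()=#123"

def mutate_smiles(smiles, rng):
    """Replace one randomly chosen character with a random one."""
    i = rng.randrange(len(smiles))
    return smiles[:i] + rng.choice(CHARSET) + smiles[i + 1:]

def looks_syntactically_ok(smiles):
    """Naive check: balanced parentheses and evenly paired ring digits."""
    depth = 0
    ring_digits = {}
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return depth == 0 and all(c % 2 == 0 for c in ring_digits.values())

rng = random.Random(7)
smiles = "CC(=O)Oc1ccccc1"
survivors = sum(looks_syntactically_ok(mutate_smiles(smiles, rng))
                for _ in range(1000))
print(f"{survivors / 10:.1f}% pass the syntactic check after one mutation")
```

An equivalent experiment on SELFIES needs no check at all: any token substitution decodes to a valid molecule by construction.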
Table 1: Fundamental Comparison of SMILES and SELFIES Representations
| Feature | SMILES | SELFIES |
|---|---|---|
| Uniqueness | Non-unique; multiple representations per molecule | Non-unique; multiple representations per molecule |
| Validity Guarantee | No - models can generate invalid strings | Yes - every string is chemically valid |
| Primary Strength | Human-readable, widespread adoption | Robustness for generative models |
| Primary Limitation | Invalid generation, stereochemistry challenges | More complex tokenization, newer ecosystem |
| Tokenization | Atoms, bonds, branches, rings | Atoms, bonds, and specialized derivation rules (e.g., [Branch1], [Ring1]) |
Canonicalization addresses non-uniqueness by establishing a deterministic algorithm that produces a single, canonical representation for each molecular structure. This approach provides consistency but may sacrifice the representational diversity that can benefit model training.
The canonicalization process typically employs algorithms that assign a unique ordering to atoms within a molecule, often implemented through tools like RDKit. The process begins with hydrogen-suppressed molecular graphs where hydrogens are implicitly represented. The algorithm then assigns canonical labels to each atom based on invariant properties such as atomic number, degree, and aromaticity, breaking ties through iterative refinement until a unique ordering is achieved [49]. The SMILES string is generated by traversing the graph according to this canonical atom ordering, ensuring that the same molecule always produces the identical string representation. For SELFIES, while a formal canonicalization method isn't as established, conversion typically occurs through an intermediate canonical SMILES string to ensure consistency [48].
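The iterative-refinement idea behind canonical atom ordering can be sketched on a toy molecular graph: start from simple atom invariants and repeatedly re-rank each atom by its own rank plus the sorted ranks of its neighbors until the partition stops changing. Real canonicalizers (e.g., RDKit's) add tie-breaking and many more invariants; this shows only the core loop.

```python
def rank_of(keys):
    """Map each key to its rank in the sorted set of distinct keys."""
    order = sorted(set(keys))
    return [order.index(k) for k in keys]

def canonical_ranks(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    # Initial invariant: (element, degree).
    keys = [(atoms[i], len(neighbours[i])) for i in range(len(atoms))]
    ranks = rank_of(keys)
    while True:
        # Refine: combine own rank with the multiset of neighbour ranks.
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbours[i])))
                for i in range(len(atoms))]
        new_ranks = rank_of(keys)
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks

# Propan-1-ol heavy atoms, C-C-C-O: refinement separates the two middle
# carbons that initially share the same (element, degree) invariant.
print(canonical_ranks(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]))  # → [0, 1, 2, 3]
```

Emitting the SMILES string by traversing atoms in this canonical order is what guarantees the same molecule always produces the same string.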
Recent systematic evaluations demonstrate that canonicalization plays a valuable role in training efficiency and model interpretability. Studies have shown that while non-canonical SMILES can provide benefits through inherent augmentation in certain scenarios, canonicalization is recommended when computational resources are limited, as it improves training efficiency without substantially sacrificing performance [49].
In probing experiments that analyzed the latent spaces of chemical language models, configurations using canonical SMILES with atomwise tokenization produced more chemically structured embeddings, suggesting a deeper internalization of chemical context [49]. This organization made the embeddings more interpretable and semantically meaningful, as evidenced by better performance in molecular property prediction tasks. For standard prediction tasks where robustness against diverse string representations is not critical, canonical SMILES input provides a practical and reliable setup.
Augmentation takes a contrasting approach by deliberately leveraging non-uniqueness to create multiple representations of the same molecule, thereby artificially expanding training datasets and encouraging models to learn invariant features.
The most established augmentation technique is SMILES enumeration (or randomization), wherein multiple valid SMILES strings are generated for the same molecule by varying the starting atom and traversal direction [47]. This approach has demonstrated significant benefits for the quality of de novo molecule design, particularly in low-data scenarios where it helps prevent overfitting and improves generalizability [47]. Beyond generative tasks, SMILES enumeration has improved model quality in diverse applications including organic synthesis planning, bioactivity prediction, and supramolecular chemistry [47].
Recent research has introduced more sophisticated augmentation techniques that move beyond simple enumeration, including token deletion, atom masking, bioisosteric substitution, and self-training.
Table 2: Performance of Augmentation Strategies Across Dataset Sizes [47]
| Augmentation Method | Optimal Probability (p) | Validity on Small Datasets (<2500 molecules) | Validity on Large Datasets (>7500 molecules) | Notable Strength |
|---|---|---|---|---|
| SMILES Enumeration (Baseline) | N/A | High | High | Reliable performance across scenarios |
| Token Deletion | 0.05 | Moderate | Declining with size | Creates novel scaffolds |
| Atom Masking | 0.05 | High | High | Learns properties in low-data regimes |
| Bioisosteric Substitution | 0.15 | Moderate | High | Incorporates medicinal chemistry knowledge |
| Self-Training | N/A | High | High | Consistently higher validity than enumeration |
The evaluation of these augmentation strategies follows a systematic methodology to assess their impact on model performance across various conditions. In a typical experimental setup, researchers train chemical language models (often recurrent neural networks with Long Short-Term Memory units) using augmented datasets while varying key parameters [47]: the probability of perturbation (p) at different levels (0.05, 0.15, 0.30), the augmentation fold (1-, 3-, 5-, and 10-fold expansion beyond the original dataset size), and training set size (ranging from 1,000 to 10,000 molecules from sources like ChEMBL). Models are then evaluated on their ability to learn chemical syntax through metrics including validity (percentage of generated SMILES that represent chemically valid molecules), uniqueness (percentage of non-duplicated molecules), and novelty (percentage of generated molecules not present in the training set) [47].
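The three generation metrics reduce to a few lines of set arithmetic. The validity predicate below is a deliberately naive placeholder (a real pipeline would use an RDKit parse check such as `Chem.MolFromSmiles(s) is not None`), and the denominators follow the common convention of uniqueness-among-valid and novelty-among-unique:

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty for a batch of generated strings.
    `is_valid` is a caller-supplied predicate; in practice this would be a
    cheminformatics parse check rather than the toy used below."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }

gen = ["CCO", "CCO", "CCN", "C(("]   # last string is syntactically broken
train = {"CCO"}
# Toy validity check: balanced parentheses only (stand-in for a real parser).
m = generation_metrics(gen, train, is_valid=lambda s: s.count("(") == s.count(")"))
print(m)  # validity 0.75, uniqueness ~0.667, novelty 0.5
```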
The choice between canonicalization and augmentation strategies involves trade-offs that become apparent when evaluating performance across different computational chemistry tasks.
In generative tasks for de novo molecular design, augmentation strategies generally outperform canonicalization approaches. Experimental results demonstrate that most augmentation methods achieve higher validity rates compared to non-augmented baselines, with the beneficial effects being more pronounced at higher augmentation folds and with smaller training set sizes [47]. Notably, self-training augmentation consistently performs better than enumeration across all dataset sizes, while atom masking shows particular promise for learning desirable physicochemical properties in very low-data regimes [47]. Token deletion, while sometimes producing lower validity rates, demonstrates unique strengths in fostering the creation of novel molecular scaffolds.
For molecular property prediction tasks, the comparative landscape is more nuanced. A 2025 study systematically evaluated how design choices—including representation format (SMILES vs. SELFIES) and tokenization strategy—affect performance on MoleculeNet benchmarks [49]. While downstream task performance was often similar across configurations, substantial differences emerged in the structure and interpretability of internal representations. The study found that a RoBERTa-based model with canonical SMILES input and atomwise tokenization provides a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it [49].
Table 3: Performance Comparison on Molecular Property Prediction Tasks (ROC-AUC) [49]
| Model Configuration | BBBP | BACE | HIV | Tox21 | ClinTox |
|---|---|---|---|---|---|
| RoBERTa + Canonical SMILES | 0.901 | 0.832 | 0.779 | 0.801 | 0.913 |
| RoBERTa + SELFIES | 0.892 | 0.829 | 0.768 | 0.794 | 0.902 |
| BART + Canonical SMILES | 0.895 | 0.828 | 0.772 | 0.798 | 0.908 |
| BART + SELFIES | 0.887 | 0.821 | 0.763 | 0.789 | 0.897 |
A critical aspect of addressing non-uniqueness is developing models that recognize chemically equivalent representations as identical. The Augmented Molecular Retrieval (AMORE) framework was specifically designed to evaluate this capability by measuring how stable a model's internal representations are across different SMILES variants of the same molecule [4]. Experiments using AMORE revealed that existing chemical language models often lack robustness, with their embedding spaces significantly altered by SMILES augmentations that should be invariant transformations [4]. This finding underscores that standard NLP metrics are insufficient for chemical tasks and that targeted approaches are needed to ensure models learn true molecular invariance rather than superficial text patterns.
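A simplified proxy for this kind of robustness check is to embed several SMILES variants of one molecule and measure how tightly the embeddings cluster. AMORE's actual evaluation is retrieval-based [4]; the mean pairwise cosine score below is only a minimal sketch of the invariance idea:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def invariance_score(variant_embeddings):
    """Mean pairwise cosine similarity across embeddings of different SMILES
    variants of one molecule; 1.0 means the model is perfectly invariant."""
    n = len(variant_embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(variant_embeddings[i], variant_embeddings[j])
               for i, j in pairs) / len(pairs)

# Identical embeddings for every variant -> perfectly invariant model.
print(invariance_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]))  # 1.0
# Orthogonal embeddings -> the model treats the variants as unrelated.
print(invariance_score([[1.0, 0.0], [0.0, 1.0]]))              # 0.0
```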
Implementing these strategies requires specific computational tools and resources. The following table catalogues essential "research reagents" for working with molecular representations in machine learning applications.
Table 4: Essential Research Reagents for Molecular Representation Research
| Tool/Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| RDKit | Software Library | Cheminformatics and molecule manipulation | Canonicalization, descriptor calculation, format conversion |
| SELFIES Python Library | Software Library | Conversion and manipulation of SELFIES strings | Generating guaranteed-valid molecular representations |
| SwissBioisostere Database | Database | Bioisosteric replacement patterns | Bioisosteric substitution augmentation |
| PubChem | Database | Large-scale molecular structures and properties | Source of training data and benchmarking compounds |
| ChemBERTa | Pre-trained Model | Transformer-based molecular representation | Baseline model for property prediction tasks |
| ChEMBL | Database | Bioactive molecules with drug-like properties | Training data for drug discovery applications |
| MoleculeNet | Benchmark Suite | Standardized tasks for molecular property prediction | Model evaluation and comparison |
| Hugging Face Transformers | Software Library | NLP model implementations and tokenization | Building and training chemical language models |
The non-uniqueness problem in molecular representations presents both a challenge and an opportunity for computational chemistry and drug discovery. Through systematic evaluation of canonicalization and augmentation strategies, several key recommendations emerge for practitioners:
For generative molecular design tasks, particularly in low-data scenarios, augmentation strategies provide significant benefits, with self-training and atom masking showing particular promise for improving validity and property learning [47]. For molecular property prediction tasks, canonical SMILES with atomwise tokenization offers a reliable and computationally efficient approach, producing well-structured latent spaces without sacrificing performance [49]. When model robustness and invariance are priorities, evaluation frameworks like AMORE should be employed to ensure models truly learn chemical equivalence rather than superficial text patterns [4].
The choice between SMILES and SELFIES involves fundamental trade-offs: SMILES offers maturity and widespread adoption, while SELFIES provides guaranteed validity that is particularly valuable for generative applications. Emerging approaches like domain-adaptive pretraining demonstrate that transformers pretrained on SMILES can be effectively adapted to SELFIES without expensive retraining, offering a practical middle ground [7].
As the field advances, the optimal solution to the non-uniqueness problem will likely involve context-aware strategies that combine the deterministic consistency of canonicalization with the robust generalization enabled by thoughtful augmentation, ultimately enabling more accurate and efficient exploration of the vast chemical space for drug discovery and materials science.
In the field of AI-driven drug discovery and materials science, how molecules are represented in a computer fundamentally shapes the effectiveness of generative models. Traditional representation methods often struggle with syntactic validity, where generated molecular strings fail to correspond to chemically plausible structures. This challenge has propelled the development of SELFIES (Self-Referencing Embedded Strings), a representation specifically designed to guarantee 100% syntactic and chemical validity. Unlike its predecessor SMILES (Simplified Molecular-Input Line-Entry System), whose strings can encode invalid structures with incorrect atom valencies or outright syntax errors, SELFIES incorporates formal grammar rules that ensure every possible string, even a randomly generated one, decodes to a molecule with chemically valid bonds and atoms [16]. This robustness makes SELFIES particularly valuable for generative applications, where maintaining structural validity during exploration of chemical space is paramount for discovering novel therapeutic compounds and functional materials.
The fundamental difference between SMILES and SELFIES lies in their underlying architecture and how they handle chemical constraints. SMILES, developed over 30 years ago, represents molecules as linear strings of ASCII characters, using symbols for atoms, bonds, and parentheses for branching. However, it lacks built-in mechanisms to enforce chemical validity, making it prone to generating impossible structures in AI models [16]. SELFIES addresses this limitation through a novel approach based on formal grammar and finite state automata. This design treats each SELFIES string as a small computer program with minimal memory, localizing non-local features like rings and branches and encoding physical constraints directly into the derivation state [16].
Table 1: Fundamental Differences Between SMILES and SELFIES
| Feature | SMILES | SELFIES |
|---|---|---|
| Validity Guarantee | No inherent guarantee | 100% robust - all strings valid [16] |
| Representation Basis | Atom chains with brackets for branches/rings [1] | Formal grammar (Chomsky type-2) [16] |
| Handling of Branches/Rings | Non-local indicators (parentheses, numbers) [16] | Localized length indicators [16] |
| Chemical Constraint Enforcement | None - can violate valency rules [1] | Built-in valency checks during derivation [16] |
| Human Readability | Moderate | Moderate (similar to SMILES) [16] |
The architectural advantages of SELFIES translate directly to superior performance in validity metrics. When subjected to random mutations—a simulation of operations common in generative models and genetic algorithms—SELFIES maintains nearly perfect validity while SMILES deteriorates significantly. In one striking experiment, random mutations applied to the MDMA molecule (ecstasy) demonstrated that SELFIES consistently produced valid molecular structures, whereas SMILES frequently generated invalid strings that failed to correspond to chemically plausible molecules [16]. This inherent robustness stems from SELFIES' ability to dynamically adjust bond types and ignore impossible connections based on available valency at each step in the decoding process, effectively preventing physically impossible configurations like F=O=F where atoms would exceed their natural bonding capacities [16].
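The valency-capping mechanism can be illustrated with a toy decoder. This is not the real SELFIES grammar, only the constraint idea: each requested bond order is reduced to the free valence still available on both atoms, so an input equivalent to F=O=F degrades gracefully to the valid F-O-F instead of an over-bonded structure:

```python
# Toy illustration of SELFIES-style valency enforcement (not the real grammar).
VALENCE = {"F": 1, "O": 2, "N": 3, "C": 4}

def decode(tokens):
    """tokens: list of (atom_symbol, requested_bond_order_to_previous_atom).
    Returns (atom, actual_bond_order) pairs after capping each bond by the
    free valence remaining on both partner atoms."""
    mol = []
    prev_free = 0                       # free valence left on the previous atom
    for atom, requested in tokens:
        free = VALENCE[atom]
        order = min(requested, prev_free, free) if mol else 0
        mol.append((atom, order))
        prev_free = free - order
    return mol

# "F=O=F": the first "=" is capped to a single bond (F has valence 1),
# leaving O with one free valence, so the second "=" is also capped.
print(decode([("F", 0), ("O", 2), ("F", 2)]))
# [('F', 0), ('O', 1), ('F', 1)]  ->  F-O-F, a chemically valid molecule
```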
Rigorous benchmarking across multiple research studies has demonstrated clear performance differences between SMILES and SELFIES across various applications. The following table synthesizes experimental results from validity, generative performance, and specific application contexts:
Table 2: Experimental Performance Comparison of SMILES vs. SELFIES
| Experiment/Context | SMILES Performance | SELFIES Performance | Notes |
|---|---|---|---|
| Random String Validity | Very low validity rate [16] | 100% validity even for random strings [16] | Critical for generative exploration |
| Genetic Algorithm Applications | Requires sophisticated hand-crafted mutation rules [16] | Enables arbitrary random modifications as mutations [16] | Simplifies algorithm design |
| Chemical Image Recognition | Best overall performance in translation accuracy [50] | Guarantees valid chemical structures [50] | SMILES more accurate, SELFIES more robust |
| Large Language Model Generation | Higher BLEU scores but validity challenges [51] | Lower BLEU scores but perfect validity [51] | Trade-off between fluency and correctness |
| Scaffold Hopping | Limited by validity constraints [3] | Enables broader exploration of chemical space [3] | Better for discovering novel scaffolds |
A 2025 study investigated whether a SMILES-pretrained transformer (ChemBERTa-zinc-base-v1) could be adapted to SELFIES using domain-adaptive pretraining without architectural changes [7]. The methodology involved:
The domain-adapted model outperformed the original SMILES baseline and slightly surpassed ChemBERTa-77M-MLM across most targets, despite a 100-fold difference in pretraining scale [7]. This demonstrates that SELFIES adaptation offers a cost-efficient alternative for molecular property prediction, making robust representation accessible to groups with limited computational resources.
The following diagram illustrates a typical experimental workflow for implementing SELFIES in generative molecular design:
Diagram 1: SELFIES Generative Model Workflow
Implementing SELFIES in research workflows begins with basic conversion between molecular representations. The Python selfies library provides straightforward tools for this purpose:
This interoperability allows researchers to maintain existing SMILES-based pipelines while leveraging SELFIES' robustness advantages for generation tasks [16].
A significant extension of the SELFIES framework is Group SELFIES, which introduces tokens representing functional groups or entire substructures while maintaining the original robustness guarantees [52]. This approach bridges the gap between atomic representations and the way human chemists typically conceptualize molecules—in terms of meaningful substructures and functional groups. In Group SELFIES, tokens can represent common chemical motifs like carboxyl groups or phenyl rings, creating a more compact and chemically intuitive representation [52].
Table 3: Group SELFIES Advantages Over Standard SELFIES
| Feature | Standard SELFIES | Group SELFIES |
|---|---|---|
| Representation Level | Atomic [52] | Fragment-based [52] |
| Chemical Interpretability | Moderate | High (human-like thinking) [52] |
| String Length | Longer atomic sequences | Shorter through group tokens [52] |
| Extended Chirality Support | Limited | Yes, through chiral group tokens [52] |
| Distribution Learning | Baseline | Improved [52] |
Experimental results demonstrate that Group SELFIES improves distribution learning of common molecular datasets and enhances the quality of molecules generated through random sampling compared to regular SELFIES strings [52]. This suggests that incorporating chemical prior knowledge at the fragment level creates a more efficient search space for generative models to explore.
Addressing the practical challenge that large language models (LLMs) currently perform worse with SELFIES than SMILES—likely due to SMILES' longer presence in training data—researchers developed SmiSelf, a cross-chemical language framework [51]. This approach leverages the strengths of both representations:
This hybrid approach guarantees 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics, effectively expanding LLMs' practical applications in biomedicine [51].
Successful implementation of SELFIES in research workflows requires specific computational tools and resources:
Table 4: Essential Research Reagents and Tools for SELFIES Experiments
| Tool/Resource | Type | Function | Access |
|---|---|---|---|
| selfies Python Library | Software Package | Core conversion between SMILES and SELFIES [16] | GitHub/PyPI |
| RDKit | Cheminformatics Toolkit | Fundamental cheminformatics operations and validation | Open Source |
| Transformer Models (BERT-based) | Architecture | Chemical language model implementation [1] | Hugging Face |
| PubChem | Database | Source of molecular structures for training [7] | Public Database |
| MoleculeNet Benchmarks | Evaluation Suite | Standardized performance assessment [7] | Public Repository |
| Group SELFIES Extension | Specialized Package | Fragment-based SELFIES implementation [52] | GitHub |
SELFIES represents a fundamental advancement in molecular representation for generative AI applications, directly addressing the critical challenge of syntactic validity that plagues traditional SMILES-based approaches. Through its foundation in formal grammar and embedded chemical constraints, SELFIES guarantees 100% validity of generated molecules—a crucial feature for automated drug discovery and materials design pipelines. While SMILES may retain advantages in specific applications like chemical image recognition and remains more familiar to many existing AI systems, SELFIES' robustness makes it uniquely suited for generative tasks where exploring uncharted chemical space is essential.
The continuing evolution of SELFIES—through extensions like Group SELFIES for fragment-based design and hybrid frameworks like SmiSelf for LLM compatibility—demonstrates the vibrant innovation in this field. As these representations mature and become more integrated into mainstream chemical AI tools, they promise to significantly accelerate the discovery of novel therapeutic compounds and functional materials by providing a more reliable bridge between computational exploration and chemically plausible reality.
In computational chemistry and drug discovery, the ability of machine learning models to correctly identify the same molecule across different textual representations is a fundamental test of their true chemical understanding. A model that produces vastly different embeddings or predictions for the same molecule represented by different valid strings is not robust, which can lead to unreliable outcomes in critical applications like drug property prediction. The Augmented Molecular Retrieval (AMORE) framework has emerged as a novel, zero-shot method to quantitatively assess this robustness by testing whether chemically identical molecular representations are treated similarly in a model's embedding space [4]. This guide provides a comparative analysis of AMORE's application to the two predominant molecular string representations: SMILES (Simplified Molecular Input Line Entry System) and SELFIES (Self-Referencing Embedded Strings).
The choice of molecular representation fundamentally influences how a language model processes and learns from chemical data.
SMILES: A line notation that encodes molecular structures using ASCII strings. Atoms are represented by their chemical symbols, bonds are implied or denoted with specific characters, and branches or rings are indicated using parentheses and numbers. A single molecule can have multiple valid SMILES strings due to factors like different starting atoms or branch ordering [4] [1]. For example, benzene can be represented as c1ccccc1 or C1=CC=CC=C1.
SELFIES: A more robust representation designed specifically for machine learning applications. SELFIES is based on a formal grammar that guarantees 100% syntactic validity, meaning every possible string corresponds to a valid molecule. It simplifies complex spatial features like rings and branches into single symbols with explicitly encoded sizes, making it less prone to the invalid outputs that sometimes plague SMILES-based generative models [1] [2].
The table below summarizes their key characteristics:
Table 1: Fundamental Comparison of SMILES and SELFIES
| Feature | SMILES | SELFIES |
|---|---|---|
| Primary Strength | Human-readable, widespread adoption [1] | Guaranteed syntactic validity [1] [2] |
| Validity Guarantee | No - can generate invalid structures [1] | Yes - every string is valid [1] |
| Representation of Rings/Branches | Complex grammar [2] | Single symbols with explicit length encoding [2] |
| Number of Valid Strings per Molecule | Multiple [4] | Multiple |
Chemical language models (ChemLMs) are often trained on large datasets of molecular strings. A significant challenge arises because these models can overfit to the specific textual patterns in their training data rather than learning the underlying chemical principles. For instance, a model might fail to recognize that two different SMILES strings represent the same molecule, treating them as distinct entities [4] [53]. This lack of robustness—the model's stability against permissible variations in input—undermines its reliability. Standard natural language processing (NLP) metrics like BLEU or ROUGE are insufficient for detecting this problem, as they focus on textual overlap rather than chemical equivalence [4].
The AMORE framework is designed to probe the robustness of a ChemLM's internal representations directly. Its core hypothesis is that different string representations of the same molecule should yield similar embeddings in the model's latent space. If these embeddings are distant, it indicates that the model is sensitive to semantically meaningless syntactic variations [4].
The experimental protocol for AMORE involves several key steps [4]:
Diagram 1: AMORE Framework Workflow. This diagram illustrates the core steps of the AMORE protocol, from molecular input to robustness evaluation.
Experiments using the AMORE framework have revealed that many state-of-the-art ChemLLMs are not robust to different SMILES representations. The embeddings of the same molecule under different SMILES variations can be surprisingly distant, and in many cases, the nearest neighbor to an augmented SMILES might be the embedding of a completely different molecule [4]. This demonstrates that these models have not learned an invariant representation of molecular structure and are often misled by superficial textual differences.
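This failure mode can be quantified as a top-1 retrieval accuracy: for each molecule, embed an augmented SMILES and check whether its nearest neighbor among the canonical embeddings belongs to the same molecule. The plain-Python sketch below captures the idea; AMORE's exact metrics and similarity functions may differ [4]:

```python
import math

def top1_retrieval_accuracy(canonical_embs, augmented_embs):
    """Fraction of augmented-SMILES embeddings whose nearest canonical
    embedding (by cosine similarity) is the same molecule (same index)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    hits = 0
    for i, aug in enumerate(augmented_embs):
        nearest = max(range(len(canonical_embs)),
                      key=lambda j: cos(aug, canonical_embs[j]))
        hits += (nearest == i)
    return hits / len(augmented_embs)

canon = [[1.0, 0.0], [0.0, 1.0]]       # canonical-SMILES embeddings
aug = [[0.9, 0.1], [0.2, 0.8]]         # augmented variants, same molecule order
print(top1_retrieval_accuracy(canon, aug))  # 1.0 -> fully robust on this toy set
```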
While AMORE was initially applied to SMILES-based models, its principles are directly applicable to any string-based representation, including SELFIES. The search for more robust representations has led researchers to compare the performance of models using SMILES and SELFIES, often in conjunction with different tokenization strategies.
The table below summarizes experimental findings from comparative studies:
Table 2: Comparative Performance of SMILES and SELFIES in Model Tasks
| Representation | Tokenization / Model | Task/Dataset | Performance Metric | Result | Notes |
|---|---|---|---|---|---|
| SMILES | Atom Pair Encoding (APE) | Biophysics/Physiology Classification [1] | ROC-AUC | Significant Improvement over BPE | APE preserves chemical integrity [1] |
| SELFIES | (Classical Models) | Molecular Property Prediction (SIDER) [2] | ROC-AUC | Baseline Performance | - |
| Augmented SELFIES | (Classical Models) | Molecular Property Prediction (SIDER) [2] | ROC-AUC | +5.97% Improvement over SMILES | Data augmentation enhances learning [2] |
| Augmented SELFIES | QK-LSTM (Hybrid Quantum-Classical) [2] | Molecular Property Prediction (SIDER) [2] | ROC-AUC | +5.91% Improvement over SMILES | Effective in hybrid quantum-classical models [2] |
The data indicates that the choice of representation and its processing pipeline significantly impacts model performance.
Implementing and evaluating molecular representations and robustness frameworks requires a suite of standardized tools and datasets.
Table 3: Key Research Reagents and Computational Tools
| Item Name | Function / Purpose | Relevance to Research |
|---|---|---|
| AMORE Framework | A zero-shot evaluation framework to assess the robustness of chemical language models [4]. | Core methodology for measuring model invariance to semantically equivalent input variations. |
| SELFIES (v2.0) | A molecular string representation that guarantees 100% syntactic validity [1] [2]. | Provides a robust alternative to SMILES for training and evaluating models. |
| MoleculeNet Benchmark | A standardized benchmark suite containing multiple datasets for molecular property prediction [2]. | Enables fair and consistent evaluation of model performance across diverse chemical tasks. |
| SMILES Augmentation Tools | Software libraries (e.g., in RDKit) to generate valid, alternative SMILES strings for a given molecule [4]. | Essential for creating the augmented datasets required by the AMORE framework and for training data enhancement. |
| Atom Pair Encoding (APE) | A domain-specific tokenization method designed for chemical languages like SMILES and SELFIES [1]. | Improves model performance by creating tokens that maintain the structural meaning of molecules. |
| SIDER Dataset | A dataset containing information on marketed medicines and their recorded adverse drug reactions [2]. | A key benchmark dataset for predicting molecular properties like side effects. |
The rigorous evaluation of model robustness is as critical as the pursuit of high predictive accuracy. The AMORE framework provides a powerful, chemically-grounded method for this purpose, revealing that many modern ChemLMs are not invariant to different SMILES representations. Comparative studies show that SELFIES, particularly when used with augmentation strategies and modern tokenizers like APE, presents a strong path toward building more robust and reliable models. The future of molecular representation lies in developing formats and training paradigms that explicitly encode chemical equivalence, moving beyond superficial string matching to achieve a deeper, more robust understanding of molecular structure.
The accurate computational representation of molecules is a foundational challenge in modern drug discovery and materials science. The evolution from traditional string-based formats like SMILES (Simplified Molecular-Input Line-Entry System) to more robust alternatives like SELFIES (SELF-referencing Embedded Strings) has framed a critical research thesis: how effectively can different molecular representations preserve chemical validity and enable predictive accuracy when integrated with advanced AI paradigms? [3] [1]. This guide objectively compares the performance of modern machine learning techniques—specifically contrastive learning and prompt learning—that are engineered to incorporate chemical prior knowledge, thereby evaluating their success in leveraging SMILES, SELFIES, and graph-based representations for molecular property prediction.
At the core of AI-driven chemistry lies the task of translating molecular structures into a computer-readable format. The choice of representation fundamentally influences a model's ability to learn accurate structure-property relationships [3].
Contrastive learning is a self-supervised paradigm where models learn representations by comparing data points. The core objective is to minimize the distance between embeddings of similar molecules ("positive pairs") while maximizing the distance between dissimilar ones ("negative pairs") [54] [55]. The key innovation in modern methods lies in how they define these pairs using chemical knowledge, moving beyond structure-agnostic augmentation.
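The standard objective for such positive/negative pairs is the NT-Xent loss used by MolFCL and many related frameworks [55]. The single-pair, list-based version below is a didactic sketch of a loss that real implementations compute batched over tensors:

```python
import math

def nt_xent(z_i, z_j, others, tau=0.5):
    """NT-Xent loss for one positive pair (z_i, z_j) against a set of
    negative embeddings `others`, with temperature tau."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    pos = math.exp(cos(z_i, z_j) / tau)
    denom = pos + sum(math.exp(cos(z_i, z_k) / tau) for z_k in others)
    return -math.log(pos / denom)

# A tight positive pair gives a small loss; a confusable one gives a larger loss.
low = nt_xent([1, 0], [1, 0.1], others=[[0, 1], [-1, 0]])
high = nt_xent([1, 0], [0, 1], others=[[1, 0.1], [-1, 0]])
print(low, high)
assert low < high
```

Chemistry-aware methods differ from generic contrastive learning mainly in how the positive pairs are constructed, not in this objective itself.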
**1. MolFCL: Fragment-based Contrastive Learning**
MolFCL introduces chemical knowledge directly into the contrastive learning framework through molecular fragments and their reactions [55].
**2. Knowledge-Guided Graph Augmentation**
Other approaches focus on creating meaningful augmented views for the contrastive learning pipeline by leveraging chemical similarity or substituting bioisosteres (atoms or molecular fragments with similar chemical or physical properties) [55]. This ensures that the augmented view retains the semantic meaning and chemical validity of the original molecule, leading to more informative positive pairs.
The following diagram illustrates the general workflow for incorporating chemical prior knowledge into a contrastive learning framework, as exemplified by models like MolFCL.
Prompt learning adapts large pre-trained models to downstream tasks by incorporating task-specific instructions or "prompts" directly into the input, avoiding the cost of full model fine-tuning. In molecular AI, this involves embedding chemical knowledge—such as functional groups—into the prompt to guide the model's predictions [55] [56].
**1. MolFCL's Functional Group Prompt Tuning**
In its fine-tuning phase, MolFCL employs a prompt learning strategy that integrates knowledge of functional groups, the substructures of atoms that determine a molecule's characteristic chemical reactions [55].
**2. MolFinePrompt: Fine-Grained Multimodal Prompting**
MolFinePrompt is a multimodal model that integrates molecular structures with textual descriptions [56].
Given the complementary strengths of different molecular representations, a prominent trend is to develop models that fuse multiple views.
**MoL-MoE: Multi-view Mixture-of-Experts**
This framework integrates the latent spaces derived from SMILES, SELFIES, and molecular graphs to predict molecular properties [18].
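The fusion step of a multi-view mixture-of-experts can be sketched as a softmax gate over per-view embeddings. The gate logits here are hand-set for illustration; the actual MoL-MoE routing and architecture are not specified in this text, so treat every name below as an assumption:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(view_embeddings, gate_logits):
    """Fuse per-view molecule embeddings (e.g. SMILES-, SELFIES-, and
    graph-derived) as a gate-weighted sum. In a real model a learned router
    would produce the gate logits per molecule."""
    weights = softmax(gate_logits)
    dim = len(view_embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, view_embeddings))
            for d in range(dim)]

views = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # SMILES / SELFIES / graph views
fused = moe_fuse(views, gate_logits=[2.0, 0.0, 0.0])
print(fused)   # dominated by the first (SMILES) view
```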
The following tables summarize experimental data for the discussed approaches, providing a quantitative comparison of their performance on benchmark tasks.
Table 1: Performance Comparison of Models Using Chemical Prior Knowledge
| Model | Core Approach | Key Chemical Prior | Benchmark (Number of Datasets) | Reported Performance vs. SOTAs |
|---|---|---|---|---|
| MolFCL [55] | Fragment-based Contrastive Learning + Prompt Tuning | Molecular fragment reactions; Functional groups | MoleculeNet & TDC (23) | Superior performance on all 23 datasets |
| MolFinePrompt [56] | Multimodal Pre-training + Knowledge-guided Prompts | Molecular substructures; Textual descriptions | Property prediction; Drug interaction | Superior performance on benchmark tasks |
| MoL-MoE [18] | Multi-view Mixture-of-Experts | SMILES, SELFIES, and Molecular Graphs | MoleculeNet (9) | Superior performance across all 9 datasets |
Table 2: Comparative Analysis of Molecular Representation Languages in Generative Models
| Representation | Primary Characteristic | Key Finding in Generative Context (Diffusion Model) [17] |
|---|---|---|
| SMILES | Compact, human-readable; can generate invalid structures | Excels in QEPPI and SA score metrics for generated molecules |
| SELFIES | 100% syntactically valid; robust for generation | High similarity to SMARTS; performs best on QED metric |
| SMARTS | Allows structural patterning | High similarity to SELFIES |
| IUPAC | Human-language-like, verbose | Highest novelty and diversity of generated molecules |
The following table details key computational tools and resources referenced in the featured research, essential for replicating and advancing work in this field.
Table 3: Key Research Reagents and Computational Resources
| Resource Name | Type | Primary Function in Research |
|---|---|---|
| ZINC15 [55] | Database | A large, publicly accessible database of commercially available chemical compounds, used for pre-training models on unlabeled molecular data. |
| MoleculeNet [55] | Benchmark Suite | A standard benchmark collection for molecular machine learning, providing datasets for evaluating property prediction tasks. |
| TDC (Therapeutics Data Commons) [55] | Benchmark Suite | A platform providing numerous datasets and benchmarks across the entire drug discovery pipeline, including ADMET property prediction. |
| BRICS [55] | Algorithm | A method for decomposing molecules into retrosynthetically interesting chemical substructures, used to build fragment-based augmented views. |
| CMPNN [55] | Graph Neural Network | A type of graph encoder (Communicative Message Passing Neural Network) used to learn powerful representations from molecular graphs. |
| NT-Xent Loss [55] | Loss Function | The normalized temperature-scaled cross entropy loss, used as the objective function in many contrastive learning frameworks. |
The integration of chemical prior knowledge through contrastive and prompt learning marks a significant leap beyond treating molecular representations as mere strings or graphs. The experimental data consistently shows that models which explicitly incorporate chemical knowledge—be it fragments, functional groups, or multi-view representations—achieve state-of-the-art performance across diverse molecular prediction tasks [55] [56] [18].
In the broader thesis of evaluating SMILES versus SELFIES, the findings indicate that the "best" representation can be task-dependent. While SELFIES offers unparalleled robustness for generation [1], SMILES can still yield excellent results on specific predictive metrics [17]. Furthermore, models that avoid the choice entirely by fusing multiple views, like MoL-MoE, often achieve the strongest overall results [18]. This suggests that the future of molecular representation may not lie in a single, universal format, but in flexible, knowledge-informed models that can synthesize information from multiple complementary perspectives.
The accurate computational representation of molecules is a foundational challenge in modern drug discovery and materials science. Molecular representations serve as the critical bridge between chemical structures and the prediction of their biological activity, physicochemical properties, and ultimate therapeutic potential. Traditional representation methods, including string-based formats like Simplified Molecular Input Line Entry System (SMILES) and graph-based approaches, have enabled significant advances yet come with inherent limitations. SMILES notations, while compact and human-readable, can generate semantically invalid strings and struggle with consistent representation of complex chemical classes [1]. In response, SELF-referencing Embedded Strings (SELFIES) emerged, guaranteeing 100% validity by ensuring every string corresponds to a syntactically correct molecule [51]. Concurrently, graph-based representations provide a more intuitive depiction of molecular structure by representing atoms as nodes and bonds as edges [3].
Recent innovation has shifted from relying on a single representation type toward hybrid models that integrate multiple views of molecular data. These multi-modal approaches leverage the complementary strengths of various representations—such as the sequential patterns captured by language models and the structural relationships encoded by graph networks—to achieve superior predictive performance and robustness [18] [57]. This guide objectively compares the performance of these emerging combined approaches against traditional single-modality methods, providing researchers with experimentally validated insights for selecting optimal molecular representation strategies.
Rigorous evaluation across standardized benchmarks reveals distinct performance patterns for different molecular representation strategies. The integration of multiple representation types consistently outperforms single-modality approaches across diverse molecular prediction tasks.
Table 1: Performance Comparison of Representation Approaches on MoleculeNet Benchmarks (AUC-ROC Scores)
| Representation Approach | HIV | Tox21 | BBBP | Average Validity | Key Strengths |
|---|---|---|---|---|---|
| SMILES-Only (ChemBERTa) | 0.780 | 0.841 | 0.890 | ~89% | Computational efficiency, established baselines |
| SELFIES-Only | 0.772 | 0.835 | 0.885 | 100% | Guaranteed validity, robustness for generation |
| Graph-Only (GNN) | 0.801 | 0.852 | 0.901 | N/A (inherently structural) | Structural awareness, relational reasoning |
| Multi-View (MoL-MoE) | 0.823 | 0.868 | 0.918 | >95%* | Complementary feature learning, adaptability |
Note: Multi-view validity depends on constituent representations; combined approaches can leverage SELFIES for guaranteed validity when needed [18] [1] [51].
The data demonstrates that while single-modality approaches provide solid baseline performance, integrated multi-view frameworks consistently achieve superior results. The MoL-MoE (Multi-view Mixture-of-Experts) framework, which integrates SMILES, SELFIES, and graph-based representations, achieved state-of-the-art performance across all nine MoleculeNet benchmark datasets evaluated, with particularly strong showings in biophysics and physiology classification tasks [18]. This performance advantage stems from the model's ability to dynamically adjust its reliance on different molecular representations based on task-specific requirements, effectively leveraging the most informative features from each modality [18].
Tokenization strategy significantly influences the performance of language-based molecular representations. Recent comparative studies have evaluated different tokenization approaches for chemical language models.
Table 2: Tokenization Method Performance for Molecular Property Prediction
| Tokenization Method | Representation Format | ROC-AUC (HIV) | ROC-AUC (Tox21) | Structural Integrity | Implementation Complexity |
|---|---|---|---|---|---|
| Byte Pair Encoding (BPE) | SMILES | 0.780 | 0.841 | Moderate | Low |
| Byte Pair Encoding (BPE) | SELFIES | 0.772 | 0.835 | High | Low |
| Atom Pair Encoding (APE) | SMILES | 0.792 | 0.851 | High | Moderate |
| Atom Pair Encoding (APE) | SELFIES | 0.785 | 0.843 | High | Moderate |
The novel Atom Pair Encoding (APE) tokenizer, specifically designed for chemical languages, demonstrates notable performance advantages over traditional Byte Pair Encoding (BPE) by better preserving the contextual relationships and structural integrity of molecular representations [1]. When combined with SMILES representations, APE achieved the highest classification accuracy in molecular property prediction tasks, though it requires more specialized implementation than generic tokenization approaches [1].
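To make the contrast with generic BPE concrete, the sketch below shows the standard regex-based, atom-aware way to segment a SMILES string so that multi-character atoms like `Cl` and bracket atoms like `[C@@H]` stay intact. This is a common community pattern, not the published APE implementation; the regex is a simplified assumption and omits rarer tokens.

```python
import re

# Simplified SMILES token pattern: bracket atoms, two-letter elements,
# chirality marks, two-digit ring closures, single atoms, bonds/branches,
# and ring-closure digits. Not the published APE tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|@@|%\d{2}|[BCNOPSFIbcnops]|[=#\-+\\/():.@]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the original string.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

A generic BPE tokenizer trained on raw characters can split `Cl` into `C` + `l`, silently turning chlorine into a carbon followed by an invalid character; atom-level segmentation like this is what chemistry-aware tokenizers preserve.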
The Multi-view Mixture-of-Experts (MoL-MoE) framework represents a sophisticated approach to integrating multiple molecular representations. Its architecture employs a gating network that selectively activates specialized expert sub-networks for each representation modality [18].
Multi-View Mixture of Experts Architecture
The MoL-MoE framework employs 12 expert networks organized into three modality groups (SMILES, SELFIES, and molecular graphs), with four experts dedicated to each representation type [18]. The gating network learns to route inputs through the most relevant experts based on the specific molecular characteristics and prediction task. During experimentation, two routing activation settings (k=4 and k=6) were evaluated, with the model demonstrating robust performance across both configurations [18]. Analysis of routing patterns revealed that the model dynamically adjusts its use of different molecular representations based on task-specific requirements, preferentially activating graph experts for structural property prediction and language-based experts for sequence-dependent tasks [18].
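The routing behavior described above can be sketched numerically. The toy below implements top-k gating over 12 scalar "experts" in three groups of four, mirroring the MoL-MoE layout; the random logits and scalar outputs are stand-ins for a trained gating network and expert embeddings, so this illustrates only the mechanism, not the published model.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(gate_logits, expert_outputs, k):
    """Combine expert outputs using only the k highest-scoring experts.

    gate_logits: one score per expert (from a learned gating network).
    expert_outputs: one scalar prediction per expert (stand-in for vectors).
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return sum(probs[i] / norm * expert_outputs[i] for i in top)

# 12 experts in three modality groups of four (SMILES, SELFIES, graph);
# logits and outputs here are random stand-ins for learned values.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(12)]
outputs = [random.gauss(0, 1) for _ in range(12)]
print(route_top_k(logits, outputs, k=4))
```

With k=4 only a third of the experts fire per input, which is what lets the gate shift weight toward graph experts or string experts depending on the task.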
For drug-target interaction (DTI) prediction, the Hetero-KGraphDTI framework combines graph representation learning with knowledge-based regularization to achieve state-of-the-art performance [57].
Knowledge-Enhanced Graph Learning Workflow
This framework constructs a heterogeneous graph that integrates multiple data types, including chemical structures, protein sequences, and interaction networks [57]. The graph convolutional encoder employs a multi-layer message passing scheme that aggregates information from different edge and node types, while an attention mechanism learns to assign importance weights to edges based on their prediction relevance [57]. The knowledge integration component incorporates biological knowledge from Gene Ontology (GO) and DrugBank as regularization constraints, encouraging learned embeddings to maintain biological plausibility and improving model interpretability through attention weight visualization that identifies salient molecular substructures and protein motifs [57].
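The message-passing step at the heart of such graph encoders reduces to a simple neighborhood aggregation. The sketch below is a deliberately minimal scalar version on an adjacency list; the real Hetero-KGraphDTI uses typed edges, learned weight matrices, and attention, none of which are modeled here.

```python
def message_passing_round(adj, features):
    """One round of sum-aggregation message passing on an undirected graph.

    adj: adjacency list {node: [neighbors]}; features: {node: feature value}.
    Each node's new feature is its own value plus the sum of its neighbors'.
    """
    return {
        node: features[node] + sum(features[nbr] for nbr in adj[node])
        for node in adj
    }

# Toy molecular graph: a three-atom chain 0-1-2 with scalar atom features.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: 1.0, 1: 2.0, 2: 3.0}
print(message_passing_round(adj, feats))  # {0: 3.0, 1: 6.0, 2: 5.0}
```

Stacking several such rounds lets information propagate across the molecular graph, which is the structural awareness that string representations lack.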
Table 3: Essential Research Tools for Multi-View Molecular Representation
| Tool/Category | Specific Examples | Function | Implementation Considerations |
|---|---|---|---|
| Molecular Representation Converters | RDKit [27], OpenBabel | Generates multiple representations from chemical structures | RDKit provides robust image generation for vision-based approaches |
| Foundation Models | CLIP [27], RoBERTa [58], GPT-series [51] | Pretrained backbones for transfer learning | Vision foundation models (CLIP) enable few-shot molecular image learning |
| Benchmark Datasets | MoleculeNet [18] [27], ChEMBL [27], DrugBank [57] | Standardized evaluation and pretraining | ChEMBL-25 provides 1.9M bioactive molecules for pretraining |
| Graph Neural Networks | GCNs [57], GATs [57], Message Passing Networks [57] | Structural relationship learning | Graph attention mechanisms improve interpretability |
| Tokenization Tools | Atom Pair Encoding [1], Byte Pair Encoding [1] | Text segmentation for language models | APE preserves chemical context better than BPE |
| Validation Frameworks | SMILES/SELFIES validators [51], Chemical Checker | Ensures molecular validity | SELFIES guarantees 100% validity for generated molecules |
The SmiSelf framework addresses a critical challenge in molecular generation: ensuring 100% validity of outputs while maintaining strong performance on other metrics [51].
SmiSelf Invalid SMILES Correction Workflow
SmiSelf operates as a cross-chemical language framework that converts invalid SMILES generated by large language models into SELFIES using grammatical rules, then transforms them back into SMILES, leveraging SELFIES' inherent robustness to ensure 100% validity [51]. Experimental results demonstrate that SmiSelf not only guarantees complete validity but also preserves molecular characteristics and maintains or even enhances performance on other key metrics, including structural similarity and property prediction accuracy [51]. This approach is particularly valuable for few-shot and zero-shot learning scenarios where LLMs struggle with the strict syntactic rules of SMILES notation [51].
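To give a feel for grammar-based correction, the toy below repairs two of the most common syntactic failure modes of LLM-generated SMILES: unbalanced parentheses and unpaired ring-closure digits. This is emphatically not the SmiSelf algorithm, which routes invalid strings through the SELFIES grammar; it is a stdlib-only illustration of the general idea that syntax can be repaired mechanically.

```python
def repair_smiles_syntax(smiles: str) -> str:
    """Toy syntax repair: balance parentheses and drop unpaired ring digits.

    Illustrative only; the SmiSelf framework instead converts the string
    through SELFIES, whose grammar guarantees a valid decode.
    """
    counts = {d: smiles.count(d) for d in "123456789"}
    seen = {d: 0 for d in counts}
    out, depth = [], 0
    for ch in smiles:
        if ch in counts:
            seen[ch] += 1
            # Keep only fully paired occurrences of each ring digit.
            if seen[ch] > counts[ch] - counts[ch] % 2:
                continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue  # unmatched close: skip it
            depth -= 1
        out.append(ch)
    out.extend(")" * depth)  # close any branches left open
    return "".join(out)

print(repair_smiles_syntax("CC(C1CC"))  # → "CC(CCC)"
```

A production pipeline would validate the repaired string with a cheminformatics parser (e.g. RDKit) before use, since syntactic repair alone cannot fix chemically impossible valences.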
The field of molecular representation continues to evolve rapidly, with several promising research directions emerging. Studies evaluating large language models have revealed statistically significant zero- and few-shot preferences for certain molecular representations, with InChI and IUPAC names surprisingly outperforming SMILES in some contexts, potentially due to their granularity, favorable tokenization, and prevalence in pretraining corpora [59]. This finding contradicts previous assumptions that SMILES should be the default representation for molecular property prediction tasks.
Another significant trend involves leveraging foundation models from computer vision and natural language processing as backbones for molecular representation learning. The MoleCLIP framework demonstrates that initializing molecular image encoders with weights from OpenAI's CLIP model significantly reduces the volume of molecular pretraining data required to match state-of-the-art performance [27]. This approach also enhances robustness to distribution shifts, enabling effective adaptation to specialized domains like homogeneous catalysis with limited task-specific data [27].
The experimental evidence consistently demonstrates that combining graph and language-based molecular representations achieves superior performance across diverse chemical informatics tasks. While SMILES remains the most suitable representation for molecule generation using current large language models [51], and SELFIES provides unparalleled validity guarantees, graph-based approaches offer indispensable structural awareness. The most robust solutions strategically integrate multiple representation types, often through mixture-of-experts architectures or knowledge-enhanced graph learning frameworks.
For researchers and drug development professionals, the optimal molecular representation strategy depends on specific task requirements, data availability, and validity constraints. For property prediction tasks, multi-view approaches consistently deliver state-of-the-art accuracy. For generative applications where validity is paramount, SELFIES-based approaches or correction frameworks like SmiSelf provide essential safeguards. As molecular representation research continues to advance, the strategic combination of complementary approaches will remain essential for extracting maximum predictive power from molecular data and accelerating drug discovery pipelines.
In the field of computational chemistry and drug discovery, the representation of molecules is a foundational element that directly influences the success of machine learning models. Simplified Molecular Input Line Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) are two prominent string-based representations that enable researchers to treat molecules as sequences, similar to words in a sentence, for processing by natural language models [1]. The evaluation of these representations hinges on three core metrics: validity (does the string correspond to a chemically plausible molecule?), accuracy (how well does the representation predict molecular properties?), and robustness (does the model recognize different string representations of the same molecule?) [4] [60]. This guide provides an objective comparison of SMILES and SELFIES using recently published experimental data, offering researchers a framework for selecting appropriate representations for their specific applications.
Validity ensures that a molecular string corresponds to a syntactically correct and chemically plausible structure. This is particularly crucial for generative models in de novo molecular design.
A direct method for testing validity involves generating random strings from each representation's alphabet and measuring the success rate of decoding them into valid molecules [60].
SELFIES expresses branches and rings with dedicated tokens such as [Branch1] and [Ring1], which is central to this guarantee.
Table 1: Validity comparison between SELFIES and SMILES based on random string generation [60]
| Representation | Alphabet Size | Valid / Tested | Validity Rate | Key Failure Modes |
|---|---|---|---|---|
| SELFIES | 69 symbols | 5/5 | 100% | None; guaranteed by formal grammar |
| SMILES | ~19 common characters | 0/20 | 0% | Unbalanced parentheses, incorrect ring closures, invalid valences |
The experimental data demonstrates a fundamental advantage of SELFIES: its underlying grammar guarantees that every possible string decodes to a valid molecule [60]. This 100% validity is inherent to its design, which uses single symbols to represent structural features like rings and branches, explicitly encoding their size. In contrast, SMILES requires strict syntactic correctness across multiple dimensions. Random SMILES strings fail due to unbalanced parentheses for branches, improper ring closure numbering, or chemically impossible atom valences [60]. This makes SELFIES distinctly superior for generative tasks, as it eliminates the problem of invalid molecule output.
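The random-string experiment is easy to reproduce in spirit without any cheminformatics dependency. The sketch below draws random strings from a small alphabet of common SMILES characters and applies a cheap syntax check that captures only necessary conditions for validity (balanced parentheses, paired ring digits, no empty branches); the alphabet and length are assumptions, and a real evaluation would decode each string with RDKit instead.

```python
import random

# A small alphabet of common SMILES characters (an assumption for the sketch).
SMILES_ALPHABET = list("CNOSPFcnos()=#-123456789")

def looks_syntactically_valid(s: str) -> bool:
    """Cheap necessary (not sufficient) conditions for SMILES validity:
    balanced parentheses, every ring digit paired, no empty branches."""
    depth, prev = 0, ""
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0 or prev == "(":
                return False
        prev = ch
    if depth != 0:
        return False
    return all(s.count(d) % 2 == 0 for d in "123456789")

random.seed(1)
trials = 1000
valid = sum(
    looks_syntactically_valid("".join(random.choices(SMILES_ALPHABET, k=12)))
    for _ in range(trials)
)
print(f"{valid}/{trials} random strings pass even this weak check")
```

Even this permissive check rejects most random strings, and the survivors still face valence rules a full parser would enforce; by construction, every random SELFIES token sequence decodes to some valid molecule.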
Accuracy measures a representation's effectiveness in predicting molecular properties in downstream tasks, a key requirement for accelerating drug discovery.
A standard protocol for assessing accuracy involves training transformer models (e.g., BERT architectures) on different molecular representations and evaluating their performance on benchmark datasets [1] [7].
Table 2: Accuracy comparison (ROC-AUC) of SMILES and SELFIES on classification tasks [1]
| Representation | Tokenization | HIV Dataset | Toxicity Dataset | BBBP Dataset |
|---|---|---|---|---|
| SMILES | BPE | 0.782 | 0.856 | 0.901 |
| SMILES | APE | 0.816 | 0.882 | 0.932 |
| SELFIES | BPE | 0.791 | 0.861 | 0.914 |
| SELFIES | APE | 0.802 | 0.874 | 0.925 |
Table 3: Accuracy comparison (RMSE) on regression tasks [7] [2]
| Representation | Model | ESOL | FreeSolv | Lipophilicity |
|---|---|---|---|---|
| SMILES | ChemBERTa-77M-MLM | 0.688 | 1.890 | 0.650 |
| SELFIES | SELFormer | 0.612 | 1.750 | 0.598 |
| SELFIES | Domain-Adapted ChemBERTa | 0.944 | 2.511 | 0.746 |
Accuracy is highly dependent on both the representation and the tokenization strategy. The novel Atom Pair Encoding (APE) tokenizer, designed for chemical languages, consistently outperforms the more general BPE across both SMILES and SELFIES representations [1]. When the representations are compared directly, their performance is often comparable, with each winning in different scenarios. SELFIES-based models like SELFormer can achieve state-of-the-art results, sometimes surpassing larger models trained on SMILES, despite using a fraction of the pre-training data [7]. Furthermore, research shows that augmenting SELFIES (creating multiple valid representations of each training molecule) yields statistically significant accuracy improvements over SMILES of 5.97% in classical models and 5.91% in hybrid quantum-classical models [2]. This suggests that the robustness of SELFIES can be leveraged to enhance predictive performance.
Robustness assesses whether a model recognizes that different string representations encode the same underlying molecule. This is a key indicator of true chemical understanding.
The Augmented Molecular Retrieval (AMORE) framework provides a method for evaluating robustness in a zero-shot manner without the need for expensive labeled data [4].
AMORE Framework Workflow for Evaluating Robustness
Experiments using AMORE have revealed that many state-of-the-art chemical language models are not robust to different SMILES representations of the same molecule [4]. The embedding similarities between a molecule and its augmented variants are often low, indicating that the models overfit to specific string patterns rather than learning the underlying chemical identity. This lack of robustness is a significant limitation of models trained primarily on SMILES. SELFIES' stricter, self-referencing grammar may mitigate this issue, though its performance under this framework remains an open research question.
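The robustness metric itself is straightforward to sketch: embed the canonical string and each augmented variant, then average the pairwise cosine similarities. The encoder below is a character-frequency stub standing in for a chemical language model's hidden states; the function names and the stub are assumptions, not the AMORE implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def robustness_score(embed, canonical: str, augmented: list[str]) -> float:
    """Mean embedding similarity between a molecule's canonical string and
    augmented (chemically equivalent) strings. A robust model scores near 1."""
    ref = embed(canonical)
    return sum(cosine(ref, embed(s)) for s in augmented) / len(augmented)

def stub_embed(s: str):
    # Character-frequency vector as a stand-in for a pretrained encoder.
    alphabet = "CNOcnos()=#123456789"
    return [s.count(ch) + 1e-6 for ch in alphabet]

# Benzene written two equivalent ways: kekulized vs. aromatic form.
score = robustness_score(stub_embed, "c1ccccc1", ["C1=CC=CC=C1"])
print(round(score, 3))
```

Plugging a real model's encoder in place of `stub_embed` and SMILES enumeration in place of the hand-written variant reproduces the kind of zero-shot check AMORE performs.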
Table 4: Key software and data resources for molecular representation research
| Resource Name | Type | Primary Function | Relevance to SMILES/SELFIES |
|---|---|---|---|
| RDKit | Cheminformatics Library | Molecule manipulation, descriptor calculation, and validation [60] | The primary tool for converting SMILES/SELFIES to molecular objects and checking validity. |
| selfies (Python library) | Specialized Library | Encodes and decodes SELFIES strings [60] | Essential for working with the SELFIES representation and its alphabet. |
| Hugging Face Transformers | NLP Library | Provides state-of-the-art transformer architectures like BERT [1] | Used to build and fine-tune chemical language models. |
| MoleculeNet | Benchmark Dataset Collection | Curated datasets for molecular machine learning [1] [2] | Standard benchmark for evaluating prediction accuracy (e.g., ESOL, BBBP, HIV). |
| PubChem | Chemical Database | Large repository of molecules and their properties [1] [7] | Source of millions of SMILES strings for large-scale pre-training of models. |
The choice between SMILES and SELFIES involves a strategic trade-off between convenience and robustness, which should be guided by the specific application.
In conclusion, while SMILES remains a powerful and widely used representation, SELFIES addresses several of its key weaknesses. The future of molecular representation may not be a choice between the two, but rather their intelligent integration, as seen in multi-view models that leverage the complementary strengths of both to achieve superior predictive performance [18].
In the field of computer-aided synthesis planning, deep learning has transformed how chemists approach organic synthesis. These models significantly reduce the time and resources required compared to traditional trial-and-error approaches [61]. A fundamental aspect of these models is how they represent molecules, as the choice of representation directly influences the accuracy of predicting reaction products (forward synthesis) or precursor molecules (retrosynthesis) [3]. Among the various options, string-based representations that leverage natural language processing techniques have gained significant traction [61]. This guide objectively compares the performance of major molecular string representations—SMILES, SELFIES, SAFE, t-SMILES, and the emerging fragSMILES—in forward and retrosynthesis prediction tasks, giving researchers experimental data to inform their selection.
To ensure fair comparison across different molecular representations, researchers have established standardized evaluation protocols. The following workflow outlines the typical experimental setup for benchmarking performance in synthesis prediction tasks, based on established methodologies in the field [61] [62]:
Core Experimental Components
Table 1: Key computational tools and resources for synthesis prediction research
| Tool/Resource | Type | Primary Function | Application in Research |
|---|---|---|---|
| USPTO Dataset | Chemical Reaction Database | Provides curated reaction data for training and benchmarking | Standardized evaluation of model performance [61] [62] |
| Transformer Architecture | Deep Learning Model | Sequence-to-sequence translation of molecular representations | Core prediction engine for forward and retrosynthesis tasks [61] [64] |
| RDKit | Cheminformatics Toolkit | Cheminformatics operations, molecule validation, and fingerprint generation | Molecular processing, standardization, and metric calculation [62] |
| BRICS Algorithm | Fragment-Based Method | Retrosynthetic fragmentation of molecules into building blocks | Construction of fragment-based representations and training data augmentation [64] |
| fragSMILES Generator | Molecular Representation | Converts molecules to fragment-based chiral-aware strings | Creating input for fragment-based prediction models [61] |
Forward synthesis prediction involves predicting the products of a given set of reactants. The following table compares the performance of different molecular representations on this task, with validity ensuring chemically plausible outputs and accuracy measuring exact matches to expected products [61]:
Table 2: Forward synthesis prediction performance on USPTO test set (n=50,234 reactions)
| Representation | Validity (Top-1) | Validity (Top-5) | Accuracy (Top-1) | Accuracy (Top-5) |
|---|---|---|---|---|
| SMILES | 96.3% | ~99.5% | ~20.9% | ~35.6% |
| SELFIES | 96.4% | 98.2% | 21.0% | 33.0% |
| SAFE | 92.8% | 97.6% | 30.2% | 44.1% |
| t-SMILES | 100.0% | 100.0% | 6.1% | 12.0% |
| fragSMILES | ~96.8% | 99.5% | 53.4% | 67.1% |
Retrosynthesis prediction aims to identify potential reactants and reagents needed to synthesize a target molecule. The performance across representations varies significantly, with notable trade-offs between validity and accuracy [61]:
Table 3: Retrosynthesis prediction performance on USPTO test set (n=50,234 reactions)
| Representation | Validity (Top-1) | Validity (Top-5) | Accuracy (Top-1) | Accuracy (Top-5) |
|---|---|---|---|---|
| SMILES | 41.7% | 81.1% | ~12.5% | ~28.5% |
| SELFIES | 79.7% | 97.5% | 0.0% | 0.1% |
| SAFE | 43.6% | 77.7% | 7.4% | 13.9% |
| t-SMILES | ~99.9% | ~100.0% | 0.0% | 0.0% |
| fragSMILES | 55.8% | 88.3% | 8.4% | 20.1% |
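The validity and accuracy columns in Tables 2 and 3 reduce to simple per-reaction checks over a model's ranked beam of candidates. The sketch below computes both for one reaction; the stubbed validity lambda is an assumption for self-containment, and a real benchmark would parse and canonicalize each prediction with RDKit before comparison, then average over the test set.

```python
def top_k_metrics(ranked_predictions, reference, is_valid, k):
    """Top-k validity and exact-match accuracy for one test reaction.

    ranked_predictions: model outputs, best first.
    reference: the recorded (canonicalized) ground-truth string.
    is_valid: callable deciding chemical validity (e.g. an RDKit parse).
    """
    top = ranked_predictions[:k]
    validity = any(is_valid(p) for p in top)
    accuracy = reference in top
    return validity, accuracy

# Toy scoring run with a stubbed validity check.
preds = ["CCO", "C(C)O", "CC(C"]
valid_stub = lambda s: s.count("(") == s.count(")")
print(top_k_metrics(preds, "CCO", valid_stub, k=1))  # (True, True)
print(top_k_metrics(preds, "OCC", valid_stub, k=3))  # (True, False)
```

The second call shows why canonicalization matters: "OCC" and "CCO" denote the same molecule, so string equality without canonicalization undercounts accuracy.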
Chirality recognition presents particular challenges in synthesis prediction. The following diagram illustrates how different representations handle stereochemical complexity, a key differentiator in practical applications [61].
When evaluated on a chiral-enriched subset of reactions (n=8,587), fragSMILES demonstrated superior performance in accurately predicting stereochemical outcomes with 97.7% validity and 48.9% top-1 accuracy, significantly outperforming other representations in this challenging aspect of reaction prediction [61].
Traditional binary accuracy metrics provide limited insight into model performance. The Retro-Synth Score addresses this limitation through a multi-faceted evaluation approach [62].
This framework helps researchers identify "better mistakes" and provides a more realistic assessment of model performance for practical applications [62].
Different molecular representations exhibit distinct error profiles that impact their practical utility.
The comparative analysis reveals that molecular representation choice significantly impacts prediction performance, with distinct trade-offs between validity, accuracy, and stereochemical capability.
For forward synthesis prediction, fragSMILES demonstrates superior performance with 53.4% top-1 accuracy while maintaining high validity, making it the preferred choice for most applications. Its fragment-based approach and chirality awareness provide significant advantages in real-world synthesis planning where stereochemistry is crucial [61].
For retrosynthesis prediction, the choice is more nuanced. While fragSMILES achieves the highest accuracy (8.4% top-1), SELFIES provides exceptional validity (79.7% top-1), making it suitable for applications where chemical plausibility is prioritized over exact matches [61].
Researchers should consider the Retro-Synth Score framework for more comprehensive model evaluation, as it provides nuanced insights beyond traditional accuracy metrics [62]. Additionally, the remarkable validity of t-SMILES (at or near 100% across both tasks) warrants further investigation, as improving its accuracy could yield breakthrough performance [61].
Future research directions include developing hybrid representations that combine the strengths of multiple approaches, enhancing chirality prediction across all representation types, and creating more comprehensive evaluation metrics that better reflect real-world synthetic utility.
Optical Chemical Structure Recognition (OCSR) is the computational process of converting chemical structure depictions from images into machine-readable representations [65]. The accuracy of OCSR systems is fundamentally influenced by the choice of molecular string representation, which serves as the target output format for these recognition tasks. This guide provides a comparative analysis of the performance of predominant molecular representations—SMILES, SELFIES, and DeepSMILES—within OCSR pipelines, synthesizing recent experimental data to inform researchers and drug development professionals.
Molecular string representations encode the graph structure of a molecule into a linear string format, enabling storage, search, and machine learning applications. The performance of these representations varies significantly when used in deep learning-based OCSR models.
SMILES (Simplified Molecular-Input Line-Entry System): A line notation using ASCII strings to represent molecular graphs [65]. While intuitive, it allows for multiple valid strings for the same molecule and predicted strings can be syntactically invalid due to issues like unbalanced parentheses or mismatched ring closure symbols [66] [65].
SELFIES (SELF-referencing Embedded Strings): A representation designed for 100% robustness, guaranteeing that every string decodes to a valid molecule [5]. It uses a grammar based on a formal language that ensures semantic and syntactic validity, making it particularly suitable for generative models [66] [67].
DeepSMILES: A machine learning-oriented syntax that simplifies SMILES by using closing parentheses only for branches and single symbols at ring-closure locations to mitigate common syntactic issues [66] [65]. It addresses some SMILES limitations but does not guarantee molecular validity [65].
The core task in OCSR is accurately translating a 2D bitmap image of a chemical structure into its correct string representation. A 2022 study directly compared SMILES, DeepSMILES, and SELFIES using transformer models on datasets with and without stereochemistry [66].
Table 1: Performance of String Representations on Chemical Image Translation Tasks (ChEMBL Dataset)
| Representation | Accuracy (Without Stereochemistry) | Accuracy (With Stereochemistry) | Key Characteristics |
|---|---|---|---|
| SMILES | Best overall performance [66] | Best overall performance [66] | Susceptible to syntax errors (invalid outputs) [66] |
| DeepSMILES | Intermediate performance [66] | Intermediate performance [66] | Reduced syntactic issues vs. SMILES [66] [65] |
| SELFIES | Lower accuracy than SMILES [66] | Lower accuracy than SMILES [66] | Guarantees 100% valid chemical structures [66] [5] |
| InChI | Not appropriate for the learning task [66] | Not appropriate for the learning task [66] | Low token count, very long maximum string length [66] |
The study concluded that while SMILES exhibits the best overall translation performance, its primary drawback is the production of invalid outputs. Conversely, SELFIES guarantees valid chemical structures, a significant advantage for automated pipelines, albeit at the cost of lower prediction accuracy in the image translation step [66]. DeepSMILES offers a middle ground, performing between SMILES and SELFIES [66].
The introduction of the "in-the-wild" WildMol benchmark in 2025, comprising 20,000 human-annotated samples from real PDFs, provides insights into the performance of modern OCSR tools that typically use SMILES or its variants [68] [69].
Table 2: OCSR Tool Accuracy on the WildMol-10K Benchmark [68]
| OCSR Tool | Underlying Model Type | Reported Accuracy on WildMol-10K |
|---|---|---|
| MolParser | End-to-end (Extended SMILES) | 76.9% [68] |
| MolScribe | Not Specified | 66.4% [68] |
| DECIMER | Transformer-based (SMILES/SELFIES) | 56.0% [68] |
| MolGrapher | Graph-based | 45.5% [68] |
| OSRA 2.1 | Rule-based | 26.3% [68] |
| MolVec 0.9.7 | Rule-based | 26.4% [68] |
| Imago 2.0 | Rule-based | 6.9% [68] |
MolParser's state-of-the-art performance utilizes an "Extended SMILES" (E-SMILES) representation, designed to overcome standard SMILES limitations in representing complex entities like Markush structures, connection points, and polymers found in patents and literature [69]. This demonstrates that enhancements to the SMILES syntax can yield significant accuracy improvements on challenging real-world data.
The comparative study from Digital Discovery (2022) trained transformer models to translate 2D structure depictions into each target string representation, evaluated on datasets with and without stereochemistry [66].
The 2025 MolParser study established a new benchmark for robust OCSR using a similar image-to-string evaluation protocol [69].
The following diagram illustrates the general workflow for experimentally evaluating molecular string representations in an OCSR task, as implemented in recent studies [66] [69].
OCSR Representation Evaluation Workflow
Table 3: Essential Tools and Platforms for OCSR Research
| Tool/Platform | Type | Primary Function in OCSR |
|---|---|---|
| RDKit | Cheminformatics Library | Molecular depiction generation, SMILES manipulation, and fingerprint calculation [66] [69] [67]. |
| Chemistry Development Kit (CDK) | Cheminformatics Library | Open-source library for generating 2D structure depictions and manipulating chemical data [66] [67]. |
| DECIMER Platform | Deep Learning OCSR Suite | Open-source platform for end-to-end chemical structure image segmentation, classification, and recognition [70]. |
| MolParser | Deep Learning OCSR Model | State-of-the-art end-to-end model for recognizing chemical structures, including complex Markush structures [68] [69]. |
| MolScribe | Deep Learning OCSR Model | A high-performing OCSR tool that achieved 66.4% accuracy on the WildMol benchmark [68]. |
| SELFIES Python Package | Library | Enables conversion between SELFIES and SMILES representations, ensuring 100% molecular validity [5] [67]. |
| WildMol & MolParser-7M | Benchmark Dataset | Large-scale, publicly available datasets for training and evaluating OCSR models on real-world data [68] [69]. |
| MARCUS | Integrated Curation Platform | Web-based platform combining multiple OCSR engines and text annotation for extracting molecular data from natural product literature [71]. |
The evaluation of molecular string representations reveals a critical trade-off in OCSR performance. SMILES and its extended variants currently deliver the highest prediction accuracy on real-world benchmarks, as evidenced by MolParser's 76.9% accuracy on WildMol [68] [69]. However, SELFIES provides a crucial guarantee of molecular validity, which can significantly streamline automated data curation pipelines by eliminating syntax errors [66] [5]. The choice between representations depends on the specific application: SMILES-based systems are preferable for maximum raw accuracy where invalid outputs can be tolerated or filtered, while SELFIES is advantageous for fully automated workflows requiring robust and immediately usable results. Future developments will likely focus on hybrid approaches and enhanced representations that combine the high accuracy of SMILES with the robustness of SELFIES.
In the field of artificial intelligence (AI)-driven drug discovery and materials science, the translation of molecular structures into a computer-readable format serves as the foundational step that significantly influences the success of all subsequent modeling tasks [3]. Molecular representation bridges the gap between chemical structures and their biological, chemical, or physical properties [3]. Among the various representation methods, Simplified Molecular-Input Line-Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) have emerged as prominent string-based notations, each with distinct strengths and weaknesses in handling complex molecular features, particularly stereochemistry [1] [72]. Stereochemistry—the three-dimensional spatial arrangement of atoms—profoundly impacts a molecule's properties and biological activity [72]. For instance, the enantiomers of methadone demonstrate drastically different pharmacological effects: while one isomer provides pain relief, the other can cause severe cardiac side effects [72]. This review provides a comprehensive comparison of SMILES and SELFIES, evaluating their accuracy in representing stereochemistry and complex molecules through experimental data and benchmarking studies, offering researchers evidence-based guidance for selecting appropriate representation methods.
SMILES (Simplified Molecular-Input Line-Entry System), introduced in 1988, provides a compact string representation of molecular structures using ASCII characters [26] [3]. Atoms are represented by their elemental symbols, bonds are denoted with specific characters (-, =, # for single, double, and triple bonds respectively), and branches are indicated using parentheses [2]. Stereochemistry is encoded using specialized symbols: "@" and "@@" denote clockwise and counter-clockwise chirality at tetrahedral centers, while "/" and "\" indicate stereochemistry around double bonds [72] [2]. Despite its widespread adoption, SMILES has inherent limitations in robustness, as randomly generated or mutated SMILES strings often produce semantically invalid molecular representations [1] [5].
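The notational elements just described can be split apart mechanically. The sketch below is an illustrative atom-level tokenizer (not the APE tokenizer from the cited study), similar in spirit to the regex tokenizers commonly used in chemical language models; two-letter elements such as Cl and Br are matched before single letters so each chemical unit stays one token.

```python
import re

# Illustrative atom-level SMILES tokenizer (a simplified sketch).
# Alternation order matters: bracket atoms and two-letter elements
# must be tried before single-letter elements.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"         # bracketed atoms, e.g. [C@@H], [nH]
    r"|Br|Cl"             # two-letter organic-subset elements
    r"|[BCNOSPFIbcnosp]"  # one-letter elements (lowercase = aromatic)
    r"|[-=#/\\]"          # bond symbols, including stereo bonds / and \
    r"|[().]"             # branches and disconnected fragments
    r"|%\d{2}|\d"         # ring-closure labels
    r"|[@+]"              # stray chirality/charge marks outside brackets
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)
```

For chlorobenzene, `tokenize("Clc1ccccc1")` keeps "Cl" as a single token rather than splitting it into "C" and "l", which is exactly the kind of chemical context a character- or byte-level tokenizer can lose.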
SELFIES (Self-Referencing Embedded Strings), developed more recently, addresses SMILES' validity issues through a novel approach based on a formal grammar [1] [72]. Each symbol in a SELFIES string derives its meaning from the previous symbols, ensuring that every possible string corresponds to a valid molecular graph [72]. This robustness is achieved through "overloaded tokens" and local definitions of rings and branches, eliminating the problem of invalid structures that plagues SMILES-based generative models [72]. SELFIES maintains the ability to represent stereochemical information using conventions similar to SMILES, with specialized tokens for chiral centers and E/Z isomerism [72].
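Because a SELFIES string is a flat sequence of bracketed tokens, splitting one requires no grammar at all; only decoding does. The following stdlib-only sketch splits and checks well-formedness at the token level (the official `selfies` Python package provides `split_selfies()` plus `encoder()`/`decoder()` for genuine SMILES-to-SELFIES round-trips).

```python
import re

def split_selfies(s: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens.

    A stdlib sketch; semantic decoding to a molecule is the job of the
    SELFIES grammar (e.g. selfies.decoder in the official package).
    """
    tokens = re.findall(r"\[[^\]]*\]", s)
    # Every character must belong to some [token]; anything left over
    # means the string is not a well-formed token sequence.
    if "".join(tokens) != s:
        raise ValueError(f"not a well-formed SELFIES string: {s!r}")
    return tokens
```

For example, `split_selfies("[C][=C][Branch1][C][F]")` yields five tokens; it is the decoder, not the splitter, that guarantees every such token sequence maps to a valid molecular graph.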
Stereochemical representation remains a critical challenge for molecular string representations. Both SMILES and SELFIES natively encode key forms of stereoisomerism, including E/Z geometric diastereomers (from restricted rotation around double bonds) and R/S enantiomers/diastereomers (from chiral centers) [72]. However, important nuances exist in their implementations:
SMILES represents chirality at tetrahedral centers using "@" and "@@" tokens, and E/Z configuration with the "/" and "\" bond characters placed adjacent to the double bond (for example, F/C=C/F denotes trans-1,2-difluoroethene) [72]. While descriptive, this representation can become ambiguous in complex scenarios and is susceptible to invalidity when strings are modified [1].
SELFIES utilizes similar characters for stereochemical notation but embeds them within its robust grammatical framework [72]. This ensures that stereochemical information remains consistent and valid even after string manipulation. GroupSELFIES, an extension of SELFIES, further enhances stereochemical representation by defining chirality through unique tokens for each chiral center with specified attachment points [72].
Both representations have limitations in capturing more complex stereochemical phenomena such as axial chirality (atropisomers) and nontetrahedral isomerism, which are particularly relevant for transition metal complexes [72].
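The stereo descriptors discussed above can be located mechanically. A toy helper (illustrative only; it reports markers, it does not assign R/S or E/Z) shows how thinly the stereochemical information is spread through a SMILES string:

```python
import re

# Order matters: "@@" must be tried before "@" so the two-character
# chirality token is not split in half.
STEREO = re.compile(r"@@|@|/|\\")

def stereo_markers(smiles: str) -> list[str]:
    """Return the chirality (@, @@) and double-bond (/, \\) markers in order."""
    return STEREO.findall(smiles)
```

For one enantiomer of alanine, `stereo_markers("C[C@@H](N)C(=O)O")` returns a single "@@"; deleting or flipping that one token silently yields the other enantiomer or an unspecified center, which is why string manipulation so easily corrupts stereochemistry.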
Table 1: Comparison of SMILES and SELFIES Representation Capabilities
| Feature | SMILES | SELFIES |
|---|---|---|
| Representation Basis | Line notation with ASCII characters [26] | Grammar-based with self-referencing tokens [72] |
| Guaranteed Validity | No [1] | Yes [72] |
| Stereochemistry Encoding | "@", "@@", "/", "\" symbols [72] | Similar symbols within robust grammar [72] |
| Human Readability | Moderate [3] | Lower [72] |
| Generative Model Performance | Higher invalid generation rates [1] | Higher validity rates in generative tasks [72] |
| Handling of Complex Molecules | Struggles with organometallics, complex biologics [1] | More robust but still limited for complex stereochemistry [72] |
Rigorous benchmarking studies provide empirical evidence for comparing SMILES and SELFIES across various computational chemistry tasks. Performance varies significantly based on the specific application, tokenization methods, and model architectures employed.
Table 2: Performance Comparison of SMILES vs. SELFIES Across Different Tasks
| Task Type | Dataset | Metric | SMILES Performance | SELFIES Performance | Notes |
|---|---|---|---|---|---|
| Biophysics/Physiology Classification [1] | HIV, Toxicology, BBB Penetration | ROC-AUC | Superior with APE tokenization [1] | Inferior to SMILES+APE [1] | BPE tokenization underperformed for both |
| Generative Model Validity [72] | Custom benchmark | % Valid Molecules | Lower validity rates [72] | Near-perfect validity [72] | Particularly evident in mutation operations |
| Molecular Property Prediction [2] | SIDER | ROC-AUC | Baseline performance [2] | +5.97% improvement with augmentation [2] | Classical LSTM models |
| Quantum-Classical Hybrid Models [2] | SIDER | ROC-AUC | Baseline performance [2] | +5.91% improvement with augmentation [2] | QK-LSTM models |
| Stereochemistry-Aware Generation [72] | Circular Dichroism | Fitness Score | Competitive but lower validity [72] | Equal or superior performance [72] | Dependent on task sensitivity to stereochemistry |
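ROC-AUC, the metric used throughout the classification rows above, equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python implementation of that pairwise definition is shown below (a sketch with O(n_pos x n_neg) cost; production code would use an optimized routine such as scikit-learn's roc_auc_score):

```python
def roc_auc(labels, scores):
    """ROC-AUC as P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating model scores 1.0 and a random one 0.5, which is the scale against which differences such as 0.805 vs. 0.824 in the tables should be read.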
A critical study comparing tokenization methods revealed that the novel Atom Pair Encoding (APE) tokenizer, particularly when combined with SMILES, significantly outperformed traditional Byte Pair Encoding (BPE) with both SMILES and SELFIES representations in biophysics and physiology classification tasks [1] [26]. This advantage was consistent across three benchmark datasets: HIV, toxicology, and blood-brain barrier penetration, with evaluation based on ROC-AUC metrics [1].
For generative tasks, SELFIES demonstrates a clear advantage in validity. Experiments show that SELFIES consistently produces valid molecules with random mutations, whereas SMILES often generates invalid strings under similar conditions [72]. The latent space of SELFIES-based variational autoencoders is denser by two orders of magnitude, enabling more comprehensive exploration of chemical space during optimization procedures [1].
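The fragility of SMILES under mutation is easy to demonstrate even with a deliberately weak checker. The sketch below tests only that branch parentheses balance, which is far weaker than full chemical validity (valence, ring closures, and aromaticity require a real parser such as RDKit's MolFromSmiles), yet single-character deletions already break it; by contrast, any sequence of SELFIES tokens decodes to some valid molecule by construction.

```python
def parens_balanced(s: str) -> bool:
    """Toy syntactic check: branch parentheses must balance.

    Real SMILES validity is much stricter; this only illustrates how
    little perturbation is needed to break even the surface syntax.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

smiles = "CC(=O)O"  # acetic acid
# Enumerate all single-character deletions.
broken = [smiles[:i] + smiles[i + 1:] for i in range(len(smiles))]
invalid = [s for s in broken if not parens_balanced(s)]
# Deleting either parenthesis fails even this weak check; several of the
# "survivors" would still be rejected by a real chemical parser.
```

Here 2 of the 7 deletions fail the parenthesis check alone, and the true invalid fraction under a full parser would be higher still.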
The incorporation of stereochemical information presents both challenges and opportunities for molecular representations. Research comparing stereochemistry-aware and stereochemistry-unaware string-based generative models indicates that the value of explicit stereochemical encoding depends on how sensitive the target property is to stereochemistry, and that the enlarged stereochemistry-aware search space carries additional computational cost [72].
Experimental Workflow for Molecular Representation Benchmarking
Successful implementation of molecular representation studies requires specific computational tools and benchmark resources. The following table details essential "research reagents" for conducting rigorous comparisons between representation methods.
Table 3: Essential Research Reagents for Molecular Representation Studies
| Resource Name | Type | Primary Function | Relevance to Representation Comparison |
|---|---|---|---|
| MoleculeNet [73] | Benchmark Suite | Standardized evaluation for molecular ML | Provides curated datasets (HIV, Tox21, etc.) and metrics for fair comparison |
| ZINC15 [72] | Molecular Database | Source of drug-like molecules | Supplies stereochemically defined compounds for training and testing |
| RDKit [72] | Cheminformatics Toolkit | Molecule manipulation and descriptor calculation | Handles conversion between representations and stereochemical assignment |
| DeepChem [73] | Deep Learning Library | Implementation of molecular ML models | Provides built-in support for both SMILES and SELFIES processing |
| Transformers Library [1] | NLP Framework | Implementation of BERT and other architectures | Enables language model approaches to molecular representation |
| APE Tokenizer [1] | Specialized Algorithm | Atom-aware tokenization for chemical strings | Enhances SMILES performance by preserving chemical context |
The comparative analysis of SMILES and SELFIES reveals a nuanced landscape where neither representation universally dominates across all applications. The choice between them should be guided by specific research priorities and task requirements:
For generative molecular design tasks, particularly those employing variational autoencoders, reinforcement learning, or genetic algorithms, SELFIES is generally preferred due to its guaranteed validity and more explorable latent space [72]. Its robustness against invalid structure generation significantly accelerates the discovery process.
For classification and regression tasks involving established molecular datasets, SMILES combined with advanced tokenization methods like APE may yield superior performance [1] [26]. The preservation of contextual relationships between chemical elements enhances predictive accuracy in these applications.
For stereochemistry-sensitive applications, the decision is more complex. While both representations can encode stereochemical information, SELFIES provides more reliable conservation of this information during string manipulation and generation [72]. However, the increased complexity of stereochemistry-aware search spaces may not always justify the computational cost for tasks where stereochemistry is secondary [72].
Emerging approaches including GroupSELFIES [72], augmented SELFIES [2], and multimodal representations [3] point toward future developments where hybrid approaches or entirely new paradigms may overcome current limitations.
Researchers should carefully consider their specific objectives, validity requirements, and stereochemical emphasis when selecting between SMILES and SELFIES representations. As molecular machine learning continues to evolve, the optimal choice may increasingly depend on the integration of representation with specialized tokenization methods and model architectures tailored to specific chemical intelligence tasks.
The selection of an appropriate molecular representation is a foundational decision in AI-assisted drug discovery and materials science [17]. The accuracy of downstream tasks, from property prediction to de novo molecular generation, is intrinsically linked to how a molecule is represented for a computer model [1]. This guide provides a comparative evaluation of the dominant molecular string representations—SMILES, SELFIES, SMARTS, and IUPAC—framed within the broader thesis that no single representation is universally superior; rather, the optimal choice is dictated by the specific use case and the model employed [17] [1]. We summarize critical performance data and detail experimental methodologies to equip researchers with the information needed to make informed decisions.
The performance of molecular representations varies significantly across different tasks and model architectures. The following tables synthesize quantitative findings from recent comparative studies.
Table 1: Performance Summary in Molecular Generation Tasks (Diffusion Models)
| Representation | Validity Rate | Novelty & Diversity | QED Score | SA Score | QEPPI Score |
|---|---|---|---|---|---|
| SMILES | Moderate [66] | Substantial differences [17] | Moderate [17] | Best [17] | Best [17] |
| SELFIES | 100% [74] [16] | High similarity to SMARTS [17] | Best [17] | Good [17] | Good [17] |
| SMARTS | High [17] | High similarity to SELFIES [17] | Best [17] | Good [17] | Good [17] |
| IUPAC | High [17] | Best [17] | Moderate [17] | Moderate [17] | Moderate [17] |
Table 2: Performance Summary in Predictive Modeling Tasks
| Representation | Tokenization | HIV (ROC-AUC) | Toxicity (ROC-AUC) | BBB Penetration (ROC-AUC) | SIDER (ROC-AUC) |
|---|---|---|---|---|---|
| SMILES | BPE [1] | 0.805 [1] | 0.855 [1] | 0.954 [1] | - |
| SMILES | APE [1] | 0.824 [1] | 0.881 [1] | 0.965 [1] | - |
| SELFIES | BPE [1] | 0.781 [1] | 0.842 [1] | 0.943 [1] | - |
| SELFIES | APE [1] | 0.794 [1] | 0.861 [1] | 0.951 [1] | - |
| SELFIES (Augmented) | QK-LSTM [2] | - | - | - | 0.759 [2] |
| SMILES (Augmented) | QK-LSTM [2] | - | - | - | 0.717 [2] |
Table 3: Suitability for Different Use Cases
| Use Case | Recommended Representation | Key Rationale |
|---|---|---|
| Deep Generative Models (e.g., VAEs, GAs) | SELFIES | 100% robustness ensures all generated strings are valid molecules, simplifying model architecture [74] [16]. |
| High-Accuracy Property Prediction | SMILES (with APE tokenization) | Demonstrates superior performance in classification tasks like toxicology and BBB penetration [1]. |
| Optical Chemical Structure Recognition (OCSR) | SMILES | Showed the best overall performance in translating chemical images to structures [66]. |
| Exploring Novel Chemical Space | IUPAC | Excels in generating novel and diverse molecules in diffusion models [17]. |
| Reaction Prediction & Rule-based | SMARTS | Designed for representing molecular patterns and reaction rules [17]. |
To critically assess the comparative data, an understanding of the underlying experimental methodologies is essential.
A 2025 study provided a direct comparison of SMILES, SELFIES, SMARTS, and IUPAC representations using a consistent diffusion model framework [17].
This research investigated how tokenization methods affect chemical language model performance on benchmark classification tasks [1] [26].
This study compared string representations for the specific task of translating 2D chemical images into computer-readable structures [66].
The following diagram illustrates the standard workflow for empirically comparing different molecular representations, as employed in the cited studies.
This diagram contrasts the two primary tokenization methods, BPE and APE, used to process molecular strings for transformer-based models.
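The BPE side of that contrast can be sketched in a few lines: BPE greedily merges the most frequent adjacent symbol pair in a corpus, with no notion of atoms, so the merges it learns need not align with chemical units. The toy single-merge step below (an illustrative sketch, not a full BPE trainer) shows that after merging the dominant pair, the "l" of chlorine can still be stranded as its own symbol.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across all tokenized strings."""
    pairs = Counter()
    for symbols in corpus:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# BPE starts from single characters, so 'C' and 'l' in chlorine begin
# as separate symbols in this toy corpus.
corpus = [list("CCO"), list("CCN"), list("CCCl")]
pair, count = most_frequent_pair(corpus)
```

On this corpus the first learned merge is ("C", "C"), not ("C", "l"): `merge_pair(list("CCCl"), pair)` gives `["CC", "C", "l"]`, leaving a chemically meaningless "l" token, whereas an atom-aware tokenizer like APE would never split the chlorine atom in the first place.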
Table 4: Essential Tools for Molecular Representation Research
| Tool / Resource | Function | Relevance in Research |
|---|---|---|
| RDKit | An open-source cheminformatics toolkit [66]. | Used for molecule manipulation, visualization, descriptor calculation, and converting between representation formats. Critical for data preprocessing and validation. |
| SELFIES (Python Package) | A library for encoding and decoding SELFIES strings [74] [16]. | Installed via pip install selfies. Essential for any research workflow involving the generation or analysis of SELFIES representations. |
| MoleculeNet Benchmark | A standardized benchmark suite for molecular machine learning [1] [2]. | Provides curated datasets (e.g., HIV, Tox21, SIDER) and evaluation protocols, ensuring fair and consistent comparison of model performance across studies. |
| Transformer Models (e.g., BERT) | Neural network architectures for sequence processing [1]. | The standard model architecture for evaluating representation and tokenization schemes in predictive classification tasks. |
| Diffusion Models | A class of generative models [17]. | Increasingly the state-of-the-art for evaluating molecular representation in generative tasks like de novo molecule design. |
| Atom Pair Encoding (APE) | A specialized tokenization method for chemical strings [1]. | A key reagent for enhancing model performance by ensuring tokens map to chemically meaningful units like atoms and bonds in context. |
The evaluation of SMILES and SELFIES reveals a nuanced landscape where the optimal molecular representation is highly task-dependent. SMILES often demonstrates strong performance in classification and prediction tasks, particularly when paired with advanced tokenizers like Atom Pair Encoding (APE). In contrast, SELFIES provides critical robustness for generative models by guaranteeing molecular validity, though it may produce longer token sequences. Emerging fragment-based representations like fragSMILES show great promise, especially in handling chirality and synthetic planning. Future directions point toward hybrid models that combine the strengths of string-based and graph-based representations, alongside increased focus on data quality and model interpretability. These advancements will significantly accelerate AI-driven drug discovery by enabling more accurate, efficient, and reliable exploration of chemical space.