🚀 Step by step guide

Welcome to this notebook demonstrating the initial use of RobotU Molkit, a Python toolkit designed to streamline chemical data processing by integrating AI-powered enrichment and similarity search capabilities.

robotu-molkit connects directly to PubChem to fetch and structure raw compound data, then leverages IBM Granite models for advanced semantic processing. This includes:

🧪 Molecule Summarization: Automatically generate clear, concise natural-language summaries of chemical compounds.
🧠 Semantic Embedding Generation: Transform structured chemical data into high-dimensional embeddings optimized for similarity search.
🔍 Canonical Query Interpretation: Understand and standardize natural-language queries, enabling smarter search workflows.
🔗 Similarity Search: Perform deep similarity analysis using both semantic embeddings and structural metrics (e.g., Tanimoto similarity).
⚡ Optional FAST Mode: Use lightweight embeddings for speed-critical tasks or when computational resources are limited.

This notebook will walk through the core functionalities of the library, from ingesting molecular records to performing semantic embedding and search. Whether you’re building a searchable compound database, analyzing structure–activity relationships, or simply exploring AI-enhanced cheminformatics, RobotU Molkit offers a modular and powerful foundation.

⚙️ Before running the examples, make sure to install the package using:
pip install robotu-molkit

⚠️ Before you begin…

To use RobotU Molkit’s AI features, you must first configure your IBM watsonx credentials (API Key and Project ID).

If you haven’t done that yet, follow the step-by-step guide here:

➡️ Set up your watsonx credentials →

This is required before running semantic enrichment or vector-based search.

!pip install robotu-molkit

Step 1 – Ingest 91 Molecules from PubChem#

The command below uses the molkit ingest subcommand to fetch compound records from PubChem by CID (Compound ID) and parse them into structured Molecule JSON files ready for enrichment and analysis. The file molecule_cids.txt lists 91 CIDs, one per line. The corresponding compound names are in the molecules_cids_with_names.csv file.

!molkit ingest --file "molecule_cids.txt" --concurrency 2

What This Does#

Downloads raw 3D compound records from PubChem using the PUG-REST API.
Parses each record into a Molecule JSON object (with schema-compliant fields).
Stores both raw and parsed data in designated folders under data/.
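
To make the download step concrete, here is a minimal sketch of how a PUG-REST request URL for a 3D compound record can be built. The `compound/cid/<CID>/record/JSON` path and the `record_type=3d` option come from PubChem's PUG-REST documentation. Whether molkit ingest calls exactly this endpoint is an assumption, and `record_url` is a hypothetical helper, not part of robotu-molkit.

```python
from urllib.parse import urlencode

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def record_url(cid: int, record_type: str = "3d") -> str:
    """Build a PUG-REST URL for one compound record (hypothetical helper)."""
    cid = int(cid)
    if cid <= 0:
        raise ValueError(f"CID must be a positive integer, got {cid}")
    query = urlencode({"record_type": record_type})
    return f"{PUG_REST_BASE}/compound/cid/{cid}/record/JSON?{query}"


# Aspirin (CID 2244), one of the compounds used in this notebook
print(record_url(2244))
```

The sketch only builds the URL; it does not send a request, so it can be run without network access.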

Available Options#

Option	Description
`--file`, `-f`	Path to a text file with one CID per line. Useful for bulk ingestion.
`--concurrency`, `-c`	Number of parallel workers fetching data (default: `5`). You may reduce this to avoid hitting PubChem’s rate limits (max 5 requests/sec, 400/min).
`--raw-dir`, `-r`	Directory to store raw JSON files from PubChem (default: `data/downloaded_data`).
`--parsed-dir`, `-p`	Directory to store parsed Molecule JSON payloads (default: `data/parsed`).
`CID(s)` (positional)	You can also pass one or more CIDs directly as arguments instead of using `--file`.

⚠️ Note: Either --file or positional CID(s) must be provided. If none are supplied, the CLI will exit with an error.
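
If you want to build your own input for --file, the sketch below writes a list of CIDs in the one-CID-per-line layout described in the table above. `write_cid_file` and the file name `my_cids.txt` are hypothetical; the de-duplication and validation are conveniences of this sketch, not documented behavior of the CLI.

```python
from pathlib import Path
from typing import Iterable


def write_cid_file(cids: Iterable[int], path: str) -> int:
    """Write unique, positive CIDs to `path`, one per line; return how many were written."""
    seen = set()
    ordered = []
    for cid in cids:
        cid = int(cid)
        if cid <= 0:
            raise ValueError(f"CIDs must be positive integers, got {cid}")
        if cid not in seen:  # drop duplicates but keep first-seen order
            seen.add(cid)
            ordered.append(cid)
    Path(path).write_text("".join(f"{c}\n" for c in ordered), encoding="utf-8")
    return len(ordered)
```

For example, `write_cid_file([2244, 5957, 2519, 2244], "my_cids.txt")` writes three lines, which can then be passed as `molkit ingest --file "my_cids.txt"`.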

Example with Direct CIDs#

!molkit ingest 2244 5957 2519 --concurrency 3

Rate Limits and Reliability#

To respect PubChem’s API limits:

Maximum 5 requests per second
Maximum 400 requests per minute
Each request has a timeout of 30 seconds

Using --concurrency 2 is a safe option that balances performance and reliability, especially when fetching a large number of compounds.
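
For readers who call PubChem from their own scripts, the two limits above can be enforced together with a sliding-window limiter. This is a minimal sketch under stated assumptions: `SlidingWindowLimiter` and `acquire` are hypothetical names, and nothing here describes how molkit ingest throttles requests internally.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `period`-second window."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self._stamps = deque()  # timestamps of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget calls that have left the window
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.period - (now - self._stamps[0])

    def record(self) -> None:
        """Register that a call is being made now."""
        self._stamps.append(self.clock())


def acquire(limiters, sleep=time.sleep) -> None:
    """Block until every limiter allows a call, then record it on all of them."""
    while True:
        delay = max(lim.wait_time() for lim in limiters)
        if delay <= 0:
            break
        sleep(delay)
    for lim in limiters:
        lim.record()


# The two PubChem limits quoted above
per_second = SlidingWindowLimiter(max_calls=5, period=1.0)
per_minute = SlidingWindowLimiter(max_calls=400, period=60.0)
```

Calling `acquire([per_second, per_minute])` before each request keeps a single-process script under both limits. With several workers, the limiters would need to be shared and guarded by a lock.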

In the next step, we’ll enrich these molecules with natural-language summaries and semantic embeddings.

!molkit ingest --file "molecule_cids.txt" --concurrency 2

INFO: Starting ingest of 91 CIDs...
Procesado registro CID 2519
Procesado registro CID 5429
Procesado registro CID 2153
Procesado registro CID 4534
Procesado registro CID 4744
Procesado registro CID 3758
Procesado registro CID 89594
Procesado registro CID 8378
Procesado registro CID 7028
Procesado registro CID 118943
Procesado registro CID 5816
Procesado registro CID 439260
Procesado registro CID 6041
Procesado registro CID 4236
Procesado registro CID 2244
Procesado registro CID 3672
Procesado registro CID 1983
Procesado registro CID 156391
Procesado registro CID 3033
Procesado registro CID 3825
Procesado registro CID 3715
Procesado registro CID 2662
Procesado registro CID 54676228
Procesado registro CID 3306
Procesado registro CID 3394
Procesado registro CID 4781
Procesado registro CID 4055
Procesado registro CID 4485
Procesado registro CID 4783
Procesado registro CID 4044
Procesado registro CID 5345
Procesado registro CID 5120
Procesado registro CID 3826
Procesado registro CID 681
Procesado registro CID 5202
Procesado registro CID 774
Procesado registro CID 119
Procesado registro CID 33032
Procesado registro CID 187
Procesado registro CID 896
Procesado registro CID 1123
Procesado registro CID 239
Procesado registro CID 6047
Procesado registro CID 1001
Procesado registro CID 6057
Procesado registro CID 1150
Procesado registro CID 5901
Procesado registro CID 1030
Procesado registro CID 865
Procesado registro CID 439203
Procesado registro CID 612
Procesado registro CID 311
Procesado registro CID 176
Procesado registro CID 284
Procesado registro CID 243
Procesado registro CID 338
Procesado registro CID 1110
Procesado registro CID 971
Procesado registro CID 107735
Procesado registro CID 525
Procesado registro CID 54670067
Procesado registro CID 750
Procesado registro CID 5950
Procesado registro CID 6137
Procesado registro CID 5957
Procesado registro CID 5961
Procesado registro CID 232
Procesado registro CID 247
Procesado registro CID 1176
Procesado registro CID 586
Procesado registro CID 2236
Procesado registro CID 5184
Procesado registro CID 5288826
Procesado registro CID 5284371
Procesado registro CID 245005
Procesado registro CID 6167
Procesado registro CID 2353
Procesado registro CID 8549
Procesado registro CID 5280953
Procesado registro CID 5770
Procesado registro CID 4727
Procesado registro CID 969516
Procesado registro CID 445154
Procesado registro CID 5280343
Procesado registro CID 5280961
Procesado registro CID 9064
Procesado registro CID 1548943
Procesado registro CID 6989
Procesado registro CID 3314
Procesado registro CID 1183
Procesado registro CID 5281515
INFO: Done! Raw → data/downloaded_data | Parsed → data/parsed
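
A quick sanity check after ingestion is to compare the number of requested CIDs with the number of parsed files. The sketch below assumes that the parsed directory holds one `.json` file per molecule; that layout is an assumption of this example, and `ingest_counts` is a hypothetical helper.

```python
from pathlib import Path
from typing import Tuple


def ingest_counts(cid_file: str, parsed_dir: str) -> Tuple[int, int]:
    """Return (distinct CIDs requested, number of *.json files in parsed_dir)."""
    lines = Path(cid_file).read_text(encoding="utf-8").splitlines()
    requested = {line.strip() for line in lines if line.strip()}
    parsed = [p for p in Path(parsed_dir).glob("*.json") if p.is_file()]
    return len(requested), len(parsed)
```

For the run above you would call `ingest_counts("molecule_cids.txt", "data/parsed")` and check that both numbers agree.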

Step 2 – Configuring Watsonx Credentials#

Store your IBM Watsonx credentials locally using:

!molkit config --watsonx-api-key "your-api-key" --watsonx-project-id "your-project-id"

Credentials are stored at: ~/.config/molkit/config.json

These credentials are then globally available to all molkit commands in your environment.

Runtime Credential Resolution#

If you need to resolve credentials dynamically (e.g. inside a script or notebook), use the CredentialsManager utility. It checks in the following order:

CLI overrides
Environment variables (IBM_API_KEY, IBM_PROJECT_ID, WATSONX_URL)
Local config file (~/.config/molkit/config.json)
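
The precedence order above can be illustrated with a few lines of plain Python. This is a sketch of the lookup order only, not the CredentialsManager implementation: `resolve_setting` is a hypothetical function, and the config key name `api_key` is a placeholder, since the notebook does not show the layout of config.json.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    env_name: str,
    config_key: str,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty value among CLI override, environment, and config."""
    env = os.environ if env is None else env
    config = {} if config is None else config
    for candidate in (cli_value, env.get(env_name), config.get(config_key)):
        if candidate:  # skip None and empty strings
            return candidate
    return None


# The environment variable wins because no CLI override is given
value = resolve_setting(
    cli_value=None,
    env_name="IBM_API_KEY",
    config_key="api_key",  # hypothetical key name
    env={"IBM_API_KEY": "from-env"},
    config={"api_key": "from-file"},
)
print(value)
```

Passing `env` and `config` as plain mappings keeps the sketch easy to test without touching the real environment or the config file.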

Usage Guide – `CredentialsManager`#

from molkit.auth import CredentialsManager

# Load credentials
api_key, project_id = CredentialsManager.load()

# Load Watsonx service URL
watsonx_url = CredentialsManager.get_watsonx_url()

# Set or update values in ~/.config/molkit/config.json
CredentialsManager.set_api_key("your-api-key")
CredentialsManager.set_project_id("your-project-id")
CredentialsManager.set_watsonx_url("your-watsonx-url")

This is the preferred way to inject or retrieve credentials programmatically during embedding or search workflows.

!molkit config   --watsonx-api-key WATSONX_API_KEY   --watsonx-project-id WATSONX_PROJECT_ID

Credentials saved to /home/user/.config/molkit/config.json

Step 3 – Generate Semantic Embeddings with IBM Watsonx#

Once the molecules are parsed from PubChem, the next step is enriching each entry using advanced semantic embedding models provided by IBM Granite. The molkit embed command generates concise natural-language summaries and semantic embeddings for each parsed molecule.

!molkit embed --parsed-dir "data/parsed" --out-dir "data/vectors" --fast

What Happens During Embedding?#

Reads each structured Molecule JSON from the parsed-dir folder.
Generates:
- A concise natural-language summary using the Granite Instruct model (ibm/granite-3-8b-instruct).
- High-dimensional semantic embeddings to support similarity searches.
Stores these embeddings in a JSONL file (by default at data/vectors/watsonx_vectors.jsonl).

CLI Options for `molkit embed`#

Option	Description
`--parsed-dir`, `-p`	Directory containing parsed molecule JSON files (default: `data/parsed`).
`--out-dir`, `-o`	Directory to store the generated embeddings (default: `data/vectors`).
`--model`, `-m`	Embedding model ID. (default: `ibm/granite-embedding-278m-multilingual`).
`--fast`	Use the faster, smaller model (`ibm/granite-embedding-107m-multilingual`).
`--watsonx-api-key`, `-k`	IBM Watsonx API key (overrides environment or configuration settings).
`--watsonx-project-id`, `-j`	IBM Watsonx Project ID (overrides environment or configuration settings).
`--watsonx-url`	Watsonx API endpoint URL (default: `https://us-south.ml.cloud.ibm.com`).

⚠️ Important: IBM credentials are required and must be provided through the configuration file, environment variables, or explicitly via the --watsonx-api-key and --watsonx-project-id flags.

Choosing the Right Model#

Default Mode (`ibm/granite-embedding-278m-multilingual`)#

Embedding dimensions: 768
Advantages:
- Higher precision and greater semantic accuracy.
- Ideal for detailed chemical analysis, precise molecular similarity searches, drug discovery, and predictive modeling tasks.
Recommended for chemists who require detailed semantic insights and maximum embedding quality.

FAST Mode (`ibm/granite-embedding-107m-multilingual`)#

Embedding dimensions: 384
Advantages:
- Lower computational cost and faster execution times.
- Best suited for quick exploratory analyses, high-throughput scenarios, or resource-constrained environments.
Recommended for initial testing or prototyping.

Example Commands#

Default model usage (recommended for accuracy):

!molkit embed

FAST model usage (recommended for speed):

!molkit embed --fast

What’s Next?#

The resulting watsonx_vectors.jsonl file contains a summary and a high-dimensional embedding for each molecule, ready for local vector search using faiss-cpu.
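
Before searching, it can be useful to confirm that every record in the JSONL file carries a vector of the same length. The sketch below reads a JSONL file line by line and counts records per vector length. The field name `embedding` is an assumption, since the notebook does not show the record schema, and `embedding_dims` is a hypothetical helper; pass a different `vector_key` if your file uses another name.

```python
import json
from collections import Counter
from pathlib import Path


def embedding_dims(jsonl_path: str, vector_key: str = "embedding") -> Counter:
    """Count records per embedding length in a JSONL file (one JSON object per line)."""
    dims = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            vector = record.get(vector_key)
            # Records without a list under `vector_key` are counted under None
            dims[len(vector) if isinstance(vector, list) else None] += 1
    return dims
```

A healthy file should produce a single key, for example one vector length shared by all 91 records. More than one key suggests that runs with different models were mixed in the same file.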

!molkit embed --watsonx-url="https://us-south.ml.cloud.ibm.com"

Embedding model: ibm/granite-embedding-278m-multilingual
✅  Embeddings written to data/vectors/watsonx_vectors.jsonl

Step 4 - Local Semantic Search#

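With your molecule embeddings in place, you can now query them against a local FAISS index built with faiss-cpu. The similarity lookup runs on your machine; query enrichment and query embedding still call the Granite models, so watsonx credentials are required.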

We’ll leverage the LocalSearch class to:

Enrich your natural-language query via Granite Instruct
Generate a matching embedding with the same model used earlier
Perform a similarity search against the FAISS index

Let’s execute our first query.

from robotu_molkit.search.searcher import LocalSearch
from robotu_molkit.constants import DEFAULT_JSONL_FILE_ROUTE

# Path to the JSONL file with precomputed embeddings and metadata
JSONL_PATH    = DEFAULT_JSONL_FILE_ROUTE  # e.g. "data/vectors/watsonx_vectors.jsonl"

# Minimum similarity score (cosine) to consider a hit relevant
SIM_THRESHOLD = 0.70

# After filtering by SIM_THRESHOLD, return this many top results
TOP_K         = 20

# Number of nearest neighbors to fetch from FAISS before filtering
FAISS_K       = 300
# --------------------------------------------------------------------------- #

# Initialize searcher
searcher = LocalSearch(jsonl_path=JSONL_PATH)
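
For intuition about what the index lookup computes, here is a dependency-free sketch of a cosine-similarity search that follows the three parameters defined above: fetch a candidate pool, optionally drop hits below a threshold, then keep the best few. It is not the LocalSearch implementation; `cosine` and `top_hits` are hypothetical names, and a brute-force scan stands in for the FAISS index.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_hits(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    faiss_k: int,
    top_k: int,
    sim_threshold: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Rank `vectors` by similarity to `query` and return up to `top_k` (key, score) pairs."""
    scored = sorted(((cosine(query, v), key) for key, v in vectors.items()), reverse=True)
    candidates = scored[:faiss_k]  # stand-in for the nearest-neighbour fetch
    if sim_threshold is not None:
        candidates = [c for c in candidates if c[0] >= sim_threshold]
    return [(key, score) for score, key in candidates[:top_k]]


# Toy two-dimensional "embeddings"
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_hits([1.0, 0.0], toy, faiss_k=3, top_k=2))
```

A brute-force scan like this is fine for 91 vectors. An index such as FAISS becomes worthwhile when the collection grows to the point where scanning every vector per query is too slow.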

Simple Semantic Search Example#

In this example, we’ll run a purely semantic query against our local FAISS index of 91 compounds, focusing on central nervous system stimulants.

Query:

“central nervous system stimulants”

Expected top hits include methylxanthine derivatives like caffeine, theobromine, and theophylline.

# Define the query text
query_text = "central nervous system stimulants"

# Perform a semantic-only search (no structural or metadata filters)
results = searcher.search_by_semantics(
    query_text=query_text, top_k=TOP_K, faiss_k=FAISS_K,
)

Displaying Results as a Table#

Format the top semantic-search hits into a plain-text table for clear comparison.

# Prepare table header and rows
header = f"{'CID':<8} {'Name':<20} {'MW':<8} {'Sol':<10} {'Score':<6}"
rows = []
for m, s in results:
    cid  = m['cid']
    name = m.get('name', '<unknown>')
    mw   = m.get('molecular_weight', 0)
    sol  = m.get('solubility_tag', '')
    rows.append(f"{cid:<8} {name:<20} {mw:<8.1f} {sol:<10} {s:<.3f}")

# Print query and table
print(f"Results for query: \"{query_text}\"\n")
print(header)
print('-' * len(header))
print("\n".join(rows))

Results for query: "central nervous system stimulants"

CID      Name                 MW       Sol        Score 
--------------------------------------------------------
187      acetylcholine        146.2    soluble    0.620
5202     serotonin            176.2    moderately soluble 0.620
5770     reserpine            608.7    insoluble  0.616
245005   aconitine            645.7    sparingly soluble 0.614
439260   norepinephrine       169.2    soluble    0.609
774      histamine            111.1    soluble    0.592
1548943  Capsaicin            305.4    sparingly soluble 0.587
89594    nicotine             162.2    moderately soluble 0.580
1150     tryptamine           160.2    moderately soluble 0.579
239      beta-alanine         89.1     very soluble 0.577
681      dopamine             153.2    soluble    0.574
5184     (-)-Scopolamine      303.4    moderately soluble 0.572
896      Melatonin            232.3    moderately soluble 0.572
7028     PSEUDOEPHEDRINE      165.2    moderately soluble 0.571
1983     acetaminophen        151.2    moderately soluble 0.570
33032    L-glutamic acid      147.1    very soluble 0.570
586      creatine             131.1    very soluble 0.569
2519     caffeine             194.2    soluble    0.564
156391   NAPROXEN             230.3    sparingly soluble 0.562
5816     epinephrine          183.2    soluble    0.560

Result Analysis: Semantic Query – “Central Nervous System Stimulants”#

We performed a semantic-only search across a local FAISS index of 91 molecules using the query:

“central nervous system stimulants”

The model ranked molecules based on their conceptual relevance to the query, using no structural constraints.

Observations#

The 20 returned hits include caffeine (ranked 18th, score 0.564), nicotine, pseudoephedrine, dopamine, serotonin, and epinephrine: compounds with stimulant activity or a role in neural signaling. Scores are tightly clustered (0.560 to 0.620), so rank order within the list carries limited weight. Theobromine and theophylline, which we expected to see, are not among the top 20.
Neurotransmitters such as acetylcholine, norepinephrine, and histamine also appear, consistent with their roles in neural signaling.
Some biosynthetic precursors or modulators (e.g., tryptamine, beta-alanine, melatonin) are included for their indirect relevance to neurophysiology.
A few results, such as reserpine, aconitine, and naproxen, are false positives in this context. None is a CNS stimulant, although reserpine and aconitine do act on the nervous system.

What This Tells Us#

Granite Instruct effectively enriched the input query, capturing both direct stimulants and adjacent biochemical actors.
Granite Embedding produced vectors that reflect semantic relationships, retrieving molecules that span neurotransmitters, methylxanthines, and sympathomimetics.
No structural similarity was enforced — hence the occasional inclusion of pharmacologically unrelated compounds.

This confirms that semantic-only search is useful for broad hypothesis generation or contextual exploration, especially when the query involves functional descriptions instead of precise chemical patterns.

For more refined control, we can now combine semantic signals with structural filters.

Combined Semantic + Structural Search#

To improve result specificity, we now combine semantic similarity with structural filtering using:

search_by_semantics_and_structure

This method first identifies semantically relevant candidates, then refines the list by applying Tanimoto similarity against inferred molecular scaffolds.

What It Does#

🧠 Granite Instruct enriches and interprets the user query to infer one or more scaffold SMILES.
🧪 Granite Embedding embeds the query to locate semantically similar molecules in the FAISS index.
🧬 Tanimoto filtering compares structural fingerprints of candidates to the inferred scaffolds and keeps only those above a similarity threshold (e.g. ≥ 0.70).
✅ Final results combine semantic relevance and structural alignment.
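
The Tanimoto step can be sketched without any cheminformatics library by treating each fingerprint as the set of its "on" bit positions. The formula is the standard one: shared bits divided by the bits present in either fingerprint. The bit positions below are made up for illustration, and `tanimoto` and `best_scaffold_match` are hypothetical names rather than robotu-molkit functions.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def best_scaffold_match(candidate: Set[int], scaffolds: Iterable[Set[int]]) -> float:
    """Highest Tanimoto similarity between a candidate and any scaffold fingerprint."""
    return max((tanimoto(candidate, s) for s in scaffolds), default=0.0)


# Made-up bit positions, for illustration only
scaffold = {1, 2, 3}
candidate = {2, 3, 4}
print(tanimoto(scaffold, candidate))  # 2 shared bits / 4 bits in the union
```

A candidate would be kept when `best_scaffold_match` reaches the chosen threshold. Taking the maximum over scaffolds is one reasonable choice when several scaffolds are inferred; the library may combine them differently.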

Why Use It?#

This hybrid approach filters out semantically plausible but structurally unrelated molecules, giving priority to compounds that are both functionally and chemically close to the query.

Perfect for:

Querying drug analogues
Scanning scaffold-specific chemotypes
Prioritizing leads in a known structural class

Let’s try it with the same query:

“central nervous system stimulants”

This time, we expect to retrieve methylxanthine-like structures with more consistent chemistry.

# Query string
query_text = "central nervous system stimulants"

# Tanimoto cutoff for the structural filter, kept in one variable so the
# search call and the printed label cannot drift apart
TANIMOTO_THRESHOLD = 0.45

# Run combined semantic + structural search
results = searcher.search_by_semantics_and_structure(
    query_text=query_text,
    top_k=TOP_K,
    faiss_k=FAISS_K,
    sim_threshold=TANIMOTO_THRESHOLD,
)

# Format results
entries = [
    f"CID {m['cid']} Name:{m.get('name','<unknown>')} "
    f"MW:{m.get('molecular_weight',0):.1f} "
    f"Sol:{m.get('solubility_tag','')} "
    f"Score:{s:.3f} Tanimoto:{sim:.2f}"
    for m, s, sim in results
]

print(
    f"Results for query: \"{query_text}\"\n"
    f"Top {len(entries)} hits (Granite-inferred scaffolds, Tanimoto ≥ {TANIMOTO_THRESHOLD}):\n"
    + "\n".join(entries)
    + "\n\nNote: Scaffold inference was performed using IBM's granite-3-8b-instruct model. "
      "Semantic and structural similarity search was powered by granite-embedding-278m-multilingual."
)

🔍 Inferred scaffolds: ['amphetamine', 'methamphetamine', 'caffeine']
⚠️ Failed to resolve scaffold 'amphetamine': 'CID 3007 not found in index.'
⚠️ Failed to resolve scaffold 'methamphetamine': 'CID 10836 not found in index.'
✅ CID 2519 for 'caffeine' → ECFP vector loaded
→ CID 187 Name:acetylcholine Tanimoto: 0.07
→ CID 5202 Name:serotonin Tanimoto: 0.04
→ CID 5770 Name:reserpine Tanimoto: 0.07
→ CID 245005 Name:aconitine Tanimoto: 0.06
→ CID 439260 Name:norepinephrine Tanimoto: 0.07
→ CID 774 Name:histamine Tanimoto: 0.07
→ CID 1548943 Name:Capsaicin Tanimoto: 0.10
→ CID 89594 Name:nicotine Tanimoto: 0.13
→ CID 1150 Name:tryptamine Tanimoto: 0.04
→ CID 239 Name:beta-alanine Tanimoto: 0.03
→ CID 681 Name:dopamine Tanimoto: 0.05
→ CID 5184 Name:(-)-Scopolamine Tanimoto: 0.12
→ CID 896 Name:Melatonin Tanimoto: 0.07
→ CID 7028 Name:PSEUDOEPHEDRINE Tanimoto: 0.10
→ CID 1983 Name:acetaminophen Tanimoto: 0.10
→ CID 33032 Name:L-glutamic acid Tanimoto: 0.02
→ CID 586 Name:creatine Tanimoto: 0.05
→ CID 2519 Name:caffeine Tanimoto: 1.00
→ CID 156391 Name:NAPROXEN Tanimoto: 0.10
→ CID 5816 Name:epinephrine Tanimoto: 0.06
→ CID 9064 Name:Cianidanol Tanimoto: 0.05
→ CID 311 Name:citric acid Tanimoto: 0.03
→ CID 3672 Name:ibuprofen Tanimoto: 0.11
→ CID 6041 Name:phenylephrine Tanimoto: 0.06
→ CID 4485 Name:nifedipine Tanimoto: 0.09
→ CID 2353 Name:berberine Tanimoto: 0.08
→ CID 4744 Name:PERAZINE Tanimoto: 0.10
→ CID 3715 Name:indomethacin Tanimoto: 0.08
→ CID 2244 Name:aspirin Tanimoto: 0.09
→ CID 4781 Name:phenylbutazone Tanimoto: 0.13
→ CID 6047 Name:levodopa Tanimoto: 0.06
→ CID 2662 Name:celecoxib Tanimoto: 0.15
→ CID 3826 Name:ketorolac Tanimoto: 0.08
→ CID 54676228 Name:piroxicam Tanimoto: 0.14
→ CID 5284371 Name:codeine Tanimoto: 0.07
→ CID 4534 Name:NORDIHYDROGUAIARETIC ACID Tanimoto: 0.10
→ CID 5961 Name:L-glutamine Tanimoto: 0.02
→ CID 3033 Name:diclofenac Tanimoto: 0.08
→ CID 247 Name:betaine Tanimoto: 0.08
→ CID 750 Name:glycine Tanimoto: 0.03
→ CID 232 Name:DL-Arginine Tanimoto: 0.02
→ CID 1123 Name:taurine Tanimoto: 0.03
→ CID 6167 Name:colchicine Tanimoto: 0.07
→ CID 4236 Name:modafinil Tanimoto: 0.06
→ CID 2153 Name:theophylline Tanimoto: 0.47
→ CID 5288826 Name:morphine Tanimoto: 0.08
→ CID 5280343 Name:quercetin Tanimoto: 0.07
→ CID 3306 Name:Etilefrine Tanimoto: 0.06
→ CID 119 Name:4-Aminobutanoic acid Tanimoto: 0.05
→ CID 2236 Name:Aristolochic acid Tanimoto: 0.06
→ CID 6137 Name:L-methionine Tanimoto: 0.05
→ CID 3394 Name:flurbiprofen Tanimoto: 0.10
→ CID 3825 Name:ketoprofen Tanimoto: 0.11
→ CID 6989 Name:THYMOL Tanimoto: 0.12
→ CID 4727 Name:DL-Penicillamine Tanimoto: 0.05
→ CID 969516 Name:curcumin Tanimoto: 0.08
→ CID 5280961 Name:genistein Tanimoto: 0.08
→ CID 6057 Name:L-tyrosine Tanimoto: 0.07
→ CID 5901 Name:6-AZAURIDINE Tanimoto: 0.15
→ CID 5280953 Name:HARMINE Tanimoto: 0.10
→ CID 8378 Name:Framycetin Tanimoto: 0.02
→ CID 54670067 Name:l-ascorbic acid Tanimoto: 0.07
→ CID 3314 Name:eugenol Tanimoto: 0.06
→ CID 1183 Name:vanillin Tanimoto: 0.09
→ CID 445154 Name:resveratrol Tanimoto: 0.05
→ CID 971 Name:oxalic acid Tanimoto: 0.03
→ CID 4044 Name:mefenamic acid Tanimoto: 0.10
→ CID 5429 Name:theobromine Tanimoto: 0.52
→ CID 1110 Name:succinic acid Tanimoto: 0.03
→ CID 5950 Name:L-alanine Tanimoto: 0.09
→ CID 5957 Name:Adenosine triphosphate Tanimoto: 0.14
→ CID 5281515 Name:BETA-CARYOPHYLLENE Tanimoto: 0.08
→ CID 284 Name:formic acid Tanimoto: 0.03
→ CID 1001 Name:Phenethylamine Tanimoto: 0.05
→ CID 8549 Name:Quininae Tanimoto: 0.08
→ CID 3758 Name:3-Isobutyl-1-methylxanthine Tanimoto: 0.35
→ CID 5120 Name:IBZM Tanimoto: 0.08
→ CID 243 Name:benzoic acid Tanimoto: 0.09
→ CID 5345 Name:Sulfobromophthalein Tanimoto: 0.07
→ CID 865 Name:2,6-Diaminopimelic acid Tanimoto: 0.03
→ CID 1176 Name:urea Tanimoto: 0.03
→ CID 107735 Name:pyruvate Tanimoto: 0.06
→ CID 4783 Name:phenyliodoundecynoate Tanimoto: 0.06
→ CID 612 Name:lactic acid Tanimoto: 0.09
→ CID 118943 Name:o-Isobutyltoluene Tanimoto: 0.10
→ CID 176 Name:acetic acid Tanimoto: 0.07
→ CID 338 Name:salicylic acid Tanimoto: 0.08
→ CID 4055 Name:menadione Tanimoto: 0.15
→ CID 525 Name:malic acid Tanimoto: 0.03
→ CID 1030 Name:propylene glycol Tanimoto: 0.06
→ CID 439203 Name:D-Arabinoketose Tanimoto: 0.05
✅ 3 of 91 passed Tanimoto ≥ 0.45
Results for query: "central nervous system stimulants"
Top 3 hits (Granite-inferred scaffolds, Tanimoto ≥ 0.45):
CID 2519 Name:caffeine MW:194.2 Sol:soluble Score:0.564 Tanimoto:1.00
CID 2153 Name:theophylline MW:180.2 Sol:soluble Score:0.535 Tanimoto:0.47
CID 5429 Name:theobromine MW:180.2 Sol:soluble Score:0.512 Tanimoto:0.52

Note: Scaffold inference was performed using IBM's granite-3-8b-instruct model. Semantic and structural similarity search was powered by granite-embedding-278m-multilingual.

Result Analysis: Combined Semantic + Structural Search (Tanimoto ≥ 0.45)#

We executed a query for:

“central nervous system stimulants”

Using search_by_semantics_and_structure, this time with a Tanimoto threshold of 0.45.

Inferred Scaffolds#

amphetamine ❌ not present in the demo index
methamphetamine ❌ not present in the demo index
caffeine ✅ resolved and used for fingerprint comparison

Only caffeine was found among the 91 indexed molecules and used as the reference scaffold.

Top Hits (Tanimoto ≥ 0.45)#

CID	Name	Score	Tanimoto
2519	caffeine	0.564	1.00
2153	theophylline	0.535	0.47
5429	theobromine	0.512	0.52

These results reflect canonical methylxanthine CNS stimulants:

Caffeine is the direct scaffold.
Theobromine and theophylline are well-known derivatives with similar stimulant properties and shared structural motifs.
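
To see how sensitive the hit list is to the cutoff, the snippet below re-applies three thresholds to a handful of Tanimoto values copied from the run log above. Only the `passing` helper is new (and hypothetical); the numbers are the ones reported for the caffeine scaffold.

```python
# Tanimoto values against the caffeine scaffold, copied from the run log
tanimoto_to_caffeine = {
    "caffeine": 1.00,
    "theobromine": 0.52,
    "theophylline": 0.47,
    "3-Isobutyl-1-methylxanthine": 0.35,
    "celecoxib": 0.15,
}


def passing(scores, threshold):
    """Names whose similarity meets the threshold, most similar first."""
    kept = [name for name, sim in scores.items() if sim >= threshold]
    return sorted(kept, key=lambda name: -scores[name])


for threshold in (0.70, 0.45, 0.30):
    print(threshold, passing(tanimoto_to_caffeine, threshold))
```

At 0.70 only caffeine itself survives, at 0.45 theobromine and theophylline join it, and at 0.30 the fourth xanthine in the set, 3-isobutyl-1-methylxanthine, is included as well.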

Interpretation#

The combination of semantic and structural similarity successfully retrieved the most relevant compounds in the context of this query.

Given that this is a demo subset with 91 molecules, these results demonstrate that:

Semantic enrichment correctly interpreted the functional intent of the query.
Scaffold-based filtering prioritized structurally consistent molecules.
Granite models are capable of aligning language and chemistry for precise and explainable search results.

Filtering Semantic Search with Metadata#

In addition to semantic and structural similarity, robotu-molkit allows you to apply filters to the molecule metadata returned by the query.

This enables powerful combinations like:

Search for functional concepts (e.g. “anti-inflammatory drugs”)
Filter by solubility, molecular weight, toxicity, or any numeric/tagged field

How Filters Work#

Filters are passed as a dictionary where keys correspond to metadata fields. You can use:

Exact match
{ "solubility_tag": "soluble" }
Range filters (numeric values)
{ "molecular_weight": (150, 300) }
Multiple allowed values
{ "solubility_tag": ["soluble", "moderately soluble"] }

These filters are applied after semantic similarity is computed, helping to refine the final results.
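
The three filter forms can be mimicked in a few lines, which may help when reasoning about why a molecule was kept or dropped. This is a sketch, not the library's implementation: `matches` is a hypothetical function, range bounds are assumed to be inclusive, and a molecule that lacks a filtered field is assumed to be rejected.

```python
def matches(meta: dict, filters: dict) -> bool:
    """Return True if the metadata record satisfies every filter rule."""
    for field, rule in filters.items():
        value = meta.get(field)
        if value is None:  # assumption: a missing field fails the filter
            return False
        if isinstance(rule, tuple) and len(rule) == 2:
            low, high = rule  # (low, high) range, assumed inclusive
            if not (low <= value <= high):
                return False
        elif isinstance(rule, (list, set)):
            if value not in rule:  # any of several allowed values
                return False
        elif value != rule:  # exact match
            return False
    return True


filters = {
    "molecular_weight": (200, 400),
    "solubility_tag": ["sparingly soluble", "moderately soluble"],
}
# Metadata values taken from the result tables in this notebook
ibuprofen = {"molecular_weight": 206.3, "solubility_tag": "sparingly soluble"}
caffeine = {"molecular_weight": 194.2, "solubility_tag": "soluble"}
print(matches(ibuprofen, filters), matches(caffeine, filters))
```

Ibuprofen satisfies both rules, while caffeine fails on both molecular weight and solubility tag.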

Example Query: “Anti-inflammatory agents”#

In this example, we’ll search for molecules related to:

“Anti-inflammatory agents”

…and restrict the results to those with:

Molecular weight between 200 and 400
Solubility tag = “sparingly soluble” or “moderately soluble”

This should surface NSAIDs (non-steroidal anti-inflammatory drugs) like ibuprofen, naproxen, diclofenac, etc., based on their semantics and physical properties.

# Query and filters
query_text = "anti-inflammatory agents"
filters = {
    "molecular_weight": (200, 400),
    "solubility_tag": ["sparingly soluble", "moderately soluble"]
}

# Run filtered semantic search
results = searcher.search_by_semantics(
    query_text=query_text,
    top_k=TOP_K,
    faiss_k=FAISS_K,
    filters=filters
)

# Print results
header = f"{'CID':<8} {'Name':<20} {'MW':<8} {'Solubility':<20} {'Score':<6}"
rows = [
    f"{m['cid']:<8} {m.get('name','<unknown>'):<20} "
    f"{m.get('molecular_weight', 0):<8.1f} {m.get('solubility_tag',''):<20} {s:<.3f}"
    for m, s in results
]

print(f"Results for query: \"{query_text}\" with filters:\n{filters}\n")
print(header)
print("-" * len(header))
print("\n".join(rows))

Results for query: "anti-inflammatory agents" with filters:
{'molecular_weight': (200, 400), 'solubility_tag': ['sparingly soluble', 'moderately soluble']}

CID      Name                 MW       Solubility           Score 
------------------------------------------------------------------
3672     ibuprofen            206.3    sparingly soluble    0.642
54676228 piroxicam            331.4    sparingly soluble    0.626
3825     ketoprofen           254.3    sparingly soluble    0.625
3394     flurbiprofen         244.3    sparingly soluble    0.617
3826     ketorolac            255.3    sparingly soluble    0.614
1548943  Capsaicin            305.4    sparingly soluble    0.611
5184     (-)-Scopolamine      303.4    moderately soluble   0.608
156391   NAPROXEN             230.3    sparingly soluble    0.606
4236     modafinil            273.4    moderately soluble   0.603
5280343  quercetin            302.2    sparingly soluble    0.593
9064     Cianidanol           290.3    moderately soluble   0.590
4044     mefenamic acid       241.3    sparingly soluble    0.588
896      Melatonin            232.3    moderately soluble   0.586
4485     nifedipine           346.3    sparingly soluble    0.576
5280953  HARMINE              212.3    sparingly soluble    0.570
5284371  codeine              299.4    moderately soluble   0.561
445154   resveratrol          228.2    sparingly soluble    0.561
5280961  genistein            270.2    sparingly soluble    0.560
5288826  morphine             285.3    moderately soluble   0.557
8549     Quininae             324.4    sparingly soluble    0.551

Filtered Semantic Search – Anti-inflammatory Agents#

We ran a semantic query for:

“anti-inflammatory agents”

…using search_by_semantics with two metadata filters:

{
    "molecular_weight": (200, 400),
    "solubility_tag": ["sparingly soluble", "moderately soluble"]
}

This combination retrieved molecules that are:

Semantically related to anti-inflammatory activity
Within a realistic drug-like mass range
Limited to medium or low aqueous solubility

Top Hits (Filtered)#

CID	Name	MW	Solubility	Score
3672	ibuprofen	206.3	sparingly soluble	0.642
54676228	piroxicam	331.4	sparingly soluble	0.626
3825	ketoprofen	254.3	sparingly soluble	0.625
3394	flurbiprofen	244.3	sparingly soluble	0.617
3826	ketorolac	255.3	sparingly soluble	0.614
…	…	…	…	…

Most top results are well-known NSAIDs (non-steroidal anti-inflammatory drugs), such as:

Ibuprofen
Naproxen
Ketoprofen
Piroxicam
Flurbiprofen

Also retrieved were compounds like:

Capsaicin – a topical analgesic with reported anti-inflammatory effects
Quercetin, Genistein, Resveratrol – plant polyphenols with anti-inflammatory potential
Scopolamine, Modafinil, Harmine – functionally related or literature-associated hits

Summary#

The semantic model correctly captured anti-inflammatory meaning.
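The filters narrowed the results to compounds with a molecular weight between 200 and 400 and sparing to moderate solubility. These constraints alone do not establish oral bioavailability.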
This search illustrates how combining semantic queries with physicochemical constraints can support targeted molecule exploration even in small curated sets.

This demonstrates a practical use case for medicinal chemistry, formulation profiling, or compound triage.

What’s Next for `robotu-molkit`#

The current release of robotu-molkit already supports advanced semantic search, molecular summarization, and FAISS-based local querying — all powered by IBM Granite models.

But the next evolution is coming soon.

Coming Soon: Natural Language–Driven Code Generation#

The next release will introduce support for:

granite-instruct-code
An IBM Granite model designed to generate and execute code in response to natural-language queries.

With this, robotu-molkit will be able to:

Interpret free-form queries — even if they don’t contain explicit filters or numbers.
Infer filters automatically — such as molecular weight ranges, solubility classes, or structural constraints.
Generate and run internal code — directly from natural language, with no manual configuration needed.

Example (Future Behavior)#

Input:

“Show me small analgesics with moderate solubility”

The library will:

Enrich the query with granite-instruct.
Use granite-instruct-code to:
- Interpret “small” as molecular_weight < 300
- Interpret “moderate solubility” as a tag filter
- Combine everything into a valid search operation
Execute the search internally and return ranked results — all from one sentence.

Why It Matters#

This makes robotu-molkit:

Even more accessible to scientists who think in goals, not filters
Able to automatically convert text into queries
A foundation for intelligent chemical exploration interfaces

Stay tuned — natural-language–driven molecule discovery is almost here.

RobotU’s Vision#

robotu-molkit is just the foundation of RobotU’s bigger vision: creating an AI-powered, quantum-ready simulation platform for chemists and chemical engineers.

Think of an experience like Shuri in Wakanda, interacting seamlessly with her AI in a holographic environment to virtually explore molecular structures, run instant simulations, and discover compounds—like the lost heart-shaped herb.

RobotU aims to bring this futuristic interaction closer to reality:

Natural language commands: Interact effortlessly using everyday language, powered by IBM Granite AI models.
Seamless integration: Automatic retrieval, filtering, and preparation of molecular data for simulations.
Quantum simulation: Leveraging libraries like Qiskit Nature and IBM Quantum systems to execute advanced molecular simulations.
Visual insights: Immediate visual feedback through an intuitive graphical interface, displaying simulation outcomes interactively.

robotu-molkit is our first step toward making molecule discovery and quantum simulations accessible to everyone—no supercomputers, vibranium, or holograms required (yet).