🚀 Step-by-step guide

This notebook introduces RobotU Molkit, a Python toolkit that streamlines chemical data processing by combining AI-powered enrichment with similarity search.

robotu-molkit connects directly to PubChem to fetch and structure raw compound data, then leverages IBM Granite models for advanced semantic processing. This includes:

  • 🧪 Molecule Summarization: Automatically generate clear, concise natural-language summaries of chemical compounds.

  • 🧠 Semantic Embedding Generation: Transform structured chemical data into high-dimensional embeddings optimized for similarity search.

  • 🔍 Canonical Query Interpretation: Understand and standardize natural-language queries, enabling smarter search workflows.

  • 🔗 Similarity Search: Perform deep similarity analysis using both semantic embeddings and structural metrics (e.g., Tanimoto similarity).

  • ⚡ Optional FAST Mode: Use lightweight embeddings for speed-critical tasks or when computational resources are limited.
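For readers new to the structural metric mentioned in the list above: Tanimoto similarity between two fingerprints is the number of bits set in both, divided by the number of bits set in either. The sketch below illustrates the formula on hand-made bit sets. It is not robotu-molkit's implementation, and the function name and example fingerprints are hypothetical.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)          # bits set in both fingerprints
    return shared / len(fp_a | fp_b)   # divided by bits set in either


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8}
score = tanimoto(fp_1, fp_2)
```

A score of 1.0 means identical fingerprints and 0.0 means no bits in common.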

This notebook will walk through the core functionalities of the library, from ingesting molecular records to performing semantic embedding and search. Whether you’re building a searchable compound database, analyzing structure–activity relationships, or simply exploring AI-enhanced cheminformatics, RobotU Molkit offers a modular and powerful foundation.

⚙️ Before running the examples, make sure to install the package using:
pip install robotu-molkit

⚠️ Before you begin…

To use RobotU Molkit’s AI features, you must first configure your IBM watsonx credentials (API Key and Project ID).

If you haven’t done that yet, follow the step-by-step guide here:

➡️ Set up your watsonx credentials →

This is required before running semantic enrichment or vector-based search.

!pip install robotu-molkit

Step 1 – Ingest 91 Molecules from PubChem

The command below uses the molkit ingest subcommand to fetch compound records from PubChem by CID (Compound ID) and parse them into structured Molecule JSON files ready for enrichment and analysis. The file molecule_cids.txt lists 91 CIDs, one per line. You can find the corresponding compound names in molecules_cids_with_names.csv.

!molkit ingest --file "molecule_cids.txt" --concurrency 2

What This Does

  • Downloads raw 3D compound records from PubChem using the PUG-REST API.

  • Parses each record into a Molecule JSON object (with schema-compliant fields).

  • Stores both raw and parsed data in designated folders under data/.
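For orientation, the download step corresponds to a PUG-REST request of roughly the following shape. This is a sketch that assumes the public PubChem endpoint for full compound records with 3D coordinates. The helper name build_record_url is hypothetical, and the exact request robotu-molkit issues may differ.

```python
from urllib.parse import urlencode

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_record_url(cid: int, record_type: str = "3d") -> str:
    """Build a PUG-REST URL for a full compound record in JSON (hypothetical helper)."""
    if cid <= 0:
        raise ValueError("CID must be a positive integer")
    query = urlencode({"record_type": record_type})
    return f"{PUG_REST_BASE}/compound/cid/{cid}/record/JSON?{query}"


# CID 2244 is one of the compounds ingested in this notebook
url = build_record_url(2244)
```

Opening the resulting URL in a browser is a quick way to see what a raw record looks like before it is parsed.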

Available Options

| Option | Description |
|---|---|
| --file, -f | Path to a text file with one CID per line. Useful for bulk ingestion. |
| --concurrency, -c | Number of parallel workers fetching data (default: 5). You may reduce this to avoid hitting PubChem’s rate limits (max 5 requests/sec, 400/min). |
| --raw-dir, -r | Directory to store raw JSON files from PubChem (default: data/downloaded_data). |
| --parsed-dir, -p | Directory to store parsed Molecule JSON payloads (default: data/parsed). |
| CID(s) (positional) | You can also pass one or more CIDs directly as arguments instead of using --file. |

⚠️ Note: Either --file or positional CID(s) must be provided. If none are supplied, the CLI will exit with an error.
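If your CIDs live in a Python list, a file in the format expected by --file (one CID per line) can be produced along these lines. The helper names are hypothetical. The sketch relies only on the one-CID-per-line format described in the table and writes to a temporary directory so it has no side effects.

```python
import tempfile
from pathlib import Path


def write_cid_file(cids, path):
    """Write unique, positive integer CIDs to a text file, one per line."""
    unique = []
    for cid in cids:
        cid = int(cid)
        if cid <= 0:
            raise ValueError(f"invalid CID: {cid}")
        if cid not in unique:  # drop duplicates, keep first-seen order
            unique.append(cid)
    Path(path).write_text("\n".join(str(c) for c in unique) + "\n", encoding="utf-8")
    return unique


def read_cid_file(path):
    """Read CIDs back from the file, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [int(line) for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    cid_path = Path(tmp) / "my_cids.txt"  # hypothetical file name
    written = write_cid_file([2244, 5957, 2244, 2519], cid_path)
    round_trip = read_cid_file(cid_path)
```

Removing duplicates before ingestion avoids spending requests on records you already have.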

Example with Direct CIDs

!molkit ingest 2244 5957 2519 --concurrency 3

Rate Limits and Reliability

To respect PubChem’s API limits:

  • Maximum 5 requests per second

  • Maximum 400 requests per minute

  • Each request has a timeout of 30 seconds

Using --concurrency 2 is a safe option that balances performance and reliability, especially when fetching a large number of compounds.
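If you call PubChem from your own scripts, the per-second limit listed above can be approximated with a simple client-side throttle that spaces requests evenly. The following is a minimal single-threaded sketch, not a description of how molkit enforces limits internally. The class name and its parameters are hypothetical.

```python
import time


class MinIntervalThrottle:
    """Space out calls so consecutive calls start at least 1/max_per_second apart."""

    def __init__(self, max_per_second: float = 5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot relative to when this call was allowed to start
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


throttle = MinIntervalThrottle(max_per_second=5)
```

Call throttle.wait() before each request. With a spacing of 0.2 seconds, a single caller issues at most about 300 requests per minute, which also stays under the 400 per minute limit. The class is not thread-safe, so parallel workers would need a lock or a shared queue.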


In the next step, we’ll enrich these molecules with natural-language summaries and semantic embeddings.

!molkit ingest --file "molecule_cids.txt" --concurrency 2
INFO: Starting ingest of 91 CIDs...
Procesado registro CID 2519
Procesado registro CID 5429
Procesado registro CID 2153
Procesado registro CID 4534
Procesado registro CID 4744
Procesado registro CID 3758
Procesado registro CID 89594
Procesado registro CID 8378
Procesado registro CID 7028
Procesado registro CID 118943
Procesado registro CID 5816
Procesado registro CID 439260
Procesado registro CID 6041
Procesado registro CID 4236
Procesado registro CID 2244
Procesado registro CID 3672
Procesado registro CID 1983
Procesado registro CID 156391
Procesado registro CID 3033
Procesado registro CID 3825
Procesado registro CID 3715
Procesado registro CID 2662
Procesado registro CID 54676228
Procesado registro CID 3306
Procesado registro CID 3394
Procesado registro CID 4781
Procesado registro CID 4055
Procesado registro CID 4485
Procesado registro CID 4783
Procesado registro CID 4044
Procesado registro CID 5345
Procesado registro CID 5120
Procesado registro CID 3826
Procesado registro CID 681
Procesado registro CID 5202
Procesado registro CID 774
Procesado registro CID 119
Procesado registro CID 33032
Procesado registro CID 187
Procesado registro CID 896
Procesado registro CID 1123
Procesado registro CID 239
Procesado registro CID 6047
Procesado registro CID 1001
Procesado registro CID 6057
Procesado registro CID 1150
Procesado registro CID 5901
Procesado registro CID 1030
Procesado registro CID 865
Procesado registro CID 439203
Procesado registro CID 612
Procesado registro CID 311
Procesado registro CID 176
Procesado registro CID 284
Procesado registro CID 243
Procesado registro CID 338
Procesado registro CID 1110
Procesado registro CID 971
Procesado registro CID 107735
Procesado registro CID 525
Procesado registro CID 54670067
Procesado registro CID 750
Procesado registro CID 5950
Procesado registro CID 6137
Procesado registro CID 5957
Procesado registro CID 5961
Procesado registro CID 232
Procesado registro CID 247
Procesado registro CID 1176
Procesado registro CID 586
Procesado registro CID 2236
Procesado registro CID 5184
Procesado registro CID 5288826
Procesado registro CID 5284371
Procesado registro CID 245005
Procesado registro CID 6167
Procesado registro CID 2353
Procesado registro CID 8549
Procesado registro CID 5280953
Procesado registro CID 5770
Procesado registro CID 4727
Procesado registro CID 969516
Procesado registro CID 445154
Procesado registro CID 5280343
Procesado registro CID 5280961
Procesado registro CID 9064
Procesado registro CID 1548943
Procesado registro CID 6989
Procesado registro CID 3314
Procesado registro CID 1183
Procesado registro CID 5281515
INFO: Done! Raw → data/downloaded_data | Parsed → data/parsed

Step 2 – Configuring Watsonx Credentials

Store your IBM Watsonx credentials locally using:

!molkit config --watsonx-api-key "your-api-key" --watsonx-project-id "your-project-id"

Credentials are stored at: ~/.config/molkit/config.json

These credentials are then globally available to all molkit commands in your environment.


Runtime Credential Resolution

If you need to resolve credentials dynamically (e.g. inside a script or notebook), use the CredentialsManager utility. It checks in the following order:

  1. CLI overrides

  2. Environment variables (IBM_API_KEY, IBM_PROJECT_ID, WATSONX_URL)

  3. Local config file (~/.config/molkit/config.json)
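The lookup order above can be pictured as a simple fallback chain. The sketch below is illustrative only and is not the CredentialsManager implementation. The function name and the JSON key names (api_key, project_id) are assumptions, while the environment variable names and the config path are the ones listed above.

```python
import json
import os
from pathlib import Path

CONFIG_PATH = Path.home() / ".config" / "molkit" / "config.json"


def resolve_credentials(cli_api_key=None, cli_project_id=None, env=None, config_path=CONFIG_PATH):
    """Return (api_key, project_id) with precedence CLI > environment > config file."""
    env = os.environ if env is None else env
    config = {}
    path = Path(config_path)
    if path.is_file():
        # The key names inside the file are assumed for this sketch
        config = json.loads(path.read_text(encoding="utf-8"))

    api_key = cli_api_key or env.get("IBM_API_KEY") or config.get("api_key")
    project_id = cli_project_id or env.get("IBM_PROJECT_ID") or config.get("project_id")
    if not api_key or not project_id:
        raise RuntimeError("watsonx credentials not found in CLI arguments, environment, or config file")
    return api_key, project_id
```

Each value is resolved independently, so an API key from the environment can be combined with a project ID from the config file.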


Usage Guide – CredentialsManager

from molkit.auth import CredentialsManager

# Load credentials
api_key, project_id = CredentialsManager.load()

# Load Watsonx service URL
watsonx_url = CredentialsManager.get_watsonx_url()

# Set or update values in ~/.config/molkit/config.json
CredentialsManager.set_api_key("your-api-key")
CredentialsManager.set_project_id("your-project-id")
CredentialsManager.set_watsonx_url("your-watsonx-url")

This is the preferred way to inject or retrieve credentials programmatically during embedding or search workflows.

!molkit config   --watsonx-api-key WATSONX_API_KEY   --watsonx-project-id WATSONX_PROJECT_ID
Credentials saved to /home/user/.config/molkit/config.json

Step 3 – Generate Semantic Embeddings with IBM Watsonx

Once the molecules are parsed from PubChem, the next step is enriching each entry using advanced semantic embedding models provided by IBM Granite. The molkit embed command generates concise natural-language summaries and semantic embeddings for each parsed molecule.

!molkit embed --parsed-dir "data/parsed" --out-dir "data/vectors" --fast

What Happens During Embedding?

  • Reads each structured Molecule JSON from the parsed-dir folder.

  • Generates:

    • A concise natural-language summary using the Granite Instruct model (ibm/granite-3-8b-instruct).

    • High-dimensional semantic embeddings to support similarity searches.

  • Stores these embeddings in a JSONL file (by default at data/vectors/watsonx_vectors.jsonl).
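Because the output is JSONL, each line can be parsed independently. A minimal reader might look like the sketch below. The helper name is hypothetical, and nothing here assumes particular field names, so inspect the keys of the first record in your own file to see what each entry contains.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from exc


# Example usage once the embed step has run:
# records = list(iter_jsonl("data/vectors/watsonx_vectors.jsonl"))
# print(len(records), sorted(records[0].keys()))
```

Reading lazily with a generator keeps memory use low if the vector file grows large.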


CLI Options for molkit embed

| Option | Description |
|---|---|
| --parsed-dir, -p | Directory containing parsed molecule JSON files (default: data/parsed). |
| --out-dir, -o | Directory to store the generated embeddings (default: data/vectors). |
| --model, -m | Embedding model ID (default: ibm/granite-embedding-278m-multilingual). |
| --fast | Use the faster, smaller model (ibm/granite-embedding-107m-multilingual). |
| --watsonx-api-key, -k | IBM Watsonx API key (overrides environment or configuration settings). |
| --watsonx-project-id, -j | IBM Watsonx Project ID (overrides environment or configuration settings). |
| --watsonx-url | Watsonx API endpoint URL (default: https://us-south.ml.cloud.ibm.com). |

⚠️ Important: IBM credentials are required and must be provided through the configuration file, environment variables, or explicitly via the --watsonx-api-key and --watsonx-project-id flags.


Choosing the Right Model

Default Mode (ibm/granite-embedding-278m-multilingual)

  • Embedding dimensions: 768

  • Advantages:

    • Higher precision and greater semantic accuracy.

    • Ideal for detailed chemical analysis, precise molecular similarity searches, drug discovery, and predictive modeling tasks.

  • Recommended for chemists who require detailed semantic insights and maximum embedding quality.

FAST Mode (ibm/granite-embedding-107m-multilingual)

  • Embedding dimensions: 384

  • Advantages:

    • Lower computational cost and faster execution times.

    • Best suited for quick exploratory analyses, high-throughput scenarios, or resource-constrained environments.

  • Recommended for initial testing or prototyping.


Example Commands

Default model usage (recommended for accuracy):

!molkit embed

FAST model usage (recommended for speed):

!molkit embed --fast

What’s Next?

The resulting watsonx_vectors.jsonl file contains a summary and a high-dimensional embedding for each molecule, ready for local vector search using faiss-cpu.
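Conceptually, vector search ranks stored embeddings by their similarity to a query embedding. The sketch below does this by brute force with cosine similarity in pure Python, as a stand-in for a FAISS index. The function names and the toy three-dimensional vectors are made up for illustration. Real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: hypothetical names and vectors
toy_vectors = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
}
ranking = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
```

Brute force is fine for a collection of this size. An index such as FAISS becomes worthwhile as the number of vectors grows.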

!molkit embed --watsonx-url="https://us-south.ml.cloud.ibm.com"
Embedding model: ibm/granite-embedding-278m-multilingual
✅  Embeddings written to data/vectors/watsonx_vectors.jsonl

What’s Next for robotu-molkit

The current release of robotu-molkit already supports advanced semantic search, molecular summarization, and FAISS-based local querying — all powered by IBM Granite models.

But the next evolution is coming soon.


Coming Soon: Natural Language–Driven Code Generation

The next release will introduce support for:

granite-instruct-code
An IBM Granite model designed to generate code in response to natural-language queries; robotu-molkit will then run that code internally.

With this, robotu-molkit will be able to:

  • Interpret free-form queries — even if they don’t contain explicit filters or numbers.

  • Infer filters automatically — such as molecular weight ranges, solubility classes, or structural constraints.

  • Generate and run internal code — directly from natural language, with no manual configuration needed.


Example (Future Behavior)

Input:

“Show me small analgesics with moderate solubility”

The library will:

  1. Enrich the query with granite-instruct.

  2. Use granite-instruct-code to:

    • Interpret “small” as molecular_weight < 300

    • Interpret “moderate solubility” as a tag filter

    • Combine everything into a valid search operation

  3. Execute the search internally and return ranked results — all from one sentence.
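To make the intended behavior concrete, the outcome of steps 2 and 3 could resemble a filter applied to molecule records before ranking. The following is purely an illustrative sketch of that idea, not the planned implementation. The record fields (molecular_weight, tags), the tag strings, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MoleculeRecord:
    """Hypothetical, simplified stand-in for a parsed molecule."""
    name: str
    molecular_weight: float
    tags: set = field(default_factory=set)


def apply_filters(records, max_weight=None, required_tags=()):
    """Keep records lighter than max_weight that carry every required tag."""
    required = set(required_tags)
    kept = []
    for record in records:
        if max_weight is not None and not record.molecular_weight < max_weight:
            continue  # fails the "small" constraint
        if not required <= record.tags:
            continue  # missing at least one required tag
        kept.append(record)
    return kept


records = [
    MoleculeRecord("mol_a", 180.0, {"analgesic", "moderate_solubility"}),
    MoleculeRecord("mol_b", 450.0, {"analgesic", "moderate_solubility"}),
    MoleculeRecord("mol_c", 150.0, {"analgesic"}),
]
# "small analgesics with moderate solubility" expressed as explicit filters
matches = apply_filters(records, max_weight=300, required_tags=["analgesic", "moderate_solubility"])
```

In the described workflow, the filter values would be inferred from the sentence rather than written by hand as they are here.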


Why It Matters

This makes robotu-molkit:

  • Even more accessible to scientists who think in goals, not filters

  • Able to automatically convert text into queries

  • A foundation for intelligent chemical exploration interfaces

Stay tuned — natural-language–driven molecule discovery is almost here.

RobotU’s Vision

robotu-molkit is just the foundation of RobotU’s bigger vision: creating an AI-powered, quantum-ready simulation platform for chemists and chemical engineers.

Think of an experience like Shuri in Wakanda, interacting seamlessly with her AI in a holographic environment to virtually explore molecular structures, run instant simulations, and discover compounds—like the lost heart-shaped herb.

RobotU aims to bring this futuristic interaction closer to reality:

  • Natural language commands: Interact effortlessly using everyday language, powered by IBM Granite AI models.

  • Seamless integration: Automatic retrieval, filtering, and preparation of molecular data for simulations.

  • Quantum simulation: Leveraging libraries like Qiskit Nature and IBM Quantum systems to execute advanced molecular simulations.

  • Visual insights: Immediate visual feedback through an intuitive graphical interface, displaying simulation outcomes interactively.

robotu-molkit is our first step toward making molecule discovery and quantum simulations accessible to everyone—no supercomputers, vibranium, or holograms required (yet).