Using the ChEMBL SQL Database for Molecular Data Curation

Sriram
3 min read · Nov 16, 2021

The ChEMBL database is a primary source of data for large-scale analysis in early drug discovery. I have been using it for the last three years, and accessing its data is usually seamless. Sometimes, however, getting data through the ChEMBL API is slow, and when you are still exploring what kind of data you need and how the query should look, it can take forever. A local installation of the ChEMBL database alleviates the issue and makes data curation faster and more workable. In this blog, I summarize how I use a local ChEMBL SQL database and the psycopg2 Python package to curate a list of all approved drugs.

What do you need to follow this blog?

To follow this blog, you need a local installation of the ChEMBL SQL cartridge. Follow the steps detailed in Iwatobipen’s blog and in the RDKit documentation.

Once you have PostgreSQL and the local ChEMBL database up and running, we are ready to go!

The command below starts a server that we can use to interact with the ChEMBL29 SQL database.

postgres -D ~/postgresdata
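
Before moving on, it can help to confirm that the server is accepting connections. The following is a minimal sketch, assuming the server listens on localhost at PostgreSQL's default port 5432; the helper name `is_port_open` and its defaults are illustrative assumptions, not part of any setup guide.

```python
import socket


def is_port_open(host="localhost", port=5432, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection raises OSError if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` should return `True` once the server is running. It only shows that something is listening on that port, not that the ChEMBL database itself has been loaded.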

Using local ChEMBL SQL database

Now we can use SQL to interact with the database. For this demo, let us extract all the approved drugs. Here is what that SQL query looks like:

SELECT DISTINCT m.chembl_id AS compound_chembl_id,
       s.canonical_smiles,
       r.compound_key
FROM compound_structures s
RIGHT JOIN molecule_dictionary m ON s.molregno = m.molregno
JOIN compound_records r ON m.molregno = r.molregno
     AND m.max_phase = 4;

Running the above query generates a long list of ~4000 molecules that are in Phase 4 (that is, approved).

The ChEMBL documentation has an excellent tutorial on the schema and on crafting SQL queries, which was very helpful for getting started. I am a SQL novice, and whatever I know comes from the Intro to SQL course on DataCamp. I highly recommend it for anyone who wants to learn the basics of SQL.

I prefer Python, since I do all my exploratory data analysis in it, so I wanted a way to interact with the SQL database from Python. Packages such as pychembldb and razi are summarized in Iwatobipen’s blogs and are great starting points. I found the psycopg2 package useful because it keeps the feel of writing SQL queries intact, and there is a ton of support for it on StackOverflow.

pip install psycopg2

Here is what the same SQL query looks like in Python using psycopg2:
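
The script embedded in the original post is not reproduced in this text, so what follows is only a minimal sketch of how such a script might look. The database name `chembl_29`, the connection settings, the output file name `approved_drugs.csv`, and all function names are assumptions to adapt to your own setup.

```python
import csv

# The query from above, with the phase passed as a parameter
# (%s is psycopg2's placeholder style).
APPROVED_DRUGS_QUERY = """
SELECT DISTINCT m.chembl_id AS compound_chembl_id,
       s.canonical_smiles,
       r.compound_key
FROM compound_structures s
RIGHT JOIN molecule_dictionary m ON s.molregno = m.molregno
JOIN compound_records r ON m.molregno = r.molregno
     AND m.max_phase = %s;
"""


def fetch_rows(conn, max_phase=4):
    """Run the query on an open connection; return (column names, rows)."""
    with conn.cursor() as cur:
        cur.execute(APPROVED_DRUGS_QUERY, (max_phase,))
        columns = [col[0] for col in cur.description]
        rows = cur.fetchall()
    return columns, rows


def write_csv(path, columns, rows):
    """Write a header line followed by one line per result row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)


def export_approved_drugs(path="approved_drugs.csv"):
    """Connect, run the query, and save the result; return the row count."""
    # Imported here so the helpers above can be used without psycopg2 installed.
    import psycopg2

    # Assumed connection settings: adjust dbname, host, port (and user) as needed.
    conn = psycopg2.connect(dbname="chembl_29", host="localhost", port=5432)
    try:
        columns, rows = fetch_rows(conn)
    finally:
        conn.close()
    write_csv(path, columns, rows)
    return len(rows)
```

With the server running, calling `export_approved_drugs()` would write the CSV to the current directory. Keeping the query as a plain SQL string is what preserves the feel of writing SQL directly.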

The above Python script generates a CSV file with the ChEMBL ID, SMILES string, and some more exciting information about all the approved drugs.
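
Because the query also selects compound_key, the same molecule can appear on several rows, one per distinct key. The following is a minimal sketch of reading the file back and keeping one entry per ChEMBL ID. It assumes the CSV has a header row with the column names from the query; the function name `load_unique_molecules` is hypothetical.

```python
import csv


def load_unique_molecules(path):
    """Map each ChEMBL ID to its SMILES, keeping the first row seen per ID."""
    molecules = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # setdefault leaves an existing entry untouched, so repeats are ignored
            molecules.setdefault(row["compound_chembl_id"], row["canonical_smiles"])
    return molecules
```

Molecules without a structure record (the RIGHT JOIN keeps them) would show up here with an empty SMILES string, so it may be worth filtering those out before any structure-based analysis.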

We can then use this data to do any analysis we want. I hope you found this helpful.
