Using ChEMBL-SQL database for Molecular Data Curation

Nov 16, 2021 (3 min read)

The ChEMBL database is the primary source of data for large-scale analysis in early drug discovery. I have been using it for the last three years, and accessing the data is usually seamless. Sometimes, though, getting data through the ChEMBL API is slow, and when you are still exploring what kind of data you need and how the query should look, it can take forever. A local installation of the ChEMBL database alleviates the issue and makes data curation faster and more workable. In this blog, I summarize how I use the local ChEMBL SQL database and the psycopg2 Python package to curate a list of all approved drugs.

What do you need to follow this blog?

To follow this blog, you need a local PostgreSQL installation of the ChEMBL database with the RDKit cartridge. Follow the steps detailed in Iwatobipen’s blog and in the RDKit documentation.

Once PostgreSQL and the local ChEMBL database are installed, we are ready to go!

The command below starts the PostgreSQL server that we will use to interact with the ChEMBL29 SQL database.

postgres -D ~/postgresdata

Using local ChEMBL SQL database

Now we can use SQL to interact with the database. For this demo, let us extract all the approved drugs. Here is what that SQL query looks like:

SELECT DISTINCT m.chembl_id AS compound_chembl_id,
       s.canonical_smiles,
       r.compound_key
FROM compound_structures s
RIGHT JOIN molecule_dictionary m ON s.molregno = m.molregno
JOIN compound_records r ON m.molregno = r.molregno
WHERE m.max_phase = 4;

When you run the above query, it returns a long list of ~4000 molecules that are in Phase 4 (that is, approved).

The ChEMBL documentation has an excellent tutorial on the schema and on crafting SQL queries, and it was very helpful for getting started. I am a SQL novice, and whatever I know comes from the Intro to SQL course on DataCamp. I highly recommend it for folks who want to learn the basics of SQL.

I prefer Python, since that is where I do all my exploratory data analysis, so I wanted to figure out a way to interact with the SQL database from Python. Packages such as pychembldb and razi are summarized in Iwatobipen’s blogs and are great starting points. I found the psycopg2 package useful because it keeps the feel of writing SQL queries intact, and there is a ton of support for it on Stack Overflow.

pip install psycopg2

Here is what the same SQL query looks like in Python using psycopg2:
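The original embedded script is not reproduced in this text, so what follows is only a minimal sketch of the idea, not the exact script. It runs the approved-drugs query through psycopg2 and writes the rows to a CSV file. The function names (`fetch_approved_drugs`, `write_csv`, `export_approved_drugs`) are hypothetical, and the connection settings you pass (database name, host, user) depend on how your local ChEMBL database was set up.

```python
import csv

# Same query as above; the phase is passed as a parameter (%s is psycopg2's placeholder).
APPROVED_DRUGS_SQL = """
SELECT DISTINCT m.chembl_id AS compound_chembl_id,
       s.canonical_smiles,
       r.compound_key
FROM compound_structures s
RIGHT JOIN molecule_dictionary m ON s.molregno = m.molregno
JOIN compound_records r ON m.molregno = r.molregno
WHERE m.max_phase = %s;
"""


def fetch_approved_drugs(conn, max_phase=4):
    """Run the query on an open connection and return (column names, rows)."""
    with conn.cursor() as cur:
        cur.execute(APPROVED_DRUGS_SQL, (max_phase,))
        # The first field of each description entry is the column name.
        columns = [desc[0] for desc in cur.description]
        return columns, cur.fetchall()


def write_csv(path, columns, rows):
    """Write a header line followed by one line per result row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)


def export_approved_drugs(csv_path, **connect_kwargs):
    """Connect with psycopg2, run the query, save the CSV, return the row count."""
    import psycopg2  # third-party; imported here so the helpers above load without it

    conn = psycopg2.connect(**connect_kwargs)
    try:
        columns, rows = fetch_approved_drugs(conn)
    finally:
        conn.close()
    write_csv(csv_path, columns, rows)
    return len(rows)
```

With the PostgreSQL server running, a call such as `export_approved_drugs("approved_drugs.csv", dbname="chembl_29")` would write the file; both the file name and the database name here are assumptions, so substitute your own.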

The Python script generates a CSV file containing the ChEMBL ID, the SMILES string, and some additional information for all the approved drugs.

We can then use this data to do any analysis we want. I hope you found this helpful.
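As one small example of a first analysis step, the sketch below collapses the exported rows to one entry per ChEMBL ID. This can be handy because `compound_records` may hold several records for the same molecule, so the query can return more than one row per drug. The sketch assumes a CSV with a header line and the column names from the SQL query above (`compound_chembl_id`, `canonical_smiles`, `compound_key`); the function name `group_by_chembl_id` is hypothetical.

```python
import csv
from collections import defaultdict


def group_by_chembl_id(csv_path):
    """Map each ChEMBL ID to its SMILES and the sorted compound keys seen for it."""
    smiles = {}
    keys = defaultdict(set)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            chembl_id = row["compound_chembl_id"]
            smiles[chembl_id] = row["canonical_smiles"]
            if row["compound_key"]:  # skip empty keys
                keys[chembl_id].add(row["compound_key"])
    return {
        chembl_id: {"smiles": smi, "compound_keys": sorted(keys[chembl_id])}
        for chembl_id, smi in smiles.items()
    }
```

The length of the returned dictionary is the number of unique molecules in the file, which is a quick sanity check against the row count of the raw export.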

References

5 Tips for Embedding Tables in Your Medium Posts

Data Publishing in the “Modern” Age

medium.com

Embed a GitHub Gist Code Snippet in a Medium Article

Make Your Code Snippets Shine on Medium Scenario You are writing a Medium article, and wish to include code snippets…

www.linkedin.com

Install ChEMBL28 & rdkit cartridge #chemoinformatics #RDKit

Recently ChEMBL 28 was released. It's good news for chemoinformaticitan and time to update your chembldb ;) Of course I…

iwatobipen.wordpress.com

Communicate ChEMBL27 with rdkit postgres cartridge and sqlalchemy #RDKit #ChEMBL #Postgres #razi

As you know ChEMBL 27 was released recently, thanks great effort for EBI ;)…

iwatobipen.wordpress.com

The RDKit database cartridge - The RDKit 2021.09.1 documentation

This document is a tutorial and reference guide for the RDKit PostgreSQL cartridge. If you find mistakes, or have…

www.rdkit.org

Schema Questions and SQL Examples

ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest

chembl.gitbook.io


Written by Sriram
