
Getting Started

In this example, we will show how to:

  • Find existing ML datasets on the QCArchive.

  • Extract geometries and quantum chemical results from the datasets.

  • Evaluate ML models on existing datasets.

  • Compute new data using your own instance of QCArchive.

The demonstration is organized into three ML examples:

  • Unsupervised learning: use manifold learning to understand the structure of the QM7b, QM7b-T, and SN2 Reaction datasets.

  • Supervised learning: train a kernel model to predict atomization energies with the ANI-1 dataset, and test it on a COMP6 benchmark.

  • Dataset creation: make a new dataset and compute DFT energy labels and neural network predictions.

Connect to the database

MolSSI

The Molecular Sciences Software Institute hosts the Quantum Chemistry Archive (QCArchive) and makes its data available to the entire Computational Molecular Sciences community free of charge. The QCArchive is both a database to view, analyze, and explore existing data as well as a live instance that continuously generates new data as directed by the community.

QCArchive

The primary interface to any QCArchive database is the qcportal Python package, which can be installed via pip (pip install qcportal) or conda (conda install qcportal -c conda-forge).

(Documentation: http://docs.qcarchive.molssi.org/projects/QCPortal/en/stable/)

The primary interface to a database server is a FractalClient. We can connect to api.qcarchive.molssi.org to get access to all data contained within the MolSSI server.

[2]:
import numpy as np
import pandas as pd

import qcportal as ptl

client = ptl.FractalClient(address="api.qcarchive.molssi.org")
client
[2]:

FractalClient

  • Server:   The MolSSI QCArchive Server
  • Address:   https://api.qcarchive.molssi.org/
  • Username:   None

Exploring collections

We organize datasets into “collections”. QCArchive hosts many data collections, including benchmarks for electronic structure development and PES scans for force field fitting, in addition to the ML datasets discussed earlier. The list_collections function returns all of the collections on the server:

[3]:
client.list_collections()
[3]:
collection           name                                     tagline
Dataset              ANI-1                                    22 million off-equilibrium conformations and e...
                     COMP6 ANI-MD                             Benchmark containing MD trajectories from the ...
                     COMP6 DrugBank                           Benchmark containing DrugBank off-equilibrium ...
                     COMP6 GDB10to13                          Benchmark containing off-equilibrium molecules...
                     COMP6 GDB7to9                            Benchmark containing off-equilibrium molecules...
...                  ...                                      ...
TorsionDriveDataset  OpenFF Primary TorsionDrive Benchmark 1  None
                     OpenFF Substituted Phenyl Set 1          None
                     Pfizer Discrepancy Torsion Dataset 1     None
                     SMIRNOFF Coverage Torsion Set 1          None
                     TorsionDrive Paper                       None

82 rows × 1 columns

Here we focus on machine learning datasets, searching for collections with the “machine learning” tag:

[4]:
client.list_collections(tag="machine learning")
[4]:
collection  name                        tagline
Dataset     ANI-1                       22 million off-equilibrium conformations and e...
            COMP6 ANI-MD                Benchmark containing MD trajectories from the ...
            COMP6 DrugBank              Benchmark containing DrugBank off-equilibrium ...
            COMP6 GDB10to13             Benchmark containing off-equilibrium molecules...
            COMP6 GDB7to9               Benchmark containing off-equilibrium molecules...
            COMP6 S66x8                 Benchmark for noncovalent interactions.
            COMP6 Tripeptides           Benchmark containing off-equilibrium geometrie...
            G-SchNet Generated          Molecules generated by G-SchNet, trained on QM9.
            GDB13-T                     Small organic molecules with up to 13 heavy at...
            GDML                        Molecular dynamics trajectories of small molec...
            ISO-17                      Molecular dynamics trajectories of isomers of ...
            QM7                         Small organic molecules with up to 7 heavy atoms.
            QM7b                        Small organic molecules with up to 7 heavy ato...
            QM7b-T                      Small organic molecules with up to 7 heavy ato...
            QM9                         Small organic molecules with up to 9 heavy ato...
            SN2 Reactions               Chemical reactions of methyl halides with hali...
            Solvated Protein Fragments  Amons derived from proteins, in water.

These are the same ML datasets that are listed on the QCArchive website:

https://qcarchive.molssi.org/apps/ml_datasets
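
Because the listing comes back as a pandas DataFrame, it can also be searched locally. The following is a minimal sketch, assuming the listing has a two-level index named (collection, name) and a single tagline column, as the output above suggests. The helper find_collections is hypothetical and not part of qcportal, and the small DataFrame is a stand-in built from three rows shown above so that the example runs without a server connection.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by client.list_collections(tag="machine learning").
# The index level names are an assumption based on the printed output.
listing = pd.DataFrame(
    {
        "tagline": [
            "Small organic molecules with up to 7 heavy atoms.",
            "Benchmark for noncovalent interactions.",
            "Amons derived from proteins, in water.",
        ]
    },
    index=pd.MultiIndex.from_tuples(
        [
            ("Dataset", "QM7"),
            ("Dataset", "COMP6 S66x8"),
            ("Dataset", "Solvated Protein Fragments"),
        ],
        names=["collection", "name"],
    ),
)


def find_collections(listing, text):
    """Return rows whose name or tagline contains `text` (case-insensitive)."""
    names = listing.index.get_level_values("name")
    in_name = np.asarray(names.str.contains(text, case=False, regex=False), dtype=bool)
    # Taglines may be missing (None), so treat those as empty strings.
    taglines = listing["tagline"].fillna("")
    in_tagline = taglines.str.contains(text, case=False, regex=False).to_numpy(dtype=bool)
    return listing[in_name | in_tagline]


hits = find_collections(listing, "benchmark")
print(hits)
```

With a live connection, the same helper could be applied directly to the result of client.list_collections().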


Looking at the QM7b dataset

Collections can be obtained from the server with get_collection. A collection object is lightweight, initially containing only metadata, so even extremely large datasets (such as ANI-1) can be pulled in a few seconds. For this example, we will start with the QM7b dataset. The print_info helper used here is defined in the Extras section at the end of this notebook. To obtain this collection:

[6]:
qm7b = client.get_collection("dataset", "qm7b")
print_info(qm7b)
Name: QM7b

Data Points: 7211
Elements: ['C', 'H', 'Cl', 'N', 'O', 'S']
Labels: ['atomization energy', 'excitation energy', 'lumo', 'ionization potential', 'electron affinity', 'polarizability', 'absorption intensity', 'homo']
Description: Small organic molecules with up to 7 heavy atoms, sampled from GDB-13 and optimized at the PBE0 level of theory. Ground and excited state properties are evaluated at the PBE0, ZINDO, and GW levels of theory. This dataset is also available on quantum-machine.org and qmml.org.
Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. JACS, 2009, 131, 8732-8733.
Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R. & Von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys., 2013, 15, 095003.

Getting molecules

Datasets contain two types of data:

  • Molecules: Representations of molecules, containing atoms, geometry, charge, spin, fragments, etc.

  • Values: Properties of those molecules, such as the B3LYP/6-31G* energy.

We can access the molecules in a dataset with the get_molecules function. This function returns a pandas DataFrame of Molecule objects.

[7]:
qm7b_mols = qm7b.get_molecules()
qm7b_mols
[7]:
index  molecule
0      Geometry (in Angstrom), charge = 0.0, mult...
1      Geometry (in Angstrom), charge = 0.0, mult...
10     Geometry (in Angstrom), charge = 0.0, mult...
100    Geometry (in Angstrom), charge = 0.0, mult...
1000   Geometry (in Angstrom), charge = 0.0, mult...
...    ...
995    Geometry (in Angstrom), charge = 0.0, mult...
996    Geometry (in Angstrom), charge = 0.0, mult...
997    Geometry (in Angstrom), charge = 0.0, mult...
998    Geometry (in Angstrom), charge = 0.0, mult...
999    Geometry (in Angstrom), charge = 0.0, mult...

7211 rows × 1 columns
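
The row order in this output (0, 1, 10, 100, 1000, ..., 999) suggests that the index labels are strings sorted lexicographically rather than integers. The sketch below assumes exactly that and shows one way to reorder such a DataFrame numerically. The toy frame and its placeholder values are illustrative, not data from QM7b.

```python
import pandas as pd

# Toy frame whose string index is in lexicographic order, mimicking the output above.
mols = pd.DataFrame(
    {"molecule": ["mol-0", "mol-1", "mol-10", "mol-100", "mol-2"]},
    index=pd.Index(["0", "1", "10", "100", "2"], name="index"),
)

# Reorder rows by the integer value of each label; the labels themselves stay strings,
# so later lookups such as mols.loc["10"] keep working.
ordered = mols.loc[sorted(mols.index, key=int)]
print(ordered)
```

If the index of the real dataset turns out to hold integers already, DataFrame.sort_index() is sufficient.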

Visualizing molecules

Individual Molecule objects can be directly displayed in a Jupyter notebook:

[8]:
qm7b_mols["molecule"][100]  # get the element in column "molecule", row 100
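
Beyond displaying a molecule, ML workflows usually need its coordinates as arrays. The sketch below assumes, as in QCElemental's Molecule model, that a molecule exposes its coordinates through a geometry attribute holding an (natoms, 3) array in Bohr. It converts such an array to Angstrom and builds the pairwise distance matrix. The two-atom geometry is an illustrative stand-in, not data from QM7b, and distance_matrix_angstrom is a hypothetical helper.

```python
import numpy as np

BOHR_TO_ANGSTROM = 0.529177210903  # CODATA 2018 Bohr radius in Angstrom


def distance_matrix_angstrom(geometry_bohr):
    """Pairwise interatomic distances in Angstrom from an (natoms, 3) array in Bohr."""
    xyz = np.asarray(geometry_bohr, dtype=float) * BOHR_TO_ANGSTROM
    # Broadcasting yields an (natoms, natoms, 3) array of displacement vectors.
    diff = xyz[:, None, :] - xyz[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Illustrative stand-in for a molecule's geometry: two atoms 1.4 Bohr apart along z.
geometry = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])
dist = distance_matrix_angstrom(geometry)
print(dist)
```

Under the stated assumption, the same function could be called as distance_matrix_angstrom(qm7b_mols["molecule"][100].geometry).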

Listing values

The list_values function shows what data are available in a collection. QM7b is fairly data rich, containing ground and excited state properties at the ZINDO, DFT, and GW levels.

[9]:
qm7b.list_values()
[9]:
native  driver          program   method                     basis        keywords     name
False   e1              Orca      ZINDO                      Unknown      Unknown      First excitation energy (ZINDO)
        ea              Orca      ZINDO/s                    Unknown      Unknown      Electron affinity (ZINDO/s)
        emax            Orca      ZINDO                      Unknown      Unknown      Excitation energy at maximal absorption (ZINDO)
        energy          FHI-aims  pbe0                       Unknown      Unknown      Atomization energy (DFT/PBE0)
        homo            FHI-aims  GW                         Unknown      Unknown      Highest occupied molecular orbital (GW)
                                  pbe0                       Unknown      Unknown      Highest occupied molecular orbital (PBE0)
                        Orca      ZINDO/s                    Unknown      Unknown      Highest occupied molecular orbital (ZINDO/s)
        imax            Orca      ZINDO                      Unknown      Unknown      Maximal absorption intensity (ZINDO)
        ip              Orca      ZINDO/s                    Unknown      Unknown      Ionization potential (ZINDO/s)
        lumo            FHI-aims  GW                         Unknown      Unknown      Lowest unoccupied molecular orbital (GW)
                                  pbe0                       Unknown      Unknown      Lowest unoccupied molecular orbital (PBE0)
                        Orca      ZINDO/s                    Unknown      Unknown      Lowest unoccupied molecular orbital (ZINDO/s)
        polarizability  FHI-aims  Self-consistent screening  Unknown      Unknown      Polarizability (self-consistent screening)
                                  pbe0                       Unknown      Unknown      Polarizability (DFT/PBE0)
True    energy          psi4      b2plyp                     aug-cc-pvdz  scf_default  B2PLYP/aug-cc-pvdz
                                                             aug-cc-pvtz  scf_default  B2PLYP/aug-cc-pvtz
                                                             def2-svp     scf_default  B2PLYP/def2-svp
                                                             def2-tzvp    scf_default  B2PLYP/def2-tzvp
                                                             sto-3g       scf_default  B2PLYP/sto-3g
                                  b3lyp                      aug-cc-pvdz  scf_default  B3LYP/aug-cc-pvdz
                                                             aug-cc-pvtz  scf_default  B3LYP/aug-cc-pvtz
                                                             def2-svp     scf_default  B3LYP/def2-svp
                                                             def2-tzvp    scf_default  B3LYP/def2-tzvp
                                                             sto-3g       scf_default  B3LYP/sto-3g
                                  wb97m-v                    aug-cc-pvdz  scf_default  WB97M-V/aug-cc-pvdz
                                                             aug-cc-pvtz  scf_default  WB97M-V/aug-cc-pvtz
                                                             def2-svp     scf_default  WB97M-V/def2-svp
                                                             def2-tzvp    scf_default  WB97M-V/def2-tzvp
                                                             sto-3g       scf_default  WB97M-V/sto-3g

Getting values

The get_values function pulls a data column down from the server. Values may be filtered by any of the fields described in list_values, including driver, program, method, basis, and name. Here, we show all calculations performed with the B3LYP functional, with energies reported in hartree.

[10]:
qm7b.units = "hartree"
qm7b.get_values(method="b3lyp")
[10]:
       B3LYP/sto-3g  B3LYP/aug-cc-pvdz  B3LYP/def2-tzvp  B3LYP/def2-svp  B3LYP/aug-cc-pvtz
0      -40.0392      -40.5206           -40.5375         -40.4878        -40.5383
1      -78.8861      -79.8361           -79.8638         -79.7717        -79.8645
10     -131.067      -132.771           -132.81          -132.655        -132.808
100    -207.419      -210.09            -210.147         -209.908        -210.146
1000   -281.606      -285.326           -285.403         -285.072        -285.4
...    ...           ...                ...              ...             ...
995    -282.86       -286.575           -286.651         -286.32         -286.649
996    -282.895      -286.656           -286.734         -286.405        -286.731
997    -282.861      -286.576           -286.652         -286.321        -286.65
998    -284.106      -287.805           -287.882         -287.549        -287.881
999    -284.143      -287.884           -287.963         -287.633        -287.961

7211 rows × 5 columns
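
Since the values arrive as a DataFrame with one column per method/basis combination, comparing columns is a one-line operation. As a minimal sketch, the code below takes the first two rows of the table above (three of the five columns, energies in hartree) and expresses each column as a difference from the aug-cc-pvtz result in kcal/mol. The conversion factor is an approximate literature value, and treating aug-cc-pvtz as the reference is an illustrative choice.

```python
import pandas as pd

HARTREE_TO_KCAL_PER_MOL = 627.509474  # approximate conversion factor

# First two rows of the B3LYP table shown above (energies in hartree).
energies = pd.DataFrame(
    {
        "B3LYP/sto-3g": [-40.0392, -78.8861],
        "B3LYP/def2-svp": [-40.4878, -79.7717],
        "B3LYP/aug-cc-pvtz": [-40.5383, -79.8645],
    },
    index=["0", "1"],
)

# Subtract the reference column from every column (row by row), then convert units.
reference = energies["B3LYP/aug-cc-pvtz"]
basis_error = energies.sub(reference, axis=0) * HARTREE_TO_KCAL_PER_MOL
print(basis_error.round(1))
```

The same two lines apply unchanged to the full DataFrame returned by qm7b.get_values(method="b3lyp").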

Summary of QCPortal commands


QCPortal is a Python interface to ML data hosted by MolSSI’s QCArchive. Accessing the data requires only five commands.

client = ptl.FractalClient()
ds = client.get_collection("dataset", name)
ds.get_molecules()
ds.list_values()
ds.get_values()
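
These calls compose naturally into a small helper. The sketch below is hypothetical (load_labels is not part of qcportal) and assumes that get_molecules and get_values both return DataFrames sharing the same index, as the outputs earlier in this notebook suggest. It takes an already-connected client, so it does not open a connection itself.

```python
def load_labels(client, name, **filters):
    """Fetch molecules and values for one dataset and join them row by row.

    `client` is expected to behave like a qcportal FractalClient.
    `filters` are forwarded to get_values, for example method="b3lyp".
    """
    ds = client.get_collection("dataset", name)
    molecules = ds.get_molecules()
    values = ds.get_values(**filters)
    # Join on the shared index so that each row pairs a molecule with its labels.
    return molecules.join(values)


# Example use with a live connection (requires qcportal and network access):
#   import qcportal as ptl
#   client = ptl.FractalClient()
#   table = load_labels(client, "qm7b", method="b3lyp")
```

Keeping the client as an argument makes the helper easy to exercise against a stub object when no server is available.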

Extras

[ ]:
from IPython.display import HTML, display

def print_info(dataset):
    """Print a dataset's name, size, elements, and labels, then render its description and citations."""
    print(f"Name: {dataset.data.name}")
    print()
    print(f"Data Points: {dataset.data.metadata['data_points']}")
    print(f"Elements: {dataset.data.metadata['elements']}")
    print(f"Labels: {dataset.data.metadata['labels']}")

    # The description and citations contain HTML markup, so render them rather than print them.
    display(HTML("<u>Description:</u> " + dataset.data.description))

    for cite in dataset.data.metadata["citations"]:
        display(HTML(cite['acs_citation']))