Distributional Graphormer:

Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning

Shuxin Zheng*†1, Jiyan He†1, Chang Liu*†1, Yu Shi†1, Ziheng Lu†1, Weitao Feng1, Fusong Ju1, Jiaxi Wang1, Jianwei Zhu1, Yaosen Min1, He Zhang1, Shidi Tang1, Hongxia Hao1, Peiran Jin1, Chi Chen2, Frank Noé1, Haiguang Liu*†1, and Tie-Yan Liu*1

1Microsoft Research AI4Science.
2Microsoft Quantum.

Abstract

Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as chemical graph or protein sequence. This framework enables efficiently generating diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.

Protein conformation sampling and distribution in 2D subspace

DiG generated structures for two SARS-CoV-2 proteins: the receptor-binding domain (RBD) of the spike protein and the main protease (3CL protease). The contour lines show the distribution of structures from millisecond-scale molecular dynamics simulations. Experimentally determined structures (from the Protein Data Bank) and structures predicted by AlphaFold are marked on the 2D maps. DiG-generated structures are plotted on the same maps as dots.

Slide the bar to check the conformational space coverage as more structures are being generated.


Structure distributions for RBD

In the case of RBD, MD simulations reveal four clusters. Sliding the bar will reveal the distribution of DiG generated structures on this contour map. There is good coverage of the MD simulation structures in this 2D subspace. With 10,000 generated structures, the coverage ratio is about 72%. AlphaFold predicted structures are distributed in the lower right region.
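The page does not state how the coverage ratio is computed. One plausible definition, used here purely as an assumption, is the fraction of 2D grid cells occupied by MD structures that also contain at least one generated structure. The function names, the cell size, and the toy coordinates below are all hypothetical.

```python
import math

def occupied_cells(points, cell_size):
    """Map 2D points to the set of grid cells (integer index pairs) they fall in."""
    return {(math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points}

def coverage_ratio(reference, generated, cell_size=1.0):
    """Fraction of reference-occupied cells that contain at least one generated point."""
    ref_cells = occupied_cells(reference, cell_size)
    if not ref_cells:
        raise ValueError("reference set is empty")
    gen_cells = occupied_cells(generated, cell_size)
    return len(ref_cells & gen_cells) / len(ref_cells)

# Toy data (not from the paper): the reference points occupy four cells,
# and the generated points land in three of them.
md_points = [(0.2, 0.3), (1.5, 0.4), (0.1, 1.7), (1.6, 1.9)]
dig_points = [(0.8, 0.9), (1.1, 0.2), (1.2, 1.3), (5.0, 5.0)]
print(coverage_ratio(md_points, dig_points))  # 0.75
```

Under this definition the ratio can only grow as more structures are generated, which matches the behaviour of the slider.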


Structure distributions for main protease (3CL protease)

The structures from MD simulations populate three regions. DiG-generated structures overlap well with the middle and lower regions. The conformational space around the top region is covered less well by the present model, showing room for improvement. Note that AlphaFold-predicted structures are concentrated in the middle region.

Generating functional conformation states

Many proteins have multiple metastable structures that are closely related to their functions.

Four proteins that have two distinguishable structures are shown to demonstrate DiG's capabilities in predicting diverse structures associated with different functional states.

Experimentally determined structures are shown as cylinder cartoons; DiG-generated structures are superposed on these experimental structures in ribbon representation. Load the 3D structures for interactive visualization.



Experimentally determined structures corresponding to two functional states are shown in cylinder cartoon representation (blue vs. red). DiG-generated structures (in thin ribbon representation) are superposed on the experimental structures. Drag the molecules with the mouse to view them from different angles.

Conformation transition pathway prediction

DiG can generate plausible transition pathways that connect two structures. Adenylate kinase and the LmrP membrane protein each have two functional states, corresponding to open and closed conformations. Using an interpolation approach, DiG generates pathways connecting the open and closed states.

Adenylate kinase

LmrP membrane protein

The open and closed structures are shown as semi-transparent ribbons, and the structures along the predicted pathways are shown as cartoons (each secondary structure element is colored differently for clarity).
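The page does not spell out the interpolation scheme. As a minimal sketch, assuming each end state is represented by a fixed-length vector (for example, a latent representation that a model would later map back to a structure), a pathway can be approximated by evenly spaced points on the line between the two vectors. The function `interpolate_path` and the example vectors are hypothetical, and the step that turns each intermediate vector into a 3D structure is not shown.

```python
def interpolate_path(start, end, n_points):
    """Return n_points vectors evenly spaced on the straight line from start to end."""
    if len(start) != len(end):
        raise ValueError("endpoints must have the same dimension")
    if n_points < 2:
        raise ValueError("need at least the two endpoints")
    path = []
    for i in range(n_points):
        t = i / (n_points - 1)  # 0.0 at the start state, 1.0 at the end state
        path.append([(1.0 - t) * a + t * b for a, b in zip(start, end)])
    return path

# Hypothetical 3-dimensional representations of the open and closed states.
open_state = [0.0, 2.0, -1.0]
closed_state = [4.0, 0.0, 1.0]
path = interpolate_path(open_state, closed_state, 5)
for point in path:
    print(point)
```

Linear interpolation is the simplest choice; other schemes (for example, spherical interpolation between Gaussian latents) would slot into the same loop.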

Ligand binding structure generation for given protein pocket

Process of ligand generation

The following animations show the process of ligand structure generation. The atoms of ligands are shown as spheres, which gradually converge to their final positions, predicting the binding poses within given protein pockets.


Tyk2 binding with Tyk2lig

P38 binding with p38lig

More examples of the predicted binding structures

The figures can be interactively visualized by manipulating the molecules with a mouse pointer. Scroll left or right for more examples.

Adsorbate configuration sampling on catalytic surfaces

Stable adsorbate configuration sampling on catalytic surfaces

Diffusion process of an acyl group on a stepped TiIr alloy surface. Multiple stable configurations of the acyl group are found on this surface. In the three adsorption configurations, the oxygen atom of the adsorbate stabilizes at three different sites, and the methyl group has different orientations. In the left and right configurations, the oxygen atom sits between two titanium atoms across the step of the catalyst surface. In the middle configuration, the oxygen atom sits between two adjacent titanium atoms at the same height.

Panels: left, top, right

Interpolating for the transition path of adsorbates

Interpolated transition path between adjacent stable configurations of an acyl group on a stepped TiIr alloy. The acyl group diffuses between different adsorption sites, with nearly free rotation of the methyl group.


Sampling the distribution of adsorbates on the catalytic surface

Adsorption prediction results of single N and O atoms on catalyst surfaces, compared to DFT calculations. The probability distribution of adsorbate molecules on the corresponding catalyst surfaces is shown with the contour map. Drag the sliding bar to increase the number of samples of adsorbates on the catalytic surface.

Property-guided structure generation (Inverse design)

The following shows the diffusion process of carbon structures under the guidance of an electronic band gap predictor. Select different tabs to see how typical carbon structures with different predicted band gaps evolve. Note that several known carbon crystals, including diamond and graphite, are generated along the way, and their band gaps vary significantly.

Tabs: bandgap01, bandgap02, bandgap03

Three examples of electronic band gap-guided generation of carbon structures with a target band gap of 0 eV. Sampling starts from a random unit cell, with carbon atoms randomly spread around the center of the cell. Both the lattice vectors and the positions of the carbon atoms are denoised with a score ODE, with gradients from the band gap predictor steering the sample towards the target band gap. With a target band gap of 0 eV, the model has a higher probability of generating structures close to graphite (the left structure).
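The guidance idea can be illustrated on a one-dimensional toy problem. The sketch below is not the page's method: it uses unadjusted Langevin dynamics rather than a score ODE, a standard Gaussian stands in for the learned score, and the property predictor is simply the identity. All names and parameter values are hypothetical. What it does share with the description above is the core step of adding the gradient of a property-matching term to the score.

```python
import math
import random

def prior_score(x):
    """Score (gradient of the log-density) of a standard Gaussian; stands in for a learned model."""
    return -x

def property_predictor(x):
    """Hypothetical differentiable property predictor; here the property is x itself."""
    return x

def guidance_gradient(x, target, strength):
    """Gradient with respect to x of -strength / 2 * (property(x) - target)^2."""
    # d(property)/dx is 1 for this toy predictor.
    return -strength * (property_predictor(x) - target)

def guided_langevin(target, strength, n_steps=500, step=0.01, rng=None):
    """Draw one sample with unadjusted Langevin dynamics using the guided score."""
    rng = rng or random.Random()
    x = rng.gauss(0.0, 1.0)  # start from the simple prior
    for _ in range(n_steps):
        score = prior_score(x) + guidance_gradient(x, target, strength)
        x += step * score + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [guided_langevin(target=3.0, strength=4.0, rng=rng) for _ in range(400)]
mean = sum(samples) / len(samples)
print(round(mean, 2))
```

For this toy setup the guided distribution is Gaussian with mean strength * target / (1 + strength), which is 2.4 for the values used, so the sample mean moves from 0 (unguided) most of the way towards the target of 3. A larger `strength` pulls samples closer to the target at the cost of diversity.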

Approach


DiG attempts to predict the complicated equilibrium distribution of a given system by gradually transforming a simple distribution (e.g., a standard Gaussian) through the simulation of a predicted diffusion process that leads towards the equilibrium distribution. To predict the diffusion process, DiG employs Graphormer as the deep learning backbone; Graphormer has shown superior performance in processing molecular structures and generalizing across various molecular systems. Through the diffusion process, molecular structure samples are generated by step-by-step refinement, which breaks the complexity of the equilibrium distribution into simpler steps. The samples are generated independently and in parallel, which avoids the very long sequential simulations required by methods such as molecular dynamics. Besides sampling, DiG can also provide a normalized density estimate for the equilibrium distribution by tracking the change of probability along the diffusion process. A DiG model can be trained from flexible types of sources, including simulation data such as molecular dynamics trajectories, as well as the energy functions (force fields) of molecular systems. In all, DiG is a flexible and efficient framework that can handle the microscopic statistics of different types of molecular systems and descriptors.
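The two mechanisms described above, sampling by reversing a diffusion and density estimation by tracking the change of probability, can be sketched on a one-dimensional toy system. Everything specific here is an assumption made for illustration: the "equilibrium" distribution is a Gaussian N(MU, SIGMA^2), the forward diffusion is dx = -x dt + sqrt(2) dW, and the exact score of the diffused distribution replaces the neural network that DiG would train. The sketch integrates the corresponding probability-flow ODE backward in time with Euler steps.

```python
import math
import random

MU, SIGMA = 2.0, 0.5   # toy "equilibrium" distribution N(MU, SIGMA^2)
T_END = 5.0            # diffusion time at which the marginal is close to N(0, 1)

def marginal_mean_var(t):
    """Mean and variance of the toy target after diffusing for time t."""
    decay = math.exp(-2.0 * t)
    return MU * math.exp(-t), SIGMA**2 * decay + 1.0 - decay

def score(x, t):
    """Exact score of the diffused marginal; a trained network would approximate this."""
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def generate(x_start, n_steps=2000):
    """Integrate the probability-flow ODE from T_END back to 0 with Euler steps.

    Returns the generated sample and an estimate of its log-density.
    """
    dt = T_END / n_steps
    x = x_start
    # Start from the log-density of the simple distribution N(0, 1).
    log_p = -0.5 * x_start**2 - 0.5 * math.log(2.0 * math.pi)
    for k in range(n_steps):
        t = T_END - k * dt
        _, var = marginal_mean_var(t)
        velocity = -x - score(x, t)      # dx/dt of the probability-flow ODE
        divergence = -1.0 + 1.0 / var    # d(velocity)/dx
        x -= dt * velocity               # step backward in time
        log_p += dt * divergence         # change of variables along the flow
    return x, log_p

# Each sample depends only on its own starting noise, so samples are independent.
rng = random.Random(0)
samples = [generate(rng.gauss(0.0, 1.0), n_steps=500)[0] for _ in range(200)]
sample_mean = sum(samples) / len(samples)

x0, log_p = generate(0.5)
exact_log_p = (-0.5 * ((x0 - MU) / SIGMA) ** 2
               - math.log(SIGMA) - 0.5 * math.log(2.0 * math.pi))
print(round(sample_mean, 2), round(log_p, 3), round(exact_log_p, 3))
```

Because the toy target is Gaussian, the estimated log-density can be checked against the closed-form value; the two agree up to the Euler discretization error and the small mismatch between N(0, 1) and the true marginal at `T_END`.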

BibTeX

@article{zheng2024predicting,
  title={Predicting equilibrium distributions for molecular systems with deep learning},
  author={Zheng, Shuxin and He, Jiyan and Liu, Chang and Shi, Yu and Lu, Ziheng and Feng, Weitao and Ju, Fusong and Wang, Jiaxi and Zhu, Jianwei and Min, Yaosen and others},
  journal={Nature Machine Intelligence},
  pages={1--10},
  year={2024},
  publisher={Nature Publishing Group UK London}
}