A Virtual Locksmith for Proteins

By Rob Mitchum // March 26, 2014

The interaction between proteins and small molecule ligands is often described as akin to a lock and key. The complex, three-dimensional geometry of a protein’s structure creates selective and unique binding sites that are only accessible to certain molecules positioned in just the right way. To discover new drugs that can ward off a bacterial infection or silence a renegade ion channel, scientists have to first determine the protein’s structure (the shape of the “lock”) and then test millions of molecules to find the right “key.”

At the Center for Structural Genomics of Infectious Diseases (CSGID), scientists from Argonne National Laboratory, the University of Chicago, and the Computation Institute created a new computational pipeline to rapidly perform this feat of molecular locksmithing, a process called virtual ligand screening. APPLIED (Analysis Pipeline for Protein–Ligand Interactions and Experimental Determination) was built from freely available software packages developed by Argonne, UCSF, The Scripps Research Institute, and Harvard University, as well as data from the Protein Data Bank. But you probably won’t want to use APPLIED at home: CSGID currently runs the pipeline on the Argonne Leadership Computing Facility’s 10-petaflop Mira supercomputer, with time awarded by the Department of Energy’s INCITE program.

“As far as I’m aware, APPLIED is the largest implementation of all these advanced methods; it’s something unique that we have,” said T. Andrew Binkowski, CI fellow and researcher at the CSGID. “There’s always a trade-off between what you want to do and how fast you make it work. By putting it on a big computer, we get rid of a lot of those restraints and minimize approximations, and that hopefully makes it more accurate.”

The primary mission of CSGID is to determine the three-dimensional structure for hundreds of proteins from disease-causing bacteria and viruses. Scientists use structural genomics to translate the genetic sequence of a protein into its structure, drawing upon experimental and computational methods that capture or predict the most likely conformation. But a protein’s structure is only the beginning of its story; scientists want to know what binds to the protein, and how that will change its behavior, leading to potential clinical breakthroughs.

Once CSGID solves the structure for a particular protein target, that structure can be dropped into the APPLIED pipeline to look for potential ligand partners. First, software called SurfaceScreen scans the 3-D structure for probable binding sites by comparing it to already-determined proteins in the Protein Data Bank. Then, the pipeline tests potential ligands from the ZINC library, which catalogs tens of millions of commercially available compounds produced by vendors, to look for likely partners.

Virtual ligand screening has to solve two different problems: docking and scoring. Docking is concerned with finding the right “pose” for the ligand to bind with the protein, while scoring looks for the highest binding affinity (how strongly the two molecules connect). APPLIED tackles both problems sequentially, first using software called DOCK and AutoDock to find the right poses, then using surface area and molecular dynamics models to calculate the scores.
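To make the dock-then-score sequence concrete, here is a minimal toy sketch in Python. It is not APPLIED's code and does not call DOCK, AutoDock, or any real chemistry library; the `Pose` class, the ligand names, and the `fit` and `affinity` numbers are all invented for illustration. It only shows the shape of the workflow: pick the best pose for each ligand first, then rank the surviving ligands by affinity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    ligand: str      # hypothetical ligand identifier
    fit: float       # toy docking fit for this pose (higher is better)
    affinity: float  # toy binding affinity for this pose (higher is stronger)


def dock(poses):
    """Stage 1: keep only the best-fitting pose for each ligand."""
    best = {}
    for pose in poses:
        current = best.get(pose.ligand)
        if current is None or pose.fit > current.fit:
            best[pose.ligand] = pose
    return list(best.values())


def score(docked, top_n):
    """Stage 2: rank the docked ligands by affinity, strongest first."""
    ranked = sorted(docked, key=lambda p: p.affinity, reverse=True)
    return [p.ligand for p in ranked[:top_n]]


# Made-up candidates: lig_A has two poses, the others have one each.
candidates = [
    Pose("lig_A", fit=0.2, affinity=9.0),
    Pose("lig_A", fit=0.9, affinity=5.0),
    Pose("lig_B", fit=0.7, affinity=7.5),
    Pose("lig_C", fit=0.8, affinity=3.0),
]
shortlist = score(dock(candidates), top_n=2)
print(shortlist)  # ['lig_B', 'lig_A']
```

In this toy data, lig_A's high-affinity pose is thrown out during docking because it fits poorly, so lig_B ends up ranked first. That is the practical consequence of running the two stages in sequence: scoring only ever sees the poses that docking judged plausible.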

The massive scale of the search, combined with the mathematical complexity of simulating the chemistry of ligand-protein binding, creates the high computational demands of the pipeline. A full run takes 500,000 computing hours, which is a lot for your laptop to handle but translates to only 30 minutes on Mira, because the pipeline is optimized for parallel computing using the CI’s Swift language. The result of each run is a library of the most likely binding partners for the target protein: a list of promising suspects that can then be further probed through experiments. For protein structures that CSGID only recently solved for the first time, these libraries represent a massive information dump of potential scientific and medical opportunities for a previously unknown piece of an infectious organism.
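The laptop-versus-Mira comparison is simple division, sketched below as a back-of-the-envelope check. It rests on assumptions the article does not state: that “computing hours” means core-hours, that the work divides perfectly evenly across cores, that Mira has 786,432 cores (its published figure), and that the laptop has four cores.

```python
def wall_clock_minutes(core_hours, cores):
    """Ideal wall-clock minutes if the work splits evenly across `cores`."""
    return core_hours / cores * 60


CORE_HOURS = 500_000   # from the article
MIRA_CORES = 786_432   # assumption: Mira's published core count, not in the article
LAPTOP_CORES = 4       # assumption: a typical laptop

mira_minutes = wall_clock_minutes(CORE_HOURS, MIRA_CORES)
laptop_years = CORE_HOURS / LAPTOP_CORES / (24 * 365)

print(f"Mira:   about {mira_minutes:.0f} minutes")
print(f"Laptop: about {laptop_years:.0f} years")
```

Under those assumptions the run comes out a little under 40 minutes on the full machine and about 14 years on the laptop. That is the same ballpark as the 30 minutes quoted above, though the real figure depends on how the hours are counted and how well the job scales.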

“A lot of times the proteins we’re interested in, we know the organism is bad, but we don’t know what some of its proteins do. This pipeline allows us to interrogate them,” Binkowski said. “We consider it to be functional characterization of important proteins in human health and disease, that can then be used for very early stage drug discovery.”

You can read more about APPLIED in Structural Genomics and Drug Discovery: Methods and Protocols. CSGID is also actively seeking collaborators for studies using APPLIED and other methods; researchers can submit a target of interest using this form.