Artificial intelligence (AI) has solved one of the great challenges of biology: predicting how proteins fold from a chain of amino acids into the 3D shapes that let them perform life’s tasks. This week, the organizers of a protein-folding competition announced the achievement by researchers at DeepMind, a UK-based AI company. They say the DeepMind method will have far-reaching consequences, including dramatically speeding up the creation of new drugs.
“What the DeepMind team has been able to achieve is fantastic and will change the future of structural biology and protein research,” said Janet Thornton, honorary director of the European Bioinformatics Institute. “It’s a 50-year-old problem,” added John Moult, a structural biologist at the University of Maryland, Shady Grove, and co-founder of the Critical Assessment of Protein Structure Prediction (CASP) competition. “I never thought I would see this in my lifetime.”
The body uses tens of thousands of different proteins, each a string of tens to hundreds of amino acids. The order of the amino acids dictates how the countless pushes and pulls between them give rise to a protein’s complex 3D shape, which in turn determines how it functions. Knowing these shapes helps researchers devise drugs that can lodge in a protein’s crevices. And the ability to synthesize proteins with a desired structure could speed the development of enzymes that produce biofuels and break down waste plastic.
For decades, researchers have deciphered protein structures using experimental techniques such as X-ray crystallography or cryo-electron microscopy (cryo-EM). But such methods can take years and do not always work. Structures have been solved for only about 170,000 of the more than 200 million proteins found across life forms.
In the 1960s, researchers realized that if they could work out all the interactions within a protein’s sequence, they could predict its shape. But the amino acids in each sequence can interact in so many different ways that the number of possible structures is astronomical. Computational scientists jumped on the problem, but progress was slow.
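To give a feel for why the number of possible structures is astronomical, here is a back-of-the-envelope sketch in Python. The figures are illustrative assumptions, not numbers from the article: each amino acid is assumed to adopt just 3 local conformations, and a hypothetical computer is assumed to test 10 trillion structures per second. Results are returned as powers of 10 to keep the arithmetic manageable.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_conformations(n_residues, states_per_residue=3):
    """Order of magnitude (log10) of the number of possible chain shapes.

    Assumes, purely for illustration, that every amino acid independently
    picks one of `states_per_residue` local conformations.
    """
    return n_residues * math.log10(states_per_residue)


def log10_search_years(n_residues, states_per_residue=3, tests_per_second=1e13):
    """Order of magnitude (log10) of the years needed to test every shape.

    `tests_per_second` is a hypothetical sampling rate, not a measured one.
    """
    log10_seconds = log10_conformations(n_residues, states_per_residue) - math.log10(tests_per_second)
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    n = 100  # a small protein
    print(f"about 10^{log10_conformations(n):.0f} possible shapes")
    print(f"about 10^{log10_search_years(n):.0f} years to try them all")
```

Even under these generous assumptions, a 100-residue protein has on the order of 10^48 candidate shapes, which is why brute-force enumeration was never a realistic route to prediction.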
In 1994, Moult and colleagues launched CASP, which is held every 2 years. Participants receive amino acid sequences for about 100 proteins whose structures are unknown. Some groups compute a structure for each sequence, while others determine it experimentally. The organizers then compare the computational predictions with the laboratory results and give each prediction a global distance test (GDT) score. Scores above 90 on the 100-point scale are considered on par with experimental methods, says Moult.
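The article does not spell out how a GDT score is calculated. In its commonly described form (GDT_TS), it is the average, over distance cutoffs of 1, 2, 4, and 8 angstroms, of the percentage of amino acids whose predicted position lands within that cutoff of the experimental position. The Python sketch below is a simplified illustration under two assumptions: both structures are given as lists of (x, y, z) alpha-carbon coordinates in the same order, and they have already been superimposed. The full test also searches for the superposition that maximizes the score, which this sketch skips; the function name `gdt_ts` is chosen here for illustration.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS score on a 0-100 scale.

    Assumes both inputs are equal-length lists of (x, y, z) coordinates in
    angstroms, in the same residue order and already superimposed.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")

    # Distance between each predicted position and its experimental counterpart.
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]

    # Fraction of residues falling within each cutoff.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]

    # Average the fractions and rescale to 0-100.
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Toy example: four residues, with prediction errors of 0.5, 1.5, 3 and 9 angstroms.
    experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
    print(gdt_ts(predicted, experimental))  # 56.25
```

In the toy example, one, two, three, and three of the four residues fall within the successive cutoffs, giving (25 + 50 + 75 + 75) / 4 = 56.25. A score above 90 therefore means nearly every residue sits within an angstrom or two of where experiment places it.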
Even in 1994, the predicted structures for small, simple proteins could match experimental results. But for larger, more challenging proteins, the computed structures scored about 20 on the GDT, a “complete disaster,” said Andrei Lupas, a CASP judge and evolutionary biologist at the Max Planck Institute for Developmental Biology. By 2016, competing groups had reached scores of about 40 for the hardest proteins, mostly by drawing insights from known structures of proteins closely related to the CASP targets.
When DeepMind first competed, in 2018, its algorithm, called AlphaFold, relied on this comparative strategy. But AlphaFold also incorporated a computational approach called deep learning, in which software is trained on vast troves of data (in this case, the sequences and structures of known proteins) and learns to detect patterns. DeepMind won easily, beating the competition by an average of 15% on each structure and reaching GDT scores of about 60 for the toughest targets.
But the predictions were still too rough, says John Jumper, who oversees AlphaFold’s development at DeepMind. “We knew how far we were from biological significance.” So the team combined deep learning with an “attention algorithm” that mimics the way a person might assemble a jigsaw puzzle: connecting pieces into clumps (in this case, clusters of amino acids) and then searching for ways to join the clumps into a larger whole. Working with a computer network of about 128 machine learning processors, they trained the algorithm on all 170,000 or so known protein structures.
And it worked. In this year’s CASP, AlphaFold achieved a median GDT score of 92.4. For the most challenging proteins, AlphaFold scored a median of 87, 25 points above the next best predictions. It even excelled at solving the structures of proteins that sit wedged in cell membranes, which are central to many human diseases but notoriously difficult to solve with X-ray crystallography. Venki Ramakrishnan, a structural biologist at the Medical Research Council Laboratory of Molecular Biology, called the result “stunning progress on the problem of protein folding.”
All groups in this year’s competition improved, says Moult. But with AlphaFold, Lupas says, “The game has changed.” The organizers even worried that DeepMind might have cheated in some way. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his team had tried to get its X-ray crystal structure. “We couldn’t solve it.”
But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two helical arms in the middle. The model enabled Lupas and his team to make sense of their X-ray data; within half an hour, they had fitted their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” says Lupas. “They could not possibly have cheated on this. I don’t know how they do it.”
As a condition of entering CASP, DeepMind, like all groups, agreed to disclose enough details about its method for other groups to reproduce it. That will be a boon for experimentalists, who will be able to use accurate structure predictions to make sense of opaque X-ray and cryo-EM data. It could also enable drug designers to quickly work out the structure of every protein in new and dangerous pathogens such as SARS-CoV-2, a key step in the search for molecules to block them, Moult said.
Still, AlphaFold does not do everything well. In CASP, it faltered on one protein, an amalgam of 52 small repeating segments that distort one another’s positions as they assemble. Jumper says the team now wants to train AlphaFold to solve such structures, as well as those of protein complexes that work together to carry out key functions in the cell.
Although one great challenge has fallen, others will undoubtedly emerge. “That’s not the end of it,” Thornton said. “This is the beginning of many new things.”