Unfolding the protein folding problem
I was first exposed to a major open problem in a scientific field in my third year of undergraduate studies. My friends and I sat entranced in a lecture hall on the first floor of the Bahen Centre as we learned about the P versus NP problem. This problem deserves its very own blog post, but basically, it asks whether a certain class of hard computational problems can be solved in a “reasonable” amount of time. At the time, part of me imagined students and professors frantically scribbling away their ideas, drafting accidental new papers on their way to solving the enigma, and definitely drinking jug after jug of coffee. The very next semester, I learned about the protein folding problem, an instance of a very hard problem in biology. As an idealistic undergraduate student, I imagined that someday, someone would solve the P versus NP problem and that this would be the key to solving the protein folding problem. Little did I know that it would be advances in machine learning that would change everything!
DNA was established as the hereditary material in 1944, with its three-dimensional double helix structure resolved in the 1950s. Proteins, encoded within DNA’s blueprint, perform various essential functions within organisms. Just think of insulin: this vital protein has extended the lifespan of so many individuals across the world. Crucially, the function of a protein depends on how it folds, that is, on its structure. For example, insulin is inactivated at high temperatures, at least in part due to changes in its structure. Experimental methods for determining a protein’s structure are time-consuming, and can be especially complicated for certain types of proteins.
What if we turn to computational methods? After all, according to 1972 Nobel prize winner Christian Anfinsen, knowledge of the constituents of a protein (its amino acids) and of the conditions in which the protein resides (such as temperature or solution acidity) is sufficient to determine its structure. The difficulty lies in the sheer number of conformations that the amino acids can adopt relative to each other. Consider that the median length of a human protein is 375 amino acids. If you had this many beads in a chain, in how many different ways could you re-structure the chain? Theoreticians have even confirmed the hardness of the problem: simplified models of protein folding have been shown to be computationally hard.
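To get a feel for the beads-in-a-chain question, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every amino acid independently picks one of three local shapes and that a hypothetical machine could check ten trillion conformations per second. Neither number comes from the work discussed here, and real proteins are far more constrained, so treat the result as an order-of-magnitude intuition rather than a measurement.

```python
MEDIAN_HUMAN_PROTEIN_LENGTH = 375  # amino acids, the figure quoted in the text

# Illustrative assumptions, not measured values:
STATES_PER_RESIDUE = 3              # hypothetical number of shapes per amino acid
CONFORMATIONS_PER_SECOND = 10**13   # hypothetical enumeration speed
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def count_conformations(n_residues: int, states_per_residue: int) -> int:
    """Count chain shapes if each residue chooses its state independently."""
    return states_per_residue ** n_residues


def order_of_magnitude(n: int) -> int:
    """Return the largest k such that 10**k <= n, for a positive integer n."""
    return len(str(n)) - 1


def years_to_enumerate(total: int, per_second: int) -> int:
    """Whole years needed to visit every conformation at the given rate."""
    return total // (per_second * SECONDS_PER_YEAR)


total = count_conformations(MEDIAN_HUMAN_PROTEIN_LENGTH, STATES_PER_RESIDUE)
years = years_to_enumerate(total, CONFORMATIONS_PER_SECOND)

print(f"Roughly 10^{order_of_magnitude(total)} conformations")  # Roughly 10^178 conformations
print(f"Roughly 10^{order_of_magnitude(years)} years to try them all")
```

Even under these generous assumptions, trying every conformation one by one would take unimaginably longer than the age of the universe, which is why brute force is not an option.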
In 1994, the Critical Assessment of protein Structure Prediction (or CASP) was founded to measure progress on solving the protein folding problem computationally. Basically, it is a competition held every two years in which participants submit their structure predictions for proteins whose structures have not yet been made public. Predictions are subsequently compared to experimental results. Until 2016, for some of the hardest proteins to characterize, the best accuracy achieved by any team was less than 50%. In 2018, DeepMind’s neural network program, AlphaFold, achieved the highest accuracy, a little less than 60%, for the same category of proteins. Remarkably, in 2020, AlphaFold 2’s accuracy was more than 85% for these proteins: its computational predictions are closely comparable to the experimentally resolved structures!
Imagine being able to know the structure of a protein quickly and accurately! Hundreds of thousands of predictions have already been made by DeepMind. Amongst those are even some protein structures related to the coronavirus responsible for COVID-19. Of course, experimental validation is key, but this is a breakthrough that many did not anticipate happening any time soon. Seeing this impressive work by DeepMind even in the midst of a pandemic only gives me hope for the future of bioinformatics and science at large!