The Evolution and Impact of Predictive Protein Folding Algorithms
Within the field of proteomics, “computational structure-based protein design is one of the most promising tools for engineering proteins with new functions”.[1] The process at which a chain of amino acids, a polypeptide being synthesized, folds upon itself to form a 3D protein structure, is complex and governed by a series of sulfur-sulfur covalent bonds, the covalent bonds holding the individual residues together [2], hydrophobic interactions, as well as the weaker non-covalent bonds; hydrogen bonds, electrostatic attractions and van der Waals attractions.[3] If one were able to predict, with 100% certainty, the final folded shape of a protein, based upon its amino acid sequence, a whole new field of custom protein engineering could be formed. Imagine a software program showing a 3D protein scaffold, where the researcher could turn and manipulate the protein to view it at any angle, along its X, Y or Z axes. The program could have a menu of protein substructures, features, domains, reactant and docking sites, etc., where the user could select them and add them to the protein at the desired location. Once the researcher has designed the desired protein, the system could generate a complementary amino acid sequence for synthesis in a laboratory.
A protein folds into a final conformation in which its free energy is minimized. It is this free energy minimization that is at the heart of the protein folding algorithms. The empirical potential functions are a set of mathematical functions describing the various energy models involved, where forces between the particles and their potential energies are calculated using interatomic potentials or molecular mechanics force fields.[4]
Locating the global minimum across the empirical potential functions is complicated due to the multitude of localized minima over the multidimensional energy surface. The larger the surface, or longer the peptide chain, the more complex. Researchers at Cornell University in 1987 employed a Monte Carlo minimization approach and were able to locate the lowest-energy minimum for enkephalin, a brain pentapeptide, in the absence of water.[5] Monte Carlo methods depend on continued, repetitious, random sampling in order to obtain numerical results and rely on the optimization of this randomness to solve what some may consider to be a deterministic problem. Their work suggested the process of protein folding may be a Markov process; a state machine where one can predict the next state based upon output parameters. A hidden Markov model, which best represents the protein folding problem, is where the states are unknown or hidden, but the input and output parameters are known.[6]
A few years later Seetharamulu and Crippen detail their efforts for “locating nearly the global minimum [energy of a polypeptide chain] as a means of predicting globular protein conformation.”[7] Their work mentions numerous pre-existing programs for calculating the quantum mechanical and empirical energies. And numerous programs for minimizing the energy as a function of conformation. Their endeavour towards optimization of the empirical potential functions allowed them to correctly predict the native conformation of the proteins apamin and melittin, where previous programs failed. This proved to be a great advance as they were able to predict the structure of actual, albeit small (<30 amino acids), proteins.
The problem still exists today and seems as though the cumulative research is only asymptotically solving the problem. The early focus and methods around free energy minimization are still being used. The issue now being the computational expense of the models. Even with the advances in microprocessors and general progress in computing power from the past two decades, the search space itself becomes astronomically large with larger amino acid chains. This is mostly due to the free rotation of the peptide-peptide bonds and the energy models themselves being excessively complex. As such, the success of fold prediction algorithms has been limited to small scale proteins (<150 amino acids). To deal with this computational expense issue, researchers are working to limit unnecessary computations by working through an on-lattice model. The idea being to reduce the search space and the complexity of the hydrophobic-polar energy model. On a set of benchmark proteins, they claim the optimized genetic algorithm “significantly outperforms state-of-the-art algorithms.”[8]
Researchers Bošković and Brest are using a differential evolution algorithm against a three-dimensional AB off-lattice model. The three-dimensional structure of an AB sequence is defined by bond angles, torsional angles and the unit-length chemical bond between the two adjacent amino acids. This simplified AB off-lattice model, by taking into account the hydrophobic interactions, utilizes the main driving forces of protein structure formation. “This model realistically imitates the main features of protein structure.”[9]
Other methodologies utilize intermediary steps in the three-dimensional protein structure prediction like structural class predictions (SCP) and protein fold recognition (PFR). Classifying a protein to its structure or fold can be used as a precursor to predicting the final conformation. SCP and PFR utilize syntactic and evolutionary-based information in addition to physicochemical-based information to extract features. Raicar, et al., focused their efforts on the physicochemical-based information. “Features which are dependent on physicochemical attributes can reveal global properties of proteins.”[10] The concept of adding information and knowledge to the models is adding momentum. From using surface accessibility data obtained from nuclear magnetic resonance spectroscopy[11] to the prediction of protein folding rates using a fuzzy cognitive map model based on deep learning neural networks.[12] Each of these advances incrementally improves the accuracy and convergence of protein fold prediction.
The protein fold problem represents one of the most challenging and yet promising problems in bioinformatics. The problem attracts researchers from a variety of disciplines including chemistry, biology, physics, mathematics and computer science. Most approaches utilize an energy model and an algorithm to process the model. It is clear that the two must evolve in tandem for future success. By improving the energy models and adding new information as it is understood, we gain a more thorough representation of the real world. But a more thorough representation within the energy models does us no good if the algorithms are not able to utilize the information, or may even be hindered by using it incorrectly. Some of the latest approaches, of minimizing the search space, predicting fold rates, and utilizing surface accessibility data gleaned from other sources offers improvements to both the models and the algorithms used to process them. This, it seems, will be the focus of efforts in the near term.
Since the protein fold prediction problem has had substantial attention paid to it over the past two decades, yet appears to only be asymptotically solved, the promise of large custom proteins seems fleeting. The day when a researcher can sit down at a computer terminal and design a substantial, custom, protein, complete with the structures and features desired, may not come for a while. But the implications for proteomics and drug discovery are immense and each algorithmic advance in predicting final protein conformations, gets us a bit closer to the realization.
References
Gainza P, Nisonoff HM, Donald BR. Algorithms for protein design. Current Opinion in Structural Biology. 2016;39:16–26.
Yeargers EK, Shonkwiler RW, Herod JV. An introduction to the mathematics of biology: with computer algebra models. Boston: Birkhäuser; 1996.
Alberts B, Bray D, Hopkin K, Johnson A, Lewis J, Raff MC, et al. Essential cell biology. New York, NY: Garland Science; 2014.
Molecular Dynamics [Internet]. Wikipedia. Wikimedia Foundation; [cited 20206Sep20]. Available from: https://en.wikipedia.org/wiki/Molecular_dynamics
Li Z, Scheraga HA. Monte Carlo-minimization approach to the multiple-minima problem in protein folding. Proceedings of the National Academy of Sciences. 1987Jan;84(19):6611–5.
Markov model [Internet]. Wikipedia. Wikimedia Foundation; [cited 2020Sep20]. Available from: https://en.wikipedia.org/wiki/Markov_model
Seetharamulu P, Crippen GM. A potential function for protein folding. Journal of Mathematical Chemistry. 1991;6(1):91–110.
Rashid MA, Khatib F, Hoque MT, Sattar A. An Enhanced Genetic Algorithm for Ab Initio Protein Structure Prediction. IEEE Transactions on Evolutionary Computation. 2020Sep;20(4):627–44.
Bošković B, Brest J. Differential evolution for protein folding optimization based on a three-dimensional AB off-lattice model. Journal of Molecular Modeling. 2016;22(10).
Raicar G, Saini H, Dehzangi A, Lal S, Sharma A. Improving protein fold recognition and structural class prediction accuracies using physicochemical properties of amino acids. Journal of Theoretical Biology. 2016;402:117–28.
Hartlmüller C, Göbl C, Madl T. Prediction of Protein Structure Using Surface Accessibility Data. Angewandte Chemie. 2016;128(39):12149–53.
Liu L, Ma M, Cui J. A novel model-based on FCM–LM algorithm for prediction of protein folding rate. Journal of Bioinformatics and Computational Biology. 2017;15(04):1750012.
Robert Fawcett September 30, 2020