Proteins are composed of 20 different amino acids, providing a vast range of functions and a high level of diversity. Because the space of possible sequences is so vast (more than 10 to the 100th power possible proteins even at a length of only 80 amino acids, which exceeds the number of atoms in the observable universe), it remains difficult to rationally design de novo proteins without a thorough understanding of the fundamental relationships between amino acid sequence, structure, and function. To overcome this challenge, we combine large-scale measurements with machine learning, including deep learning, to better understand these relationships. Based on this new knowledge, we then design de novo proteins to verify and reanalyze the findings. By repeating this cycle of analysis and design, we aim to achieve both an understanding of the fundamental laws of proteins and rational de novo protein design.
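The size of this sequence space can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes all 20 amino acids are allowed at each of the 80 positions, and it uses the commonly quoted rough estimate of 10 to the 80th power atoms in the observable universe, a figure that does not come from our own work.

```python
import math

NUM_AMINO_ACIDS = 20  # size of the standard amino acid alphabet
LENGTH = 80           # chain length used in the text above

# Exact number of distinct sequences of this length
# (Python integers have arbitrary precision, so this does not overflow).
num_sequences = NUM_AMINO_ACIDS ** LENGTH

# Order of magnitude: log10(20^80) = 80 * log10(20)
log10_sequences = LENGTH * math.log10(NUM_AMINO_ACIDS)

# Assumption: roughly 10^80 atoms in the observable universe (a common estimate).
LOG10_ATOMS_IN_UNIVERSE = 80

print(f"log10 of the number of sequences: {log10_sequences:.1f}")
print("Exceeds 10^100:", num_sequences > 10 ** 100)
print("Exceeds the atom estimate:", num_sequences > 10 ** LOG10_ATOMS_IN_UNIVERSE)
```

Under these assumptions the count comes out at about 10 to the 104th power, so the figure of 10 to the 100th power quoted above is a conservative lower bound.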
Understanding the Fundamental Laws of Proteins:
Anfinsen's dogma, often summarized as "the amino acid sequence of a protein determines its structure, and its structure determines its function," was proposed 50 years ago and, with a few exceptions, remains the fundamental law of proteins today. Although this law has become common knowledge in biology, it is still very difficult to predict the structure, features, and function of a protein from its amino acid sequence alone. However, with the rapid development of information science in recent years, especially deep learning, combined with the vast amount of accumulated data, accurate prediction of protein structures is now becoming possible. To predict other features and functions of proteins with similar accuracy, we will acquire large-scale data and build new deep learning models. This strategy should enable us both to predict the features and functions of proteins and to understand the fundamental laws that govern them.
Rational Design of Proteins Based on Fundamental Laws:
Although the number of possible proteins is enormous, the number of proteins used by living organisms is approximately 10 to the 12th power, meaning that living organisms exploit only a tiny fraction of the potential of proteins. To make fuller use of this potential, we are designing "de novo proteins," that is, proteins that do not exist in nature. For example, de novo proteins that fluoresce like GFP from jellyfish or emit light like luciferase have been designed, as have de novo proteins that strongly inhibit infection by SARS-CoV-2, the virus that causes COVID-19. However, because our understanding of the fundamental laws of proteins is incomplete, successful de novo designs are still obtained largely "by chance". Our goal is not only to elucidate the fundamental laws of proteins but also to use the results to "rationally" design de novo proteins.