Protein Design

Shuguang Zhang (MIT)

TA: Thras Karydis (DeepCure)

Class Outline: http://fab.cba.mit.edu/classes/S63.21/class_site/pages/class_3.html

Skills covered:

Homework

Part A: Exercises

1) Answer any of the following questions:

Part A: Protein Analysis

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins.

What do you know: it's 4 a.m. and I've picked this structure, a domain of a protein classified under Circadian Clock Proteins, in this case from Homo sapiens. I had come across some other CLOCK-related proteins but was really confused by their interactions with other proteins and how they were sorted, so I chose to stick with this structure since it was easier to work with and to identify structural motifs in (this is all quite new to me).
Amino acid sequence (161 residues): MSNEEFTQLMLEALDGFFLAIMTDGSIIYVSESVTSLLEHLPSDLVDQSIFNFIPEGEHSEVYKILSTHLLESDSLTPEYLKSKNQLEFCCHMLRGTIDPKEPSTYEYVKFIGNFKSLNSVSSSAHNGFEGTIQRTHRPSYEDRVCFVATVRLATPQFIKE

Most frequent amino acid: Serine (19 occurrences)

How many protein sequence homologs are there for this protein? pBLAST: 98 homologs

Resolution: 2.31 Å (2.05 Å is the median resolution for X-ray crystallographic results in the Protein Data Bank, so this structure is of roughly median quality). I do not know enough about the entire system, as this is just one of its domains.

Protein family: Circadian Clock Proteins
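The length and residue counts above are easy to double-check with a few lines of Python, using the sequence from this entry:

```python
from collections import Counter

# 161-residue domain sequence copied from the entry above
SEQ = ("MSNEEFTQLMLEALDGFFLAIMTDGSIIYVSESVTSLLEHLPSDLVDQSI"
       "FNFIPEGEHSEVYKILSTHLLESDSLTPEYLKSKNQLEFCCHMLRGTIDP"
       "KEPSTYEYVKFIGNFKSLNSVSSSAHNGFEGTIQRTHRPSYEDRVCFVAT"
       "VRLATPQFIKE")

counts = Counter(SEQ)
residue, n = counts.most_common(1)[0]
print(len(SEQ), residue, n)  # 161 S 19
```

This confirms the sequence is 161 residues long with Serine (S) as the most frequent amino acid, at 19 occurrences.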

Visualizations

Cartoon
Lines
Surface
Sticks

Mesh
Ribbon

Part B: How to (almost) Fold (almost) Anything

In this part you will be folding protein sequences into 3D structures. The goal is to get an understanding of how computational protein modeling works, as well as to see firsthand the great computing power needed for molecular simulations in biology.

Folding Online (Robetta)

First, we will use an online Rosetta engine (Robetta) to get a feel for protein folding. Since folding is a computationally intensive task, we will choose a protein fewer than 100 amino acids long, so we should get the result back in 2-3 days.

For this part of the assignment, I ended up searching UniProt (and had fun learning how the database is organized). I narrowed my search down by organism and sequence length, and ended up sending an 84-residue sequence to Robetta for folding. The sequence I chose was the toxin "Beta-mammal/insect toxin Ts1", which, according to its UniProt entry, is the main neurotoxin of T. serrulatus, also known as the Brazilian yellow scorpion. I used to love reading about neurotoxins growing up and was always fascinated by their individual characteristics and targets, so this exercise really satisfied a curiosity and started a new hobby.
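Before submitting, it's worth sanity-checking that a candidate sequence is under Robetta's quick-turnaround size. A minimal sketch of such a check, using a hypothetical placeholder record (not the real Ts1 sequence):

```python
def fasta_residue_count(fasta: str) -> int:
    """Count residues in a single-record FASTA string, skipping the header line."""
    lines = fasta.strip().splitlines()
    body = [ln.strip() for ln in lines if not ln.startswith(">")]
    return len("".join(body))

# Hypothetical toy record for illustration only, NOT the real Ts1 entry
record = """>toy|hypothetical example record
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY
ACDEFGHIKLMNPQRSTVWY
"""

n = fasta_residue_count(record)
print(n, "residues:", "ok for a quick Robetta run" if n < 100 else "expect a long wait")
```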

Pro Challenge: Folding Offline (PyRosetta)

By following these instructions, you can install Rosetta on your machine and play around with folding proteins at quicker speeds compared to Robetta.

Currently in the process of setting up PyRosetta, but I'm having a lot of trouble getting zsh to work with Anaconda in Terminal. I'll probably explore this more later on!

3D Print Your Protein!

Sending out the protein to be printed.

Part C: Protein Design by Machine Learning

Finally, we would like to design proteins, or optimize existing proteins to have better (or worse) properties and abilities. Over the last few years, machine learning based approaches have revolutionized the protein folding and design field. In this part, we are going to follow along a Jupyter Notebook based on a paper by Facebook AI Research: "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2020)

Embeddings are a machine learning concept that originated in natural language processing and is now used to understand the "language" of protein sequences. An unsupervised ML model is a model that only gets data (X) without any labels and tries to make sense of the patterns within the data. Embeddings are the product of an unsupervised ML model that transforms raw data (e.g. an amino acid sequence) into a vector space while retaining useful properties (e.g. proteins with different biochemical properties end up further apart in that space). What's very useful about embeddings is that we can apply them to unseen data (e.g. a new protein sequence) and get new insights. Another thing we can do is "travel" within the space: given a protein's embedding, can we make it shine brighter? More hydrophobic? Heat-resistant?
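As a toy illustration of the "distance in embedding space" idea, here is a sketch with made-up 3-D vectors standing in for real learned embeddings (real ones are much higher-dimensional and come from a trained model):

```python
import numpy as np

# Made-up embeddings for three hypothetical proteins
embeddings = {
    "protein_A": np.array([0.9, 0.1, 0.0]),
    "protein_B": np.array([0.8, 0.2, 0.1]),   # similar properties to A
    "protein_C": np.array([0.0, 0.1, 0.95]),  # very different from A
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embeddings["protein_A"]
scores = {name: cosine(query, vec)
          for name, vec in embeddings.items() if name != "protein_A"}
nearest = max(scores, key=scores.get)
print(nearest)  # protein_B
```

Proteins with similar properties sit close together in the space, so the nearest neighbor of protein_A is protein_B, not protein_C.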

In this notebook, we are going to use pre-trained protein embeddings from FAIR to train a downstream model that predicts the effect of missense mutations on the protein Beta-lactamase. The original data is from the Envision paper (Gray et al., 2018).

K-nearest-neighbors: SpearmanrResult(correlation=0.769765963571508, pvalue=2.162257615831918e-212)

SVM: SpearmanrResult(correlation=0.7802478215668263, pvalue=6.367922915449953e-222)

Random Forest Regressor: SpearmanrResult(correlation=0.7200744737184535, pvalue=2.777521793929168e-173)