Efficiently predicting gRNA efficiency with SageMaker

The Future of Gene Editing: Advancements in CRISPR Technology

Clustered regularly interspaced short palindromic repeats (CRISPR) technology holds the promise of revolutionizing gene editing, transforming the way we understand and treat diseases. The technique is based on a natural mechanism found in bacteria, in which a protein coupled to a single guide RNA (gRNA) strand locates and cuts specific sites in the target genome. Being able to computationally predict the efficiency and specificity of a gRNA is central to the success of gene editing.

Transcribed from DNA, RNA is an important type of biological sequence composed of ribonucleotides (A, U, G, C) that folds into a 3D structure. Benefiting from recent advances in large language models (LLMs), a variety of computational biology tasks can be solved by fine-tuning biological LLMs pre-trained on billions of known biological sequences. However, downstream tasks on RNA remain relatively understudied.

Solution Overview

Large language models (LLMs) have gained a lot of interest for their ability to encode the syntax and semantics of natural languages. The neural architecture behind LLMs is the transformer, which is composed of attention-based encoder-decoder blocks: the encoder builds an internal representation of the data the model is trained on, and the decoder generates sequences in the same latent space that resemble the original data. Because of their success in natural language, recent work has explored using LLMs for molecular biology data, which is sequential in nature.
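
As a rough illustration of the mechanism inside these blocks, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation of a transformer layer. It is a toy illustration only, not the full encoder-decoder stack:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; the output is an attention-weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                    # weighted sum of values

# Toy example: 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```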

DNABERT Model

DNABERT is a transformer model pre-trained on non-overlapping human DNA sequence data. Its backbone is a BERT architecture made up of 12 encoding layers. The authors of this model report that DNABERT captures a good feature representation of the human genome, enabling state-of-the-art performance on downstream tasks such as promoter prediction and splice/binding site identification. We decided to use this model as the foundation for our experiments.
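
As a concrete sketch, DNABERT checkpoints are published on the Hugging Face Hub and can be loaded with the transformers library. The checkpoint name zhihan1996/DNA_bert_6 and the 6-mer, space-separated tokenization shown below are assumptions based on the public DNABERT release, not necessarily the exact setup of our experiments:

```python
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name; DNABERT variants are published per k-mer size (here, 6-mers).
MODEL_ID = "zhihan1996/DNA_bert_6"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# DNABERT expects DNA split into space-separated k-mers (k-mer size and stride per the DNABERT repo).
sequence = "ATGGCGTACGATCGTACGATCGAT"
kmers = " ".join(sequence[i:i + 6] for i in range(len(sequence) - 5))

inputs = tokenizer(kmers, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size) embeddings
```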

Fine-Tuning with LoRA

Fine-tuning all the parameters of a model is expensive because pre-trained models have grown so large. LoRA (Low-Rank Adaptation) is an innovative technique developed to address the challenge of fine-tuning extremely large language models. With LoRA, the pre-trained model’s weights remain frozen while trainable layers (referred to as rank-decomposition matrices) are introduced within each transformer block. This approach significantly reduces the number of parameters that need to be trained and lowers GPU memory requirements, because most model weights don’t require gradient computations.
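
To make this concrete, here is a minimal sketch using the Hugging Face peft library, one common implementation of LoRA. The checkpoint name, target modules, and hyperparameter values are illustrative assumptions, not the exact configuration used in our experiments:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Assumed checkpoint; a single-output regression head predicts gRNA efficiency.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "zhihan1996/DNA_bert_6", num_labels=1, problem_type="regression"
)

# Keep the pre-trained weights frozen and add trainable rank-decomposition matrices.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the decomposition matrices
    lora_alpha=16,                      # scaling factor for the adapter output
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections to adapt
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()      # only the small adapter matrices are trainable
```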

Hold-out Evaluation Performances

We used RMSE, MSE, and MAE as evaluation metrics and tested LoRA with ranks 8 and 16. As a baseline, we also implemented a simple fine-tuning method that adds several dense layers on top of the DNABERT embeddings. The results show that LoRA outperforms the other methods in predicting the efficiency of CRISPR-Cas9 gRNA sequences.
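
For reference, these metrics can be computed with scikit-learn. The predicted and measured efficiency values below are made-up placeholders to show the calculation, not results from our experiments:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical measured vs. predicted gRNA efficiencies on a hold-out set
y_true = np.array([0.62, 0.15, 0.88, 0.41, 0.73])
y_pred = np.array([0.58, 0.22, 0.80, 0.45, 0.70])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```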

Conclusion

In conclusion, advancements in CRISPR technology, combined with large language models and efficient fine-tuning techniques such as LoRA, show great promise for gene editing. By harnessing the power of computational biology, researchers and scientists can transform the way we approach disease treatment and genetic manipulation.

About the Authors

Siddharth Varia, Yudi Zhang, Erika Pelaez Coyotl, Zichen Wang, and Rishita Anubhai are a team of scientists and researchers at AWS who have a passion for leveraging AI and machine learning to make breakthroughs in healthcare and life sciences.
