mRNA Vaccine Degradation Prediction

Developed deep learning models to predict mRNA degradation rates for the Stanford OpenVaccine Kaggle competition.

Tags: Bioinformatics, Deep Learning, Kaggle, RNA, Regression

Project Summary

This project tackled the Stanford OpenVaccine Kaggle competition by developing an ensemble of models (GNNs, LSTMs/Transformers, and gradient boosting on engineered features) to predict mRNA degradation rates. The goal was to support more stable mRNA vaccine designs by forecasting degradation from sequence and structural data, with predictions evaluated using the MCRMSE metric. The ensemble achieved competitive results, suggesting that combining diverse models is effective for complex bioinformatics tasks.

Introduction & Background

The stability of messenger RNA (mRNA) is a critical factor in the development of effective vaccines and therapeutics. Unstable mRNA degrades quickly, which reduces its efficacy. The OpenVaccine competition challenged participants to predict degradation at individual positions along an mRNA molecule; such predictions can guide the design of more robust candidates.

The core task was to predict five per-position experimental measurements (reactivity plus four degradation rates measured under different conditions) for the first 68 nucleotides, given the full 107-nucleotide sequence and its predicted secondary structure.

Methodology

Input Data & Features

The models were trained on RNA sequences, predicted secondary structures (dot-bracket notation), and predicted loop types. Example:

Sequence: GGAAAAUGCGACUUGAGUACGGAAAAGUAC...
Structure: .....(((.((...))))))........(...
Loop Type: EEEEESSSHHHISSSSSSEEBBBBBSEEE...
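As a minimal sketch of how the three input tracks might be turned into model-ready tokens, the snippet below maps each character to an integer index. The vocabularies and function names are illustrative assumptions, not the project's actual preprocessing code; the loop-type letters are assumed to follow the bpRNA-style codes (S, M, I, B, H, E, X). A short toy sample is used because the example above is truncated.

```python
# Hypothetical vocabularies (assumptions for illustration only).
SEQ_VOCAB = "ACGU"        # nucleotides
STRUCT_VOCAB = ".()"      # dot-bracket symbols
LOOP_VOCAB = "SMIBHEX"    # assumed bpRNA-style loop-type codes


def encode_track(track, vocab):
    """Map each character of a track to its index in the vocabulary."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in track]


def encode_sample(sequence, structure, loop_type):
    """Return one (seq_id, struct_id, loop_id) tuple per nucleotide."""
    if not (len(sequence) == len(structure) == len(loop_type)):
        raise ValueError("all three tracks must have the same length")
    return list(zip(
        encode_track(sequence, SEQ_VOCAB),
        encode_track(structure, STRUCT_VOCAB),
        encode_track(loop_type, LOOP_VOCAB),
    ))


# Toy 8-nucleotide hairpin, not taken from the competition data.
features = encode_sample("GGAAAUCC", "((....))", "SSHHHHSS")
```

Integer tokens like these would typically feed an embedding layer in an LSTM or Transformer, which is one way the "learned sequence embeddings" mentioned in this write-up could be obtained.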

Key features engineered or learned included:

Sequence k-mers and learned sequence embeddings.
Structural properties like base pairing status, loop type encodings, and distances within the structure.
Graph representations for GNNs, capturing connectivity.
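The base pairing status and graph connectivity listed above can both be derived from the dot-bracket string. The following is a sketch under stated assumptions: the helper names (`pair_partners`, `build_edges`) are hypothetical, pseudoknots are not handled, and a balanced toy structure is used rather than the truncated example shown earlier.

```python
def pair_partners(structure):
    """Return, for each position, the index of its pairing partner or -1."""
    partners = [-1] * len(structure)
    stack = []  # positions of currently unmatched '('
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partners[i] = j
            partners[j] = i
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partners


def build_edges(structure):
    """Build a typed edge list: backbone neighbours plus base pairs."""
    partners = pair_partners(structure)
    edges = [(i, i + 1, "backbone") for i in range(len(structure) - 1)]
    edges += [(i, j, "pair") for i, j in enumerate(partners) if j > i]
    return edges


partners = pair_partners("((....))")
edges = build_edges("((....))")
```

An edge list of this kind is one common input format for GNN libraries; whether the original models distinguished backbone and pairing edges is not stated here, so the edge types are an assumption.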

Figure: Illustrative base pairing.

Modeling Approach

An ensemble learning strategy was employed, combining predictions from models optimized for different aspects of the data:

Sequence Models (LSTM/Transformer): Captured linear sequence patterns.
Structure Models (GNN): Directly learned from the predicted 2D graph structure.
Feature Models (Gradient Boosting): Leveraged specific engineered features.

This multi-pronged approach aimed to create a more robust and accurate final prediction by averaging the outputs.
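The averaging step can be sketched as a weighted mean over per-model predictions. This is an illustrative, dependency-free version; the function name, the optional `weights` parameter, and the nested-list layout (`[position][target]`) are assumptions, and the write-up does not say whether the original ensemble used equal or tuned weights.

```python
def average_predictions(model_preds, weights=None):
    """Blend predictions from several models.

    model_preds: list with one entry per model; each entry is a list of
    rows (positions), and each row holds one value per target.
    weights: optional per-model weights; defaults to a simple mean.
    """
    n_models = len(model_preds)
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))

    n_pos = len(model_preds[0])
    n_tgt = len(model_preds[0][0])
    blended = [[0.0] * n_tgt for _ in range(n_pos)]
    for preds, w in zip(model_preds, weights):
        for p in range(n_pos):
            for t in range(n_tgt):
                blended[p][t] += (w / total) * preds[p][t]
    return blended


# Two hypothetical models, one position, two targets.
blend = average_predictions([[[1.0, 2.0]], [[3.0, 4.0]]])
```

With equal weights this reduces to the plain mean of the model outputs; unequal weights would let a stronger model contribute more.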

Figure: Ensemble model flow. Input data is processed by diverse models, and their outputs are combined for the final prediction.

Results & Conclusion

The ensemble strategy yielded strong results, supporting the value of combining diverse modeling approaches. The final Mean Columnwise Root Mean Squared Error (MCRMSE) on the competition's public leaderboard was approximately 0.26 (example score; lower is better).
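For reference, MCRMSE computes the root mean squared error separately for each target column and then averages those values across columns. The sketch below is a plain-Python illustration of that definition; the function name and the row-major list layout are assumptions, and it is not the competition's official scoring code.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE.

    y_true, y_pred: lists of rows, each row holding one value per target.
    """
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    rmses = []
    for c in range(n_cols):
        # Squared error summed over all rows for this target column.
        sq_err = sum((t[c] - p[c]) ** 2 for t, p in zip(y_true, y_pred))
        rmses.append(math.sqrt(sq_err / n_rows))
    return sum(rmses) / n_cols


# Toy check: column errors of 1 and 2 give RMSEs of 1 and 2, mean 1.5.
score = mcrmse([[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [1.0, 2.0]])
```

Because each column's RMSE is computed before averaging, a target with a larger value range contributes more to the final score unless the targets are on comparable scales.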

Figure: Illustrative performance (MCRMSE per target). Example MCRMSE scores per target variable; lower indicates less error.

This project applied sequence, graph, and feature-based machine learning models to a challenging bioinformatics problem. It highlighted the importance of integrating diverse data types (sequence, structure) and model architectures when predicting complex biological phenomena such as mRNA degradation, and its findings may help inform more stable vaccine designs.