Exploring Multi-Modal Machine Learning for Cancer Sub-Typing and Survival Analysis
This post is co-written with Francisco Azuaje from Genomics England.
Genomics England is at the forefront of analyzing sequenced genomes for the National Health Service (NHS) in the United Kingdom. They equip researchers to use data to advance biological research with the goal of helping people live longer, healthier lives. Genomics England is interested in using machine learning to accurately identify cancer subtypes and severity through a multi-modal approach.
To enhance their dataset, Genomics England has launched a multi-modal program. As part of it, they partnered with the AWS Global Health and Non-profit Go-to-Market (GHN-GTM) Data Science and AWS Professional Services teams to create an automatic cancer subtyping and survival prediction pipeline and to explore its accuracy on publicly available data.
Data Utilized
The proof of concept exercises used publicly available cancer research data from The Cancer Genome Atlas (TCGA). This data included high-throughput genome analysis and diagnostic whole slide images with ground-truth survival outcomes and histologic grade labels for breast cancer (BRCA) and gastrointestinal cancer types (Pan-GI).
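Survival predictions against ground-truth outcomes like these are commonly scored with a concordance index (C-index). The post does not state which metric was used, so the following is only a minimal sketch of Harrell's C on made-up toy values (not TCGA data). The function name and the convention that a higher risk score means shorter expected survival are assumptions for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable patient pairs that are ranked correctly.

    times  -- observed follow-up times
    events -- 1 if the event (death) was observed, 0 if censored
    risks  -- predicted risk scores (assumed: higher = shorter survival)
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times, and skip pairs where the
        # earlier time is censored (the true ordering is then unknown).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example with invented values.
times = [5, 10, 12, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which makes the score comparable across uni-modal and multi-modal models.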
Multi-Modal Machine Learning Frameworks
In the proof of concept (PoC) exercises, three phases were executed to develop multi-modal subtyping and survival prediction ML pipelines. The first phase implemented the state-of-the-art Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE). The second phase built on this with a novel architecture, Hierarchical Extremum Encoding (HEEC), developed to address limitations observed in PORPOISE. The final phase introduced Hierarchical Image Pyramid Transformer (HIPT), a self-supervised learning-based model, to further enhance performance.
Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE)
PORPOISE consists of three sub-network components: CLAM, an attention-based multiple-instance learning network for whole slide images; a self-normalizing network for molecular features; and a multi-modal fusion layer that combines the two. Although PORPOISE performed well overall, it exhibited reduced multi-modal performance in some scenarios, highlighting the need for further refinement.
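To make the three components more concrete, here is a rough sketch of the operations such a design typically relies on: attention-weighted pooling of patch embeddings, the SELU activation used by self-normalizing networks, and an outer-product style fusion of the two modality vectors. The post does not describe the internals, so the function names, the choice of outer-product fusion, and the fixed attention scores (which a real model would learn) are all assumptions for illustration.

```python
import math


def attention_pool(patch_embeddings, scores):
    """Softmax-weighted average of patch embeddings for one slide.

    `scores` stands in for attention logits; in a trained model these
    would come from a learned attention network, not be passed in.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patch_embeddings[0])
    return [sum(w * p[k] for w, p in zip(weights, patch_embeddings))
            for k in range(dim)]


def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """SELU activation, the nonlinearity behind self-normalizing networks."""
    return scale * (x if x > 0 else alpha * (math.exp(x) - 1.0))


def outer_product_fusion(h_image, h_omic):
    """Flattened outer product of [h_image; 1] and [h_omic; 1].

    Appending 1 keeps the uni-modal terms alongside the cross terms.
    """
    a = list(h_image) + [1.0]
    b = list(h_omic) + [1.0]
    return [x * y for x in a for y in b]


# Toy inputs (invented): two patches with 2-d embeddings, two omic features.
h_image = attention_pool([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
h_omic = [selu(v) for v in [1.0, -1.0]]
fused = outer_product_fusion(h_image, h_omic)
print(len(fused))  # (2 + 1) * (2 + 1) = 9
```

The fused vector grows with the product of the two embedding sizes, which is one reason such designs keep the per-modality embeddings small before fusing them.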
Hierarchical Extremum Encoding (HEEC)
HEEC was developed by AWS to overcome the limitations of PORPOISE. Using tree ensembles and a novel encoding scheme, HEEC offers interpretable representations with enhanced spatial relationships for accurate prediction across multiple modalities.
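The post does not give the details of the HEEC encoding, so the following is not the AWS implementation. It is only an illustration of what an extremum-style encoding could look like: a variable-size bag of instance feature vectors is summarized by its per-feature minima and maxima, and the same step is applied again one level up the hierarchy. The result is a fixed-length, directly readable vector of the kind a tree ensemble can consume. The function names and the two-level region/slide hierarchy are assumptions.

```python
def extremum_encode(instances):
    """Summarize a bag of feature vectors as [per-feature mins, per-feature maxs]."""
    if not instances:
        raise ValueError("empty bag")
    columns = list(zip(*instances))  # one tuple of values per feature
    return [min(c) for c in columns] + [max(c) for c in columns]


def hierarchical_extremum_encode(slide):
    """Encode a slide given as a list of regions, each a list of instance vectors.

    Level 1 encodes the instances within each region (d features -> 2d values);
    level 2 encodes the region codes across the slide (2d -> 4d values).
    """
    region_codes = [extremum_encode(region) for region in slide]
    return extremum_encode(region_codes)


# Toy slide (invented): two regions, 2-d instance features.
slide = [
    [[1, 5], [3, 2]],  # region with two instances
    [[0, 4]],          # region with one instance
]
print(hierarchical_extremum_encode(slide))
```

Each output position has a plain meaning (for example, "the smallest per-region maximum of feature 0"), which is one way an encoding like this can stay interpretable regardless of how many instances a slide contains.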
Hierarchical Image Pyramid Transformer (HIPT)
HIPT employs self-supervised learning techniques for imaging modalities and showed improved performance compared to pre-trained encoders. The embeddings generated by HIPT also offer practical benefits such as faster training times and smaller feature footprints.
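A back-of-the-envelope calculation shows why region-level embeddings can shrink the feature footprint. The sketch below assumes a hierarchy that aggregates 256 x 256 pixel patches into 4096 x 4096 pixel regions. The number of regions per slide and both embedding sizes are hypothetical values chosen for illustration, not figures from the post.

```python
def footprint_bytes(n_vectors, dim, bytes_per_value=4):
    """Storage needed for n_vectors embeddings of size dim (float32 by default)."""
    return n_vectors * dim * bytes_per_value


REGION_PX = 4096  # assumed region side length in pixels
PATCH_PX = 256    # assumed patch side length in pixels
patches_per_region = (REGION_PX // PATCH_PX) ** 2  # 16 * 16 patches

n_regions = 40     # hypothetical number of tissue regions in one slide
patch_dim = 1024   # hypothetical size of a patch-level feature vector
region_dim = 192   # hypothetical size of a region-level embedding

# Storing one vector per patch versus one vector per region.
patch_level = footprint_bytes(n_regions * patches_per_region, patch_dim)
region_level = footprint_bytes(n_regions, region_dim)

print(f"patch-level:  {patch_level / 1e6:.1f} MB")
print(f"region-level: {region_level / 1e3:.1f} KB")
```

Under these assumptions the downstream model reads far fewer and shorter vectors per slide, which is consistent with the faster training times noted above.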
Architecture on AWS
To support the multi-modal ML workflows, a reference architecture was built using Amazon SageMaker, allowing for efficient model training and deployment. Key patterns included decoupling data pre-processing from model training, separating development and production environments, and leveraging CI/CD pipelines for automation.
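The pattern of decoupling pre-processing from training can be sketched without any cloud dependency: one step writes features to a storage location, and the next step reads only from that location. On SageMaker this might map to a processing job that writes to Amazon S3 and a training job that reads from it, but the post does not specify this, so treat the mapping as an assumption. All names below are hypothetical, and the "model" is a placeholder majority-class baseline.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def preprocess(raw_records, out_dir):
    """Step 1: derive features from raw records and persist them."""
    features = [
        {"id": r["id"], "x": [float(v) for v in r["values"]], "label": r["label"]}
        for r in raw_records
    ]
    out_path = Path(out_dir) / "features.json"
    out_path.write_text(json.dumps(features))
    return out_path


def train(features_path):
    """Step 2: read only the persisted features, never the raw data.

    Placeholder model: predicts the most frequent label in the training set.
    """
    features = json.loads(Path(features_path).read_text())
    counts = Counter(f["label"] for f in features)
    return {"majority_label": counts.most_common(1)[0][0], "n_train": len(features)}


def run_pipeline(raw_records):
    """Run both steps, passing only a file path between them."""
    with tempfile.TemporaryDirectory() as workdir:
        features_path = preprocess(raw_records, workdir)
        return train(features_path)


# Invented records for illustration.
raw = [
    {"id": "a", "values": ["0.1", "2.0"], "label": "grade_2"},
    {"id": "b", "values": ["0.3", "1.5"], "label": "grade_2"},
    {"id": "c", "values": ["0.9", "0.2"], "label": "grade_3"},
]
print(run_pipeline(raw))
```

Because the only contract between the steps is the stored feature file, expensive pre-processing can run once and be reused across many training experiments.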
Separation between Development and Production Environments
Genomics England maintains separate ML environments for testing and production, ensuring data integrity and security. Synthetic data is used in the testing environment to simulate real-world scenarios without compromising sensitive information.
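As a rough illustration of the synthetic-data idea, the sketch below generates records that follow a fixed schema but contain only randomly drawn values. The schema, field names, and distributions are hypothetical and are not Genomics England's. The point is only that code paths can be exercised end to end without any real patient data.

```python
import random


def make_synthetic_cohort(n, n_genes=5, seed=0):
    """Generate n synthetic patient records with a fixed, made-up schema."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    cohort = []
    for i in range(n):
        cohort.append({
            "patient_id": f"SYN-{i:04d}",  # clearly marked as synthetic
            "expression": [rng.gauss(0.0, 1.0) for _ in range(n_genes)],
            "survival_months": round(rng.expovariate(1 / 36.0), 1),
            "event_observed": rng.random() < 0.7,
            "grade": rng.choice([1, 2, 3]),
        })
    return cohort


cohort = make_synthetic_cohort(3)
for record in cohort:
    print(record["patient_id"], record["grade"], record["survival_months"])
```

Seeding the generator means the same synthetic cohort is produced on every run, so a failing test in the testing environment can be reproduced exactly.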
Automation with CI/CD Pipelines
Automation techniques such as CI/CD pipelines help reduce deployment errors and keep results reproducible between environments. These pipelines enable automated testing, code quality checks, and deployment of artifacts into target AWS accounts.
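One simple check a pipeline could run to support reproducibility between environments is to confirm that every target environment received byte-identical artifacts. The sketch below compares SHA-256 digests. The function names and the in-memory stand-ins for deployed artifacts are hypothetical, and the post does not say that this particular check is used.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """Content hash that identifies an artifact independently of its location."""
    return hashlib.sha256(data).hexdigest()


def find_mismatched_environments(built: bytes, deployed: dict) -> list:
    """Return the environments whose deployed artifact differs from the built one."""
    expected = artifact_digest(built)
    return [env for env, blob in deployed.items()
            if artifact_digest(blob) != expected]


# Invented example: the same build promoted to two environments.
built = b"model-package-v1"
deployed = {"test": b"model-package-v1", "production": b"model-package-v1"}
print(find_mismatched_environments(built, deployed))
```

An empty result means all environments hold the same artifact; a pipeline stage could fail the deployment whenever the list is non-empty.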
Conclusion
The collaboration with Genomics England highlights the potential of multi-modal machine learning in cancer research. By combining genomic and imaging data, advanced ML models can enhance cancer subtyping and survival analysis. Adopting engineering best practices and state-of-the-art frameworks supports reproducible, impactful research and helps push the boundaries of precision medicine.
Genomics England’s dedication to leveraging AWS cloud computing and cutting-edge technologies demonstrates their commitment to transforming healthcare through data-driven innovations.
Acknowledgements
The results in this post are based on data from The Cancer Genome Atlas (TCGA). Special thanks to Dr. Prabhu Arumugam, Director of Clinical Data and Imaging at Genomics England, and Francisco Azuaje, Director of Bioinformatics at Genomics England.
About the Authors
Cemre Zor, PhD: Senior Healthcare Data Scientist at AWS
Tamas Madl, PhD: Former Senior Healthcare Data Scientist at AWS
Epameinondas Fritzilas, PhD: Senior Consultant at AWS
Lou Warnett: Healthcare Data Scientist at AWS
Sam Price: Professional Services Consultant at AWS
Shreya Ruparelia: Data & AI Consultant at AWS
Pablo Nicolas Nunez Polcher, MSc: Senior Solutions Architect at AWS
Matthew Howard: Head of Healthcare Data Science at AWS
Tom Dyer: Senior Product Manager at Genomics England
Samuel Barnett: Applied Machine Learning Engineer at Genomics England
Prabhu Arumugam: Former Director of Clinical Data and Imaging at Genomics England
Francisco Azuaje, PhD: Director of Bioinformatics at Genomics England
Stay tuned for more innovative updates in healthcare and data science from Genomics England and AWS!