Genomics England predicts cancer subtypes and survival with Amazon SageMaker

Exploring Multi-Modal Machine Learning for Cancer Sub-Typing and Survival Analysis

This post is co-written with Francisco Azuaje from Genomics England.

Genomics England analyzes sequenced genomes for the National Health Service (NHS) in the United Kingdom and equips researchers to use that data to advance biological research, with the goal of helping people live longer, healthier lives. Genomics England is interested in using machine learning to accurately identify cancer subtypes and severity through a multi-modal approach.

To enhance their dataset, Genomics England has launched a multi-modal program and partnered with the AWS Global Health and Non-profit Go-to-Market (GHN-GTM) Data Science and AWS Professional Services teams to create an automatic cancer sub-typing and survival detection pipeline. The teams explored the pipeline's accuracy on publicly available data.

Data Utilized

The proof-of-concept (PoC) exercises used publicly available cancer research data from The Cancer Genome Atlas (TCGA). This data included high-throughput genome analysis and diagnostic whole slide images, with ground-truth survival outcomes and histologic grade labels for breast cancer (BRCA) and gastrointestinal cancer types (Pan-GI).

Multi-Modal Machine Learning Frameworks

In the PoC exercises, three phases were executed to develop multi-modal subtyping and survival prediction ML pipelines. The first phase implemented the state-of-the-art Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE). Building on this, a novel architecture called Hierarchical Extremum Encoding (HEEC) was developed to address limitations observed in PORPOISE. The final phase introduced Hierarchical Image Pyramid Transformer (HIPT), a self-supervised learning-based model, to further enhance performance.

Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE)

PORPOISE consists of three sub-network components: CLAM, an attention-based multiple-instance learning network for whole slide images; a self-normalizing network for molecular features; and a multi-modal fusion layer that combines the two. While performant, PORPOISE exhibited reduced multi-modal performance in some scenarios, highlighting the need for further refinement.
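The three-part structure can be illustrated with a minimal pure-Python sketch. This is not the PORPOISE implementation: the weights are fixed toy values rather than learned parameters, and the function names (`attention_pool`, `selu_layer`, `kronecker_fuse`) are hypothetical. It assumes simple attention-weighted pooling over patch embeddings as a stand-in for the CLAM branch, a single SELU layer for the molecular branch, and an outer-product (Kronecker) fusion of the two branch outputs.

```python
import math

# Standard SELU constants used by self-normalizing networks.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patches, attn_vector):
    """Imaging branch: pool a bag of patch embeddings into one slide embedding."""
    weights = softmax([dot(p, attn_vector) for p in patches])
    dim = len(patches[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return pooled, weights


def selu_layer(features, weight_rows, biases):
    """Molecular branch: one dense layer followed by SELU."""
    return [selu(dot(row, features) + b) for row, b in zip(weight_rows, biases)]


def kronecker_fuse(h_image, h_omic):
    """Fusion: outer product of both embeddings, flattened.

    A constant 1.0 is appended to each side so the fused vector also
    contains the unimodal terms, not only the pairwise interactions.
    """
    a = h_image + [1.0]
    b = h_omic + [1.0]
    return [x * y for x in a for y in b]


# Toy inputs (assumed values, for illustration only).
patches = [[0.2, 0.1], [0.9, 0.8], [0.1, 0.3]]
slide_embedding, attn = attention_pool(patches, attn_vector=[1.0, 1.0])
omic_embedding = selu_layer(
    [0.5, -1.2, 0.3],
    weight_rows=[[0.1, 0.2, 0.3], [-0.4, 0.1, 0.0]],
    biases=[0.0, 0.1],
)
fused = kronecker_fuse(slide_embedding, omic_embedding)
```

In a real model the fused vector would feed a prediction head for subtype or survival risk. The outer product grows with the product of the two embedding sizes, which is one reason fusion layers of this kind are usually applied to small embeddings.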

Hierarchical Extremum Encoding (HEEC)

HEEC was developed by AWS to overcome the limitations of PORPOISE. It combines tree ensembles with a novel encoding scheme to produce interpretable representations that capture spatial relationships, supporting accurate prediction across multiple modalities.

Hierarchical Image Pyramid Transformer (HIPT)

HIPT employs self-supervised learning techniques for imaging modalities, showcasing improved performance compared to pre-trained encoders. The embeddings generated by HIPT offer benefits like faster training times and smaller feature footprints.
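The idea of a hierarchical image pyramid, and why it yields a smaller feature footprint, can be sketched as follows. This is a toy illustration only: mean pooling stands in for HIPT's transformer stages, the group size of 4 is an arbitrary assumption, and `mean_vector` and `aggregate` are hypothetical helper names.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def aggregate(vectors, group_size):
    """Summarise each consecutive group of vectors as a single vector."""
    return [
        mean_vector(vectors[i:i + group_size])
        for i in range(0, len(vectors), group_size)
    ]


# 16 toy patch embeddings of dimension 2 (assumed values).
patch_embeddings = [[float(i), float(i % 4)] for i in range(16)]

# Level 1: patches -> regions. Level 2: regions -> one slide embedding.
region_embeddings = aggregate(patch_embeddings, group_size=4)
slide_embedding = mean_vector(region_embeddings)
```

Here 16 patch vectors shrink to 4 region vectors and then to 1 slide vector, so a downstream model trains on far fewer features per slide than it would on raw patch embeddings.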

Architecture on AWS

To support the multi-modal ML workflows, a reference architecture was built using Amazon SageMaker, allowing for efficient model training and deployment. Key patterns included decoupling data pre-processing from model training, separating development and production environments, and leveraging CI/CD pipelines for automation.
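The decoupling pattern can be sketched in plain Python as two stages that communicate only through a persisted artifact. This is an assumption-laden illustration, not the project's code: it does not call the SageMaker SDK, the file name `features.json`, the record fields, and the threshold "model" are all hypothetical, and a local temporary directory stands in for shared storage between a pre-processing job and a training job.

```python
import json
import tempfile
from pathlib import Path


def preprocess(raw_records, out_dir):
    """Stage 1: centre the raw feature and persist the result as an artifact."""
    values = [r["expression"] for r in raw_records]
    mean = sum(values) / len(values)
    features = [
        {"case_id": r["case_id"], "x": r["expression"] - mean, "y": r["label"]}
        for r in raw_records
    ]
    path = Path(out_dir) / "features.json"
    path.write_text(json.dumps(features))
    return path


def train(features_path):
    """Stage 2: fit a threshold classifier using only the persisted features."""
    features = json.loads(Path(features_path).read_text())
    pos = [f["x"] for f in features if f["y"] == 1]
    neg = [f["x"] for f in features if f["y"] == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return {"threshold": threshold}


# Toy raw data (assumed values).
raw = [
    {"case_id": "A", "expression": 1.0, "label": 0},
    {"case_id": "B", "expression": 2.0, "label": 0},
    {"case_id": "C", "expression": 5.0, "label": 1},
    {"case_id": "D", "expression": 6.0, "label": 1},
]

with tempfile.TemporaryDirectory() as workdir:
    artifact = preprocess(raw, workdir)  # can be run once and reused
    model = train(artifact)              # never touches the raw records
```

Because `train` depends only on the artifact path, expensive pre-processing can run once on suitable hardware, and training can be repeated with different settings without recomputing features.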

Separation between Development and Production Environments

Genomics England maintains separate ML environments for testing and production, ensuring data integrity and security. Synthetic data is used in the testing environment to simulate real-world scenarios without compromising sensitive information.
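A minimal sketch of generating synthetic records for a testing environment is shown below. The schema (`case_id`, `grade`, `survival_months`, `event_observed`), the grade labels, the 36-month mean survival time, and the 60% event rate are all assumptions chosen for illustration; they do not describe Genomics England's actual synthetic data.

```python
import random

GRADES = ("G1", "G2", "G3")  # hypothetical histologic grade labels


def make_synthetic_cohort(n_cases, seed=0):
    """Return reproducible, fully synthetic survival records."""
    rng = random.Random(seed)  # local RNG so results do not depend on global state
    cohort = []
    for i in range(n_cases):
        cohort.append({
            "case_id": f"SYN-{i:05d}",  # prefix marks the record as synthetic
            "grade": rng.choice(GRADES),
            # Exponential survival times with an assumed mean of 36 months.
            "survival_months": round(rng.expovariate(1 / 36.0), 1),
            # Assumed 60% chance that the event was observed (not censored).
            "event_observed": rng.random() < 0.6,
        })
    return cohort


cohort = make_synthetic_cohort(5, seed=42)
```

Seeding the generator makes test runs repeatable, so a pipeline failure in the testing environment can be reproduced exactly without access to any sensitive data.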

Automation with CI/CD Pipelines

Automation techniques like CI/CD pipelines are crucial for reducing deployment errors and keeping environments reproducible. These pipelines run automated tests and code quality checks, then deploy artifacts into the target AWS accounts.

Conclusion

The collaboration with Genomics England highlights the potential of multi-modal machine learning in cancer research. By combining genomic and imaging data, advanced ML models can enhance cancer subtyping and survival analysis. Adopting best practices and state-of-the-art frameworks supports impactful research outcomes and pushes the boundaries of precision medicine.

Genomics England’s dedication to leveraging AWS cloud computing and cutting-edge technologies demonstrates their commitment to transforming healthcare through data-driven innovations.

Acknowledgements

The results in this post are based on data from The Cancer Genome Atlas (TCGA). Special thanks to Dr. Prabhu Arumugam, Director of Clinical Data and Imaging at Genomics England, and Francisco Azuaje, Director of Bioinformatics at Genomics England.

About the Authors

Cemre Zor, PhD: Senior Healthcare Data Scientist at AWS

Tamas Madl, PhD: Former Senior Healthcare Data Scientist at AWS

Epameinondas Fritzilas, PhD: Senior Consultant at AWS

Lou Warnett: Healthcare Data Scientist at AWS

Sam Price: Professional Services Consultant at AWS

Shreya Ruparelia: Data & AI Consultant at AWS

Pablo Nicolas Nunez Polcher, MSc: Senior Solutions Architect at AWS

Matthew Howard: Head of Healthcare Data Science at AWS

Tom Dyer: Senior Product Manager at Genomics England

Samuel Barnett: Applied Machine Learning Engineer at Genomics England

Prabhu Arumugam: Former Director of Clinical Data Imaging at Genomics England

Francisco Azuaje, PhD: Director of Bioinformatics at Genomics England
