Machine learning models are revolutionizing cancer outcome prediction by analyzing vast amounts of patient data to identify patterns and insights that traditional methods often miss. However, efforts to build these models are hampered by the manual extraction of key data elements from unstructured sources such as clinical notes and pathology reports, a process that is time-consuming, error-prone, and difficult to scale.
A recent study published in Nature demonstrates a promising path past these obstacles: building robust, high-performing machine learning models from clinicogenomic data by leveraging AI to automatically annotate free-text clinician notes and reports.
Using Multimodal, Structured Data to Improve Model Performance and Identify Biomarkers
The study, conducted by a team of researchers at Memorial Sloan Kettering Cancer Center (MSK), introduces the MSK-CHORD dataset, a compilation of real-world clinical, radiographic, histopathologic, laboratory, and genomic sequencing data from 24,950 patients. The researchers achieved this by combining automatically generated natural language processing (NLP) annotations from clinician notes and histopathology and radiology reports with structured treatment, survival, tumor registry, demographic, and tumor genomic data. They then used this data to build multimodal models to predict cancer outcomes, as well as to discover clinicogenomic relationships that were previously hidden within unstructured text.
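To make the integration pattern concrete, here is a minimal sketch of how a free-text report might be converted into a structured annotation and joined with structured clinical and genomic records keyed to a patient ID. The field names, toy data, and keyword-based annotator are illustrative assumptions only; the paper's annotations come from trained NLP models, not simple rules, but the join pattern of turning text into structured columns is the same general idea.

```python
# Hypothetical sketch: derive a structured annotation from free text and
# merge it with structured clinical/genomic records. Field names, data, and
# the keyword rule are illustrative only, not the MSK-CHORD pipeline.
import pandas as pd

def annotate_metastasis(report_text: str) -> int:
    """Toy annotator: flag reports that mention metastatic disease.
    A real pipeline would use trained language models that handle
    negation, abbreviations, and context."""
    keywords = ("metastasis", "metastatic")
    return int(any(k in report_text.lower() for k in keywords))

# Unstructured radiology reports (free text)
reports = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "report_text": [
        "Postsurgical changes, no new lesions identified.",
        "New hepatic lesions consistent with metastasis.",
        "Stable appearance of the primary tumor.",
    ],
})

# Structured clinical and genomic data
structured = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "stage": ["II", "IV", "I"],
    "tmb": [4.2, 11.8, 2.1],        # tumor mutational burden
    "kras_mutant": [0, 1, 0],
})

# Convert free text into a structured column, then join on patient ID
reports["metastasis_flag"] = reports["report_text"].apply(annotate_metastasis)
multimodal = structured.merge(
    reports[["patient_id", "metastasis_flag"]], on="patient_id"
)
print(multimodal)
```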
The researchers found that machine learning models trained on MSK-CHORD outperformed models based on genomic data or disease stage alone at predicting overall survival, and they proposed that integrating high-dimensional data such as pathology images into MSK-CHORD could further improve predictive performance. By analyzing MSK-CHORD, the team also identified SETD2 mutation as a promising biomarker of improved immunotherapy outcomes in patients with lung adenocarcinoma.
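To illustrate the kind of comparison involved, the sketch below fits Cox proportional-hazards models (via the lifelines library) on a genomics-only feature set and on a multimodal feature set, then compares their concordance indices on synthetic data. The features, simulated data, and model choice are assumptions made for illustration, not the study's actual modeling setup.

```python
# Illustrative comparison of survival models trained on genomic features alone
# versus multimodal features. Synthetic data and a Cox model are stand-ins;
# the study's models, features, and evaluation differ.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500

df = pd.DataFrame({
    "tmb": rng.gamma(2.0, 3.0, n),             # genomic: tumor mutational burden
    "tp53_mutant": rng.integers(0, 2, n),      # genomic: driver mutation flag
    "stage_iv": rng.integers(0, 2, n),         # clinical: late-stage flag
    "metastasis_flag": rng.integers(0, 2, n),  # NLP-derived annotation
})

# Synthetic survival times influenced by both genomic and clinical features
hazard = 0.05 * np.exp(0.03 * df["tmb"] + 0.4 * df["tp53_mutant"]
                       + 0.8 * df["stage_iv"] + 0.6 * df["metastasis_flag"])
df["duration"] = rng.exponential(1.0 / hazard)
df["event"] = rng.integers(0, 2, n)  # 1 = death observed, 0 = censored

def cindex_for(features):
    """Fit a Cox model on the given features and return its in-sample
    concordance index (a real evaluation would use held-out data)."""
    cols = features + ["duration", "event"]
    cph = CoxPHFitter().fit(df[cols], duration_col="duration", event_col="event")
    return cph.concordance_index_

print("genomic only:", round(cindex_for(["tmb", "tp53_mutant"]), 3))
print("multimodal  :", round(cindex_for(
    ["tmb", "tp53_mutant", "stage_iv", "metastasis_flag"]), 3))
```

In this toy setup, the multimodal model typically achieves a higher concordance index because the clinical and NLP-derived features carry survival signal that genomic features alone do not capture.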
Implications for Cancer Research and Therapeutic R&D
The MSK study underscores the transformative power of integrating diverse, structured data, enabling a more comprehensive and nuanced understanding of cancer. Multimodal datasets can be used not only to train high-performing predictive models that advance patient care but also to drive novel biomarker discovery and target identification, potentially expanding the reach of targeted treatments.
At Proscia, we share the vision of advancing patient care through multimodal real-world data. Our real-world data offering integrates pathology images with associated clinical and genomic data, including structured data from molecular tests, pathology reports, laboratory tests, next-generation sequencing, and image metadata. These richly characterized cohorts provide deep insights and represent diverse patient populations with sufficient size to meet R&D requirements.
The findings by the MSK team further validate Proscia’s approach to data integration: combining diverse data modalities to deliver high-quality, structured datasets that maximize research value.
Every breakthrough in multimodal data integration reaffirms its immense potential to reshape cancer research and patient care. At Proscia, we work to ensure this valuable data is available and ready for analysis by the brightest minds in cancer research and drug development with the shared goal of bringing life-saving therapies to patients sooner.
Original Paper: Jee J, Fong C, Pichotta K, et al. Automated real-world data integration improves cancer outcome prediction. Nature. 2024;636(8043):728-736. doi:10.1038/s41586-024-08167-5
Ajay Basavanhally, Ph.D., is the Team Lead: Data Engineering & Science at Proscia