Out-of-distribution generalization from labelled and unlabelled gene expression data for drug response prediction

Paper: https://www.nature.com/articles/s42256-021-00408-w
Github: https://github.com/hosseinshn/Velodrome

1. Summary

Predicting a patient’s drug response from their genomic profile is a crucial but difficult task, because patient datasets with drug response labels are often small and hard to access.
As a result, drug response data are available mainly for preclinical models, such as cell lines (e.g., CCLE or GDSC) or patient-derived xenografts (PDX), rather than for patients themselves.
Transfer learning has emerged as a machine learning approach for such scenarios: a model is trained on one or more available datasets (the source domain) and then used to make predictions on the dataset of interest (the target domain). Here, that means training a model on cell line drug response datasets and making predictions from patients’ omics profiles, such as genomics, transcriptomics and proteomics data.
Patient data differ significantly from cell line data (i.e., they come from a different distribution). “Out-of-distribution generalization” (OOD generalization) refers to a model’s ability to perform well on data that differ significantly from the data it was trained on.
Here, the paper proposes a new framework for drug response prediction that combines labeled and unlabeled gene expression data.
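
To make the distribution-shift problem concrete, here is a minimal toy sketch in Python. It is not taken from the paper: the data are synthetic, the "response" function is a hypothetical stand-in, and the model is plain least squares. It only illustrates that a model fitted on one input range (the source domain) can do much worse on a shifted range (the target domain).

```python
import numpy as np


def fit_linear(x, y):
    # Ordinary least squares with an intercept term.
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mse(coef, x, y):
    # Mean squared error of the fitted line on (x, y).
    design = np.column_stack([x, np.ones(len(x))])
    return float(np.mean((design @ coef - y) ** 2))


def response(x):
    # Hypothetical "true" relation between a feature and drug response.
    return np.sin(x)


rng = np.random.default_rng(0)
x_source = rng.uniform(0.0, 1.0, 200)  # training distribution
x_target = rng.uniform(2.0, 3.0, 200)  # shifted test distribution

coef = fit_linear(x_source, response(x_source))
source_error = mse(coef, x_source, response(x_source))
target_error = mse(coef, x_target, response(x_target))

print("in-distribution MSE:    ", source_error)
print("out-of-distribution MSE:", target_error)
```

The line fits well where it was trained and poorly on the shifted range, which is the gap that OOD generalization methods try to close.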

2. Key insights

To my knowledge, this may be the first paper to apply an OOD generalization method to predicting patients’ drug response.
The idea itself is therefore a highlight. The model framework is also well designed, and the paper’s case study is ingenious and worth learning from.
In the methodologies and techniques part, I focus on the model framework and the design of the model evaluation.

3. Methodologies and techniques

  • Model framework

    • Training (Fig.1):

      • Dataset (Source domain)
        • Labeled: GDSCv2, CTRPv2 (cell line drug response)
        • Unlabeled: TCGA patient (patient genomics profile)
    • Testing:

      • Dataset (Target domain)
  • Model evaluation in TCGA patient datasets
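
The outline above pairs labeled cell line data with unlabeled patient data during training. One common way to use both is to add a feature-alignment penalty to the supervised loss. The sketch below assumes a CORAL-style covariance alignment term and a mean squared error for the supervised part. This review does not spell out the paper's exact objective, so the function names, the `weight` parameter, and the choice of terms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def covariance(features):
    # Sample covariance of a (samples x dims) feature matrix.
    centered = features - features.mean(axis=0, keepdims=True)
    return centered.T @ centered / (len(features) - 1)


def alignment_loss(labeled_feats, unlabeled_feats):
    # CORAL-style penalty (an assumption here): squared Frobenius distance
    # between the covariance matrices of the two feature batches.
    diff = covariance(labeled_feats) - covariance(unlabeled_feats)
    dims = labeled_feats.shape[1]
    return float(np.sum(diff ** 2) / (4 * dims * dims))


def supervised_loss(pred, target):
    # Mean squared error on the labeled (cell line) samples only.
    return float(np.mean((pred - target) ** 2))


def total_loss(pred, target, labeled_feats, unlabeled_feats, weight=1.0):
    # Hypothetical combined objective: fit the labels while keeping the
    # features of labeled and unlabeled data statistically similar.
    return supervised_loss(pred, target) + weight * alignment_loss(
        labeled_feats, unlabeled_feats
    )
```

In a real model, `labeled_feats` and `unlabeled_feats` would be the outputs of a shared feature extractor applied to cell line and patient expression profiles. The alignment term needs no drug response labels, which is what lets the unlabeled patient data contribute to training.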

4. Critique and limitations

The field of transfer learning, particularly in computer vision, has developed rapidly in recent years. Compared with the latest domain adaptation algorithms, the deep learning model proposed in this paper may be considered relatively simple. Moreover, domain adaptation-based models often have low interpretability, which has limited their adoption in bioinformatics. Another potential improvement is to incorporate multi-omics data in addition to genomics data.
