Evaluating and Interpreting Machine Learning Outputs in Genomics Data
Abstract: This dissertation develops statistical and computational tools to evaluate and interpret machine learning outputs in genomics data. The first two projects focus on single-cell RNA-sequencing (scRNA-seq) data. In project 1, we evaluated how well widely used distribution families fit scRNA-seq UMI counts and concluded that UMI counts of polyclonal cells follow gene-specific, cell-type-specific negative binomial (NB) distributions without zero inflation. Based on this model, we proposed the working dispersion score (WDS) to select genes that are differentially expressed across cell types. In project 2, we developed a new internal (unsupervised) index, the Clustering Deviation Index (CDI), to evaluate cell label sets obtained from clustering algorithms. We conducted in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. We also benchmarked CDI against other internal indices by measuring their agreement with external indices on high-quality benchmark label sets. In addition, we demonstrated that CDI is more computationally efficient than other internal indices, especially for million-scale datasets. In project 3, we proposed a model-agnostic hypothesis testing framework to interpret feature interactions from machine-learning-predicted outcomes. Simulation studies showed that the framework achieves high power while controlling the type I error rate.
Mentor: Jichun Xie, PhD
Zoom Link: Please contact firstname.lastname@example.org for details on how to join.