Machine Learning Seminar: Scalable probabilistic inference for large-scale genomic data
The cost of genome sequencing has decreased more than 100,000-fold over the last decade. This genomic revolution now enables us to measure how our genomes vary at millions of positions across millions of individuals, opening up the possibility of answering fundamental questions in human genetics. I will describe our work at the intersection of statistics, computer science, and genomics, aimed at leveraging these large-scale genomic datasets to answer questions such as how human populations evolved and which genes underlie disease. I will focus on two techniques commonly used in the analysis of human genetic data: principal components analysis (PCA) and variance components analysis. With the advent of large-scale datasets of genetic variation, there is a need for methods that perform these analyses with scalable computational and memory requirements. By leveraging randomized method-of-moments estimators and the structure of genetic variation data, we obtain sub-linear time algorithms for these problems. These algorithms allow us to efficiently estimate variance components as well as top principal components, for example, in less than an hour on genome-wide genetic variation datasets from a million individuals. Applying these methods to about half a million individuals from the UK, we obtain novel biological insights.
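To give a flavor of the randomized method-of-moments idea mentioned above, here is a minimal sketch for a single variance component: Haseman-Elston-style moment equations solved with Hutchinson-style random probes, so the trace of the squared kinship matrix K = XX'/m is estimated without ever forming K. The simulated data, dimensions, and probe count are illustrative assumptions, not the speaker's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, B = 500, 2000, 20  # individuals, SNPs, random probes

# Simulate standardized genotypes and a phenotype with heritability ~0.5
X = rng.standard_normal((n, m))
beta = rng.standard_normal(m) * np.sqrt(0.5 / m)
y = X @ beta + rng.standard_normal(n) * np.sqrt(0.5)
y = (y - y.mean()) / y.std()

# Method-of-moments equations for y ~ N(0, sg2*K + se2*I):
#   [tr(K^2)  tr(K)] [sg2]   [y'Ky]
#   [tr(K)      n  ] [se2] = [y'y ]
# with K = XX'/m. tr(K) = n for standardized X; tr(K^2) is
# approximated with Rademacher probes: tr(K^2) ~ mean_b ||K z_b||^2.
Ky = X @ (X.T @ y) / m          # K y via two matrix-vector products
trK = float(n)
trK2 = 0.0
for _ in range(B):
    z = rng.choice([-1.0, 1.0], size=n)
    Kz = X @ (X.T @ z) / m      # K z, again without forming K
    trK2 += Kz @ Kz
trK2 /= B

A = np.array([[trK2, trK], [trK, n]])
b = np.array([y @ Ky, y @ y])
sg2, se2 = np.linalg.solve(A, b)
h2 = sg2 / (sg2 + se2)          # estimated heritability, near 0.5 here
```

Each probe costs only two genotype-matrix multiplications, which is what makes this style of estimator scale to millions of individuals when combined with the sparsity and discreteness of genotype data.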