Relating enhancer genetic variation across mammals to complex phenotypes using machine learning
Advances in genome sequencing have provided a comprehensive view of cross-species conservation across small segments of nucleotides. These conservation measures have proven invaluable for associating phenotypic variation, both within and across species, with genotypic variation at protein-coding genes or highly conserved enhancers. However, these approaches cannot be applied to the vast majority of enhancers, where the conservation of individual nucleotides is often low even when enhancer function is conserved, and where activity is tissue- or cell type-specific. To overcome these limitations, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT), in which convolutional neural network models learn the regulatory code connecting genome sequence to open chromatin in a tissue of interest. This allows us to accurately predict cases where differences in genotype are associated with differences in open chromatin in that tissue at candidate enhancer regions. We established a new set of evaluation criteria for machine learning models developed for this task and used these criteria to compare our models both to models trained using different negative sets and to conservation scores. We then developed a framework for connecting these predictions to phenotypes in a way that accounts for the phylogenetic tree. When applying our framework to motor cortex, we identified dozens of new candidate enhancers associated with the evolution of brain size and vocal learning.
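To make the sequence-to-open-chromatin idea concrete, the following is a minimal sketch of the basic operation a convolutional model applies to DNA: one-hot encoding a sequence, scanning it with a single filter, applying a ReLU, and taking the global maximum. It is not the TACIT implementation. The function names, the toy filter that rewards the 3-mer "TGA", and the bias value are all hypothetical and chosen only for illustration; a trained model would learn many filters and further layers from open chromatin data.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode DNA as one length-4 vector per base; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_relu_maxpool(encoded: List[List[float]],
                        kernel: List[List[float]],
                        bias: float = 0.0) -> float:
    """Slide one filter (width x 4) along the sequence, apply ReLU, then global max-pool."""
    width = len(kernel)
    if len(encoded) < width:
        raise ValueError("sequence is shorter than the filter")
    best = 0.0
    for start in range(len(encoded) - width + 1):
        score = bias
        for offset in range(width):
            for channel in range(4):
                score += kernel[offset][channel] * encoded[start + offset][channel]
        best = max(best, max(score, 0.0))  # ReLU, then running maximum
    return best


# Hypothetical filter (assumption, not a learned weight): rows are positions,
# columns follow BASES order (A, C, G, T). It matches the 3-mer "TGA".
TOY_KERNEL = [
    [0.0, 0.0, 0.0, 1.0],  # T
    [0.0, 0.0, 1.0, 0.0],  # G
    [1.0, 0.0, 0.0, 0.0],  # A
]


def toy_activity(seq: str) -> float:
    """Toy stand-in for a predicted open chromatin score (illustration only)."""
    return conv1d_relu_maxpool(one_hot(seq), TOY_KERNEL, bias=-2.0)


# Two hypothetical orthologous sequences that differ at a single nucleotide.
species_a = "CCTGACC"
species_b = "CCTGCCC"
print(toy_activity(species_a), toy_activity(species_b))
```

In this toy setting a single substitution removes the only full match to the filter, so the two hypothetical orthologs receive different scores. This is the kind of genotype-dependent difference in predicted activity that can then be compared with phenotypes across species.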
Contact Greg Crawford (greg.crawford at duke dot edu) and Debby Silver (debra.silver at duke dot edu) with any questions.