CS-ECE Colloquium: Programming Statistical Machine Learning with High-Level Knowledge
Machine learning is fundamentally changing how software is developed. Rather than programming behavior directly, many developers now curate training data and engineer features, a process that is slow, laborious, and expensive. In this talk I will describe two multi-year projects that study how high-level knowledge can be programmed more directly into statistical machine learning models. The resulting prototypes are used at dozens of major technology companies and research labs, and in collaborations with government agencies such as the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration.
The first project is Snorkel, a framework for training statistical models with multiple user-written rules instead of hand-labeled training data. This alternative supervision paradigm raises new questions in statistical machine learning, such as how to learn from noisy sources with rich dependency structures (for example, correlations among the rules), and how to estimate these structures fast enough for interactive development. Snorkel powers applications, such as reading electronic health records, that would otherwise not admit a learning approach because training data is so difficult to curate.
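As a rough illustration of supervising with user-written rules, the sketch below writes three small rules over clinical-note text and combines their votes into one noisy training label. It is a minimal sketch and not Snorkel's API: the rule names, the keyword heuristics, and the example notes are all hypothetical, and the simple majority vote stands in for the statistical model that Snorkel learns over the rules' accuracies and dependencies.

```python
from collections import Counter

# Label values: a rule may vote for a class or abstain.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Hypothetical user-written rules (keyword heuristics, for illustration only).
def lf_mentions_diagnosis(note: str) -> int:
    return POSITIVE if "diagnosed with" in note.lower() else ABSTAIN


def lf_negation(note: str) -> int:
    return NEGATIVE if "no evidence of" in note.lower() else ABSTAIN


def lf_family_history(note: str) -> int:
    return NEGATIVE if "family history" in note.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_diagnosis, lf_negation, lf_family_history]


def majority_vote(note: str) -> int:
    """Combine rule outputs by majority vote, abstaining on ties.

    This is a stand-in (assumption) for a learned model of rule
    accuracies and correlations.
    """
    votes = [lf(note) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the rules disagree evenly
    return ranked[0][0]


if __name__ == "__main__":
    notes = [
        "Patient was diagnosed with diabetes.",
        "No evidence of disease; family history noted.",
        "Routine visit.",
    ]
    for note in notes:
        print(majority_vote(note), note)
```

Majority vote treats every rule as equally reliable and independent, which is exactly the assumption that breaks down when rules are noisy or correlated.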
The second project is probabilistic soft logic (PSL), a probabilistic programming language that uses logical rules to build large-scale statistical models over structured data such as biological and social networks.
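To give a feel for how a logical rule can become a term in a statistical model, the sketch below scores one weighted rule over truth values in [0, 1], using the Lukasiewicz relaxation of conjunction and a hinge-style distance to satisfaction. This is a minimal sketch, not PSL's own syntax or API: the rule Friends(A, B) & Smokes(A) -> Smokes(B), the function names, and the weights are hypothetical choices made for illustration.

```python
def lukasiewicz_and(a: float, b: float) -> float:
    """Soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def distance_to_satisfaction(body: float, head: float) -> float:
    """How far the implication body -> head is from being satisfied."""
    return max(0.0, body - head)


def rule_penalty(friends: float, smokes_a: float, smokes_b: float,
                 weight: float = 1.0) -> float:
    """Weighted penalty for the hypothetical rule
    Friends(A, B) & Smokes(A) -> Smokes(B).
    """
    body = lukasiewicz_and(friends, smokes_a)
    return weight * distance_to_satisfaction(body, smokes_b)


if __name__ == "__main__":
    # Fully satisfied: the friend smokes and so does B.
    print(rule_penalty(1.0, 1.0, 1.0))
    # Fully violated: the body is true but the head is false.
    print(rule_penalty(1.0, 1.0, 0.0))
    # Partially violated, with a heavier rule weight.
    print(rule_penalty(1.0, 0.75, 0.25, weight=2.0))
```

Because each penalty is a piecewise-linear function of the truth values, summing such terms over every grounding of every rule gives an objective that can be minimized efficiently, which is one reason soft truth values suit large networks.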