Machine Learning Seminar: Jason Eisner (Johns Hopkins CS Dept)
The language that we observe is only surface language. Beneath the surface form of a word, there are its underlying morphemes. Beneath a word sequence, there is a syntactic structure that connects the words. Even the surface punctuation is modified from some underlying punctuation. In this talk, I will describe some new computational approaches to recovering these latent structures.

In the first part of the talk, I'll consider how to recover the "underlying forms" of morphemes along with the phonological processes that modify them into pronounceable words. I will offer a probabilistic generative model of the lexicon in which the underlying forms are latent string-valued variables. This is a graphical model over strings, where inference can be computationally hard. I will outline approximate inference algorithms based on Markov chain Monte Carlo, expectation propagation, and dual decomposition.

In the second part of the talk (time permitting), I'll sketch some new work on unsupervised discovery of syntactic structure. I'll review why likelihood-based methods have failed in this setting. We have managed to avoid these traditional woes by converting the task to a supervised learning problem, trained on the Galactic Dependencies treebanks, a collection of text in 50,000 human-like artificial languages that we created.