Machine Learning Seminar: Extracting Data from Tables and Charts in Natural Document Formats

Sponsor(s): Machine Learning, Bass Connections-Information, Society & Culture, Biomedical Engineering (BME), Biostatistics and Bioinformatics, Computational Biology and Bioinformatics (CBB), Computer Science, Electrical and Computer Engineering (ECE), Energy Initiative, Information Initiative at Duke (iiD), Mathematics, Social Science Research Institute (SSRI), and Statistical Science
Reception: 3 pm
Seminar: 3:30 pm
Financial analysis depends on accurate financial data, and these data are often distributed via PDF and other "natural document" formats. While these formats are optimized for human readers, extracting the data automatically can be quite challenging. We'll describe our work using a deep learning pipeline to extract data from tables and charts in PDF documents. We'll also show some of our latest research, inspired by image captioning models, on converting images of tables directly into a markup language (LaTeX) representation.
Contact: Ariel Dawn