ECE seminar: Towards Automated Operation of Computing Systems
Today's large-scale computers, such as high-performance computing clusters or the cloud, experience growing challenges in delivering predictable performance, as well as in efficiency, resilience, and security. Much of computer system management has traditionally relied on expert analysis and manual, hands-on diagnostics. In this talk, I will present my group's recent work on designing "automated analytics" methods for computing systems, following a longer-term vision in which complex computing systems are able to self-manage and improve. Specifically, I will first talk about how to systematically diagnose root causes of performance variations (or "anomalies") on large-scale computers, which can cause substantial efficiency losses and higher cost. Second, I will introduce machine learning-based methods to discover applications on HPC or cloud systems and discuss how such discovery can help reduce vulnerabilities and avoid unwanted applications. This talk will also highlight methods for meaningful data collection from computing systems and point out future directions in automating computing system management.