Sensing, Signals, and Communications Seminar: Coded Distributed Computing with Viveck R. Cadambe
The scalability of distributed computing is limited by two factors: (1) excessively slow (straggler) or faulty nodes, and (2) communication bottlenecks in the system. This talk presents coding-theoretic tools and ideas that add redundancy to data and computation, promising to tackle these challenges.
For most of the talk, we focus on the simple task of matrix multiplication. Specifically, given two matrices A, B, we consider the problem of computing the product AB over P processing nodes in a fault-tolerant manner. We assume that each node receives two matrices that are linear transformations of A and B, respectively, and multiplies the matrices it receives. The linear transforms are to be designed so that (a) they adhere to certain restrictions which model the memory/communication constraints of the processing nodes, and (b) the matrix product AB is recoverable from any K of the P nodes. We aim to maximize the fault tolerance, i.e., minimize the quantity K, under the given restrictions. We review some solutions to this problem from the high-performance computing literature of the 1980s and then present some recent polynomial-interpolation-based code constructions that outperform the classical solutions. We will also present some results on the condition number of Chebyshev-Vandermonde matrices, which pertains to the numerical stability of the code constructions.
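To make the setup concrete, the following is a minimal sketch of a polynomial-interpolation-based construction of the kind the abstract alludes to. All parameter choices (m, n, P, the evaluation points, and the block encoding) are illustrative assumptions, not details from the talk: A is split into m row blocks and B into n column blocks, each node evaluates the encoded polynomials at one point and multiplies the results, and AB is decoded from any K = m*n responses by solving a Vandermonde system.

```python
import numpy as np

# Hypothetical parameters (not from the abstract): m row blocks of A,
# n column blocks of B, P nodes; any K = m*n responses suffice to decode,
# so the scheme tolerates P - K stragglers/failures.
m, n, P = 2, 2, 5
K = m * n
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))

A_blocks = np.split(A, m, axis=0)   # row blocks A_0, ..., A_{m-1}
B_blocks = np.split(B, n, axis=1)   # column blocks B_0, ..., B_{n-1}

# One distinct evaluation point per node.
xs = np.arange(1, P + 1, dtype=float)

def node_result(x):
    """Work done by a single node: it receives the linear combinations
    A~(x) = sum_j A_j x^j and B~(x) = sum_k B_k x^(k*m),
    and returns their single product A~(x) B~(x)."""
    At = sum(Ab * x**j for j, Ab in enumerate(A_blocks))
    Bt = sum(Bb * x**(k * m) for k, Bb in enumerate(B_blocks))
    return At @ Bt

results = {i: node_result(xs[i]) for i in range(P)}

# Suppose node 4 straggles: decode from any K = 4 responses.
alive = [0, 1, 2, 3]

# The product polynomial C(x) = A~(x) B~(x) has degree K - 1, and the
# coefficient of x^(j + k*m) is exactly the block A_j B_k, so K
# evaluations determine all blocks via polynomial interpolation.
V = np.vander(xs[alive], K, increasing=True)        # Vandermonde matrix
stacked = np.stack([results[i].ravel() for i in alive])
coeffs = np.linalg.solve(V, stacked)                # interpolate

# Reassemble AB: block (j, k) of the m x n grid is coefficient j + k*m.
blk = results[alive[0]].shape
C = np.block([[coeffs[j + k * m].reshape(blk) for k in range(n)]
              for j in range(m)])
assert np.allclose(C, A @ B)
```

Note that the decoder inverts a Vandermonde matrix built from the evaluation points; for real-valued points this matrix becomes ill-conditioned as P grows, which is precisely why the condition number of (Chebyshev-)Vandermonde matrices matters for numerical stability.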