Minimax risk for missing mass estimation
Suppose you observe a sequence of words. What is the chance that the
next word is unseen? This is essentially the problem of missing mass
estimation. It has a wide variety of applications including language
modelling and species estimation. The most popular estimator - called
the Good-Turing estimator - was discovered while Good and Turing were
trying to crack the enigma code in World War II.
We consider squared-error risk for the estimation of missing mass from N samples drawn independently from an unknown distribution. For the Good-Turing estimator, we show that the worst-case risk (over all distributions) is between 0.608/N and 0.618/N. Further, the worst-case risk for any estimator, i.e. the minimax risk, is bounded between 0.25/N and 0.618/N. We will allude to recent improvements that establish the exact worst-case risk for the Good-Turing estimator and a significant sharpening of the constant for the minimax risk.
(Joint work with Nikhilesh Rajaraman and Ananda Theertha Suresh)
Andrew Thangaraj is currently a Professor in the Department of Electrical Engineering, IIT Madras. His research interests are in the broad areas
of information theory, error-control coding and information-theoretic
aspects of cryptography.