The purpose of these notes is to highlight the far-reaching connections between Information Theory and Statistics. Universal coding and adaptive compression are indeed closely related to statistical inference on processes using maximum likelihood or Bayesian methods. The book is divided into four chapters, the first of which introduces readers to lossless coding, provides an intrinsic lower bound on the codeword length in terms of Shannon's entropy, and presents some coding methods that can achieve this lower bound, provided the source distribution is known. In turn, Chapter 2 addresses universal coding on finite alphabets, and seeks coding procedures that can achieve the optimal compression rate regardless of the source distribution. It also quantifies the speed of convergence of the compression rate to the source entropy rate. These powerful results do not extend to infinite alphabets. Chapter 3 shows that there are no universal codes over the class of stationary ergodic sources on a countable alphabet. This negative result prompts at least two different approaches: the introduction of smaller sub-classes of sources, known as envelope classes, over which adaptive coding may be feasible, and the redefinition of the performance criterion by focusing on compressing the message pattern. Finally, Chapter 4 deals with the question of order identification in statistics. This question belongs to the class of model selection problems and arises in various practical situations in which the goal is to identify an integer characterizing the model: the length of dependency for a Markov chain, the number of hidden states for a hidden Markov chain, or the number of populations for a population mixture. The coding ideas and techniques developed in the previous chapters allow us to obtain new results in this area. The book is accessible to anyone with a graduate-level background in mathematics, and will appeal to information theoreticians and mathematical statisticians alike. Except for Chapter 4, all proofs are detailed and all tools needed to understand the text are reviewed.
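To make the entropy lower bound of Chapter 1 concrete, here is the standard statement from classical source coding theory (not the book's specific notation): for a source letter $X$ with distribution $p$ on a finite alphabet, every uniquely decodable binary code with length function $\ell$ satisfies

\[
\mathbb{E}[\ell(X)] \;\ge\; H(X) \;=\; -\sum_{x} p(x)\log_2 p(x),
\]

and when $p$ is known, codes such as Huffman or Shannon codes achieve $\mathbb{E}[\ell(X)] < H(X) + 1$.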
This book focuses on statistical inference for various combinatorial stochastic processes. Specifically, it discusses the intersection of three subjects that are generally studied independently of one another: partitions, hypergeometric systems, and Dirichlet processes. The Gibbs partition is a family of measures on integer partitions, and several prior processes, such as the Dirichlet process, naturally appear in connection with infinite exchangeable Gibbs partitions. Examples include the distribution on a contingency table with fixed marginal sums and the conditional distribution of a Gibbs partition given its length. The A-hypergeometric distribution is a class of discrete exponential families and appears as the conditional distribution of a multinomial sample from log-affine models. Its normalizing constant is the A-hypergeometric polynomial, which is a solution of a system of linear partial differential equations in several variables determined by a matrix A, called the A-hypergeometric system. The book presents inference methods based on the algebraic nature of the A-hypergeometric system, and introduces the holonomic gradient method, which numerically solves holonomic systems without combinatorial enumeration, to compute the normalizing constant. Further, it discusses Markov chain Monte Carlo and direct samplers from the A-hypergeometric distribution, as well as maximum likelihood estimation of the A-hypergeometric distribution of a two-row matrix using properties of polytopes and information geometry. The topics discussed are simple problems, but the interdisciplinary approach of this book appeals to a wide audience with an interest in statistical inference on combinatorial stochastic processes, including statisticians developing statistical theories and methodologies, mathematicians wanting to discover applications of their theoretical results, and researchers working in various fields of data science.
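For orientation, the A-hypergeometric distribution is usually written as follows (standard notation from the literature, which may differ slightly from the book's): given a $d \times n$ integer matrix $A$, a parameter vector $t = (t_1, \dots, t_n)$ with positive entries, and $b = Ax$,

\[
p(x) \;=\; \frac{t^x}{x!\, Z_A(b;t)}, \qquad
Z_A(b;t) \;=\; \sum_{\substack{x \in \mathbb{N}^n \\ Ax = b}} \frac{t^x}{x!},
\]

where $t^x = \prod_i t_i^{x_i}$ and $x! = \prod_i x_i!$; the normalizing constant $Z_A(b;t)$ is the A-hypergeometric polynomial.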
This tutorial text gives a unifying perspective on machine learning by covering both probabilistic and deterministic approaches, which are based on optimization techniques, together with the Bayesian inference approach, whose essence lies in the use of a hierarchy of probabilistic models. The book presents the major machine learning methods as they have been developed in different disciplines, such as statistics, statistical and adaptive signal processing, and computer science. Focusing on the physical reasoning behind the mathematics, all the various methods and techniques are explained in depth, supported by examples and problems, making the text an invaluable resource for students and researchers seeking to understand and apply machine learning concepts. The book builds carefully from basic classical methods to the most recent trends, with chapters written to be as self-contained as possible, making the text suitable for different courses: pattern recognition, statistical/adaptive signal processing, and statistical/Bayesian learning, as well as short courses on sparse modeling, deep learning, and probabilistic graphical models. It covers all the major classical techniques: mean/least-squares regression and filtering, Kalman filtering, stochastic approximation and online learning, Bayesian classification, decision trees, logistic regression, and boosting methods. It also covers the latest trends: sparsity, convex analysis and optimization, online distributed algorithms, learning in reproducing kernel Hilbert spaces, Bayesian inference, graphical and hidden Markov models, particle filtering, deep learning, dictionary learning, and latent variable modeling. Case studies, including protein folding prediction, optical character recognition, text authorship identification, fMRI data analysis, change-point detection, hyperspectral image unmixing, target localization, channel equalization, and echo cancellation, show how the theory can be applied. MATLAB code for all the main algorithms is available on an accompanying website, enabling the reader to experiment with the code.
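As a small illustration of the first classical technique listed above, here is a minimal ordinary least-squares regression sketch in Python (the book's own companion code is in MATLAB; the data and names here are invented for the demonstration):

import numpy as np

# Synthetic linear model y = X w + noise (all values invented for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # design matrix: 100 samples, 3 features
w_true = np.array([1.5, -2.0, 0.5])     # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=100)

# Least-squares estimate: solves min_w ||X w - y||^2 with an SVD-based routine.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                             # should be close to w_true

With 100 samples and mild noise, the recovered weights land close to the true ones, which is the basic behavior on which the book's regression and filtering chapters build.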