Optimizing Mutual Information in Deep Neural Networks

The recently proposed information bottleneck (IB) theory of deep nets suggests
that during training, each layer attempts to maximize its mutual information (MI)
with the target labels (so as to allow good prediction accuracy), while minimizing
its MI with the input (leading to effective compression and thus good generalization).
To date, evidence of this phenomenon has been indirect and has aroused controversy
due to theoretical and practical complications. In particular, it has been
pointed out that the MI with the input is theoretically infinite in many cases of interest, and that the MI with the target is fundamentally difficult to estimate in high
dimensions. As a consequence, the validity of this theory has been questioned.
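For reference, the IB objective alluded to above is conventionally written as the following minimization over stochastic encodings of the input (here $T$ denotes a layer's representation and $\beta$ the standard trade-off parameter; neither symbol is defined in this abstract):

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
```

The first term is the compression penalty whose finiteness is at issue, and the second is the prediction term.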
In this paper, we overcome these obstacles by two means. First, as previously
suggested, we replace the MI with the input by a noise-regularized version, which
ensures it is finite. As we show, this modified penalty in fact acts as a form of
weight-decay regularization. Second, to obtain accurate (noise-regularized) MI
estimates between an intermediate representation and the input, we incorporate
the strong prior knowledge we have about their relation into the recently proposed
MI estimator of Belghazi et al. (2018). With this scheme, we are able to
stably train each layer independently to explicitly optimize the IB functional. Surprisingly, this leads to enhanced prediction accuracy, thus directly validating the
IB theory of deep nets for the first time.
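To make the noise-regularization idea concrete, the following is a minimal sketch (not the paper's method) of why adding Gaussian noise renders the input MI finite, and why penalizing it acts like weight decay. For a 1-D linear "layer" $t = wx + \varepsilon$ with $x \sim \mathcal{N}(0,1)$ and $\varepsilon \sim \mathcal{N}(0,\sigma^2)$, the MI has the closed form $I(x;t) = \tfrac12 \log(1 + w^2/\sigma^2)$, which is finite and monotonically increasing in $|w|$; the function names below are illustrative:

```python
import numpy as np

def noisy_linear_mi(w, sigma):
    """Closed-form MI (in nats) between x ~ N(0,1) and t = w*x + N(0, sigma^2).

    Finite for sigma > 0, and increasing in |w| -- so penalizing this MI
    discourages large weights, a weight-decay-like effect.
    """
    return 0.5 * np.log(1.0 + w**2 / sigma**2)

def mc_estimate(w, sigma, n=1_000_000, seed=0):
    """Monte Carlo check via I(x;t) = H(t) - H(t|x) for Gaussian variables."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    t = w * x + sigma * rng.standard_normal(n)
    # t is Gaussian with variance w^2 + sigma^2; conditioned on x, only the
    # injected noise remains, so H(t|x) is the entropy of N(0, sigma^2).
    h_t = 0.5 * np.log(2.0 * np.pi * np.e * t.var())
    h_t_given_x = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)
    return h_t - h_t_given_x

for w in (0.5, 1.0, 2.0):
    print(f"w={w}: analytic={noisy_linear_mi(w, 1.0):.4f}, "
          f"monte-carlo={mc_estimate(w, 1.0):.4f}")
```

Without the noise ($\sigma \to 0$) the same MI diverges to infinity, which is exactly the theoretical complication the abstract refers to.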
