Large deviations: Cramér vs Sanov
Posted on August 03, 2012
5 minute read
I have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.
It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…
The Cramér-Chernoff Theorem
Let \(X, X_1, ..., X_n\) be independent, identically distributed random variables, and \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\) let \(\mu(\lambda) = \log \E[e^{\lambda X}]\) be their cumulant generating function. Then many standard concentration inequalities (Hoeffding, Bernstein, Bennett) can be derived [1, Appendix A.1] from a single basic result:
\[\Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) \leq e^{-n\{\lambda a - \mu(\lambda)\}}\qquad(\lambda > 0).\]This result can easily be proved using Markov’s inequality and is non-trivial as long as \(a > \E[X]\). Its ubiquity is explained by the Cramér-Chernoff theorem [2], which states that it asymptotically achieves the best possible exponent if we optimize our choice of \(\lambda\):
\[\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) = \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\]
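To get a feel for these two statements, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the Bernoulli example and all parameter values are my own choices, not part of the argument above). For \(X \sim \text{Bernoulli}(p)\) we have \(\mu(\lambda) = \log(1 - p + p e^{\lambda})\), and the code compares a Monte Carlo estimate of the tail probability with the optimized bound \(e^{-n \sup_{\lambda > 0}\{\lambda a - \mu(\lambda)\}}\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

p, a, n = 0.3, 0.4, 50             # Bernoulli(p) variables, threshold a > E[X] = p
rng = np.random.default_rng(0)

def mu(lam):                       # cumulant generating function of Bernoulli(p)
    return np.log(1 - p + p * np.exp(lam))

# Optimized exponent: sup_{lambda > 0} { lambda*a - mu(lambda) }
res = minimize_scalar(lambda lam: -(lam * a - mu(lam)), bounds=(1e-8, 50), method="bounded")
rate = -res.fun

# Monte Carlo estimate of Pr( (1/n) * sum_i X_i >= a )
means = rng.binomial(1, p, size=(100_000, n)).mean(axis=1)
print("empirical probability:", (means >= a).mean())  # about 0.08 for these parameters
print("Chernoff bound       :", np.exp(-n * rate))    # about 0.32 for these parameters
```

For Bernoulli variables the optimized exponent works out to the binary Kullback-Leibler divergence between \(a\) and \(p\), which already hints at the connection to Sanov's theorem below.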
Sanov's Theorem
The empirical distribution \(P_n\) of \(X_1, \ldots, X_n\) gives probability \(P_n(A) = \frac{1}{n}\sum_{i=1}^n {\bf 1}_{\{X_i \in A\}}\) to any event \(A\), which is the fraction of variables taking their value in \(A\). If the distribution of the variables is \(P^*\), then asymptotically we would expect \(P_n\) to be close to \(P^*\). So if \(\mathcal{P}\) is a set of distributions that is far away from \(P^*\) in some sense, the probability that \(P_n \in \mathcal{P}\) should be small. This intuition is quantified by Sanov’s theorem [3], [4].
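In the discrete setting considered below, \(P_n\) is simply the vector of observed frequencies; a tiny sketch (assuming NumPy, with an arbitrary toy distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = np.array([0.4, 0.3, 0.2, 0.1])     # true distribution P* on {0, 1, 2, 3}
x = rng.choice(4, p=p_star, size=100)       # samples X_1, ..., X_n

p_n = np.bincount(x, minlength=4) / len(x)  # empirical distribution P_n
print("P_n =", p_n, "  vs  P* =", p_star)
```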
To avoid measure-theoretic complications, let us assume that the variables \(X_1, \ldots, X_n\) are discrete, with a finite number of possible values. Suppose \(\mathcal{P}\) is a convex set of distributions and assume that the information projection
\[Q^* = \argmin_{P \in \mathcal{P}}\ D(P\|P^*)\]of \(P^*\) on \(\mathcal{P}\) exists, where \(D(P\|P^*) = \sum_x P(x) \log \frac{P(x)}{P^*(x)}\) is the Kullback-Leibler divergence. Then \(D(Q^*\|P^*)\) provides an information-theoretic measure of the “distance” of \(\mathcal{P}\) from \(P^*\), and indeed [3]
\[\Pr\Big(P_n \in \mathcal{P}\Big) \leq e^{-nD(Q^*\|P^*)}.\]Moreover, Sanov’s theorem tells us that this bound is asymptotically tight in the exponent:
\[\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(P_n \in \mathcal{P}\Big) = D(Q^*\|P^*).\]Csiszár [3, p. 790] has an elegant information-theoretic proof of the upper bound. He also works out sufficient measure-theoretic conditions for the theorem to hold in continuous spaces, which are quite clean, but still require considerable care to verify.
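As a concrete illustration (my own sketch, not part of the argument: the four-letter alphabet, \(P^*\), the convex set \(\mathcal{P} = \{P \mid P(0) \leq 0.2\}\), and the use of SciPy's general-purpose SLSQP solver are all arbitrary choices), the following code computes the information projection numerically and compares the bound \(e^{-nD(Q^*\|P^*)}\) with a Monte Carlo estimate of \(\Pr(P_n \in \mathcal{P})\):

```python
import numpy as np
from scipy.optimize import minimize

p_star = np.array([0.4, 0.3, 0.2, 0.1])     # P* on the alphabet {0, 1, 2, 3}
n, trials = 40, 200_000
# Convex set P = { P : P(0) <= 0.2 }, i.e. letter 0 is under-represented.

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Information projection Q* = argmin_{P in P} D(P || P*)  (a convex program)
res = minimize(
    lambda p: kl(p, p_star),
    x0=np.array([0.2, 0.3, 0.3, 0.2]),       # a feasible starting point
    bounds=[(1e-12, 1.0)] * 4,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                 {"type": "ineq", "fun": lambda p: 0.2 - p[0]}],  # P(0) <= 0.2
    method="SLSQP",
)
rate = res.fun
print("Q*        :", res.x)                  # roughly (0.2, 0.4, 0.267, 0.133)
print("D(Q*||P*) :", rate)

# Monte Carlo estimate of Pr(P_n in P); only the frequency of letter 0 matters here
rng = np.random.default_rng(1)
counts0 = rng.binomial(n, p_star[0], size=trials)
print("empirical Pr(P_n in P):", (counts0 / n <= 0.2).mean())
print("Sanov upper bound     :", np.exp(-n * rate))
```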
One may extend Sanov’s theorem to non-convex sets \(\mathcal{P}\) by a union bound argument. For example, Cover and Thomas [4] take a union bound over all possible values for \(P_n\), which they call types. By discretization arguments one may further extend the theorem to infinite spaces [5], but then things get a bit too asymptotic for my taste.
From Sanov to Cramér-Chernoff
It turns out that the Cramér-Chernoff theorem can be obtained from Sanov’s theorem. (This is called contraction.) The trick is to observe that
\[\E_{X \sim P_n}[X] = \frac{1}{n} \sum_{i=1}^n X_i,\]so if we define the convex set \(\mathcal{P} = \{P \mid \E_{X \sim P}[X] \geq a\}\), then
\[\Pr\Big(P_n \in \mathcal{P}\Big) = \Pr\Big(\frac{1}{n} \sum_{i=1}^n X_i \geq a\Big).\]It remains to evaluate \(D(Q^* \| P^*)\) in this case, which can be done as follows:
Introduce a Lagrange multiplier to obtain:
\[\min_{P \in \mathcal{P}} D(P\|P^*) = \min_{\{P \mid \E_{X \sim P}[X] \geq a\}} D(P\|P^*) = \min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]Use Sion’s minimax theorem to swap the order of the min and the sup:
\[\min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} = \sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]Recognize the following convex duality for Kullback-Leibler divergence:
\[\sup_P\ \big(\E_{X \sim P}[\lambda X] - D(P\|P^*)\big) = \mu(\lambda),\]where \(\mu(\lambda) = \log \E_{P^*}[e^{\lambda X}]\) is the cumulant generating function defined above. We get:
\[\begin{split}\sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} &= \sup_{\lambda > 0}\Big\{\lambda a -\sup_P\Big(\E_P[\lambda X]- D(P\|P^*)\Big)\Big\}\\&= \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\end{split}\]Chaining everything together we exactly recover the Cramér-Chernoff theorem, and we see that the upper bounds have exactly the same constants.
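The chain of equalities can also be checked numerically. The sketch below (again with a toy alphabet, \(P^*\), and threshold of my own choosing, assuming NumPy/SciPy) computes both the constrained minimization \(\min_{P \in \mathcal{P}} D(P\|P^*)\) and the Legendre transform \(\sup_{\lambda>0}\{\lambda a - \mu(\lambda)\}\); the two printed values should agree up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

xs = np.array([0.0, 1.0, 2.0, 3.0])          # alphabet
p_star = np.array([0.4, 0.3, 0.2, 0.1])      # P*, with E_{P*}[X] = 1.0
a = 1.8                                      # threshold a > E_{P*}[X]

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Left-hand side: min { D(P || P*) : E_P[X] >= a }
proj = minimize(
    lambda p: kl(p, p_star),
    x0=np.array([0.1, 0.2, 0.3, 0.4]),       # feasible: sums to 1, mean 2.0 >= a
    bounds=[(1e-12, 1.0)] * 4,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                 {"type": "ineq", "fun": lambda p: p @ xs - a}],
    method="SLSQP",
)

# Right-hand side: sup_{lambda > 0} { lambda*a - mu(lambda) }, mu(lambda) = log E_{P*}[e^{lambda X}]
def mu(lam):
    return np.log(np.sum(p_star * np.exp(lam * xs)))

leg = minimize_scalar(lambda lam: -(lam * a - mu(lam)), bounds=(1e-8, 50), method="bounded")

print("min_P D(P||P*)            :", proj.fun)
print("sup_lambda {lambda*a - mu}:", -leg.fun)
```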
From Cramér-Chernoff to Sanov
One may also view things the other way around. The Cramér-Chernoff theorem bounds the probability that the value of the empirical mean \(\frac{1}{n} \sum_{i=1}^n X_i\) lies in the set \(A = [a,\infty)\). As discussed by [2, pp. 208-210], both the notion of empirical mean and the set \(A\) can be generalized. In particular, one may regard the empirical distribution \(P_n\) as the mean of \(n\) point-masses (i.e. Dirac measures) on the points \(X_1, \ldots, X_n\). Van der Vaart then presents Sanov’s theorem as just one instance of such generalized Cramér-Chernoff theorems.
Conclusion
We have seen the close similarities between the Cramér-Chernoff theorem and Sanov’s theorem. For me Sanov’s theorem seems easier to interpret, but in continuous spaces one has to deal with the more complicated measure-theoretic conditions of Csiszár [3]. For technical reasons it may therefore be preferable to use the Cramér-Chernoff result.
Last Words
It turns out that the upper bound in the Cramér-Chernoff theorem does leave some slack: the true probability is smaller than the bound by a multiplicative factor of order \(1/\sqrt{n}\), which is negligible compared to the exponentially decaying factor.
References
[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[2] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 1998.
[3] I. Csiszár, “Sanov Property, Generalized I-Projection and a Conditional Limit Theorem,” The Annals of Probability, vol. 12, no. 3, pp. 768–793, 1984.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, Second Edition. John Wiley & Sons, 2006.
[5] F. den Hollander, Large Deviations, Fields Institute Monographs, vol. 14. American Mathematical Society, 2000.