Via the Fisher Information (Kirsch and Gal 2022)
\[ \require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\implicitE}[1]{\opExpectation \left [ #1 \right ]} \DeclareMathOperator{\opVar}{\mathrm{Var}} \newcommand{\Var}[2]{\opVar_{#1} \left [ #2 \right ]} \newcommand{\implicitVar}[1]{\opVar \left [ #1 \right ]} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \newcommand{\sicof}[1]{h(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat{\opp}_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\Y}{Y} \newcommand{\y}{y} \newcommand{\X}{\boldsymbol{X}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\wstar}{\boldsymbol{\theta^*}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\D}{\mathcal{D}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\indep}{\perp\!\!\!\!\perp} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} \newcommand{\specialHofHessian}[2]{\opEntropy''_{#1}[#2]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \newcommand{\specialHofJacobian}[2]{\opEntropy'_{#1}[#2]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\opCovariance}{\mathrm{Cov}} \newcommand{\Cov}[2]{\opCovariance_{#1} \left [ #2 \right ]} \newcommand{\implicitCov}[1]{\opCovariance \left [ #1 \right ]} \newcommand{\FisherInfo}{\HofHessian} \newcommand{\prelogits}{\boldsymbol{z}} \newcommand{\typeeval}{\mathrm{eval}} \newcommand{\typetest}{\mathrm{test}} \newcommand{\typetrain}{\mathrm{train}} \newcommand{\typeacq}{\mathrm{acq}} \newcommand{\typepool}{\mathrm{pool}} \newcommand{\xeval}{\x^\typeeval} \newcommand{\xtest}{\x^\typetest} \newcommand{\xtrain}{\x^\typetrain} \newcommand{\xacq}{\x^\typeacq} \newcommand{\xpool}{\x^\typepool} \newcommand{\xacqsettar}{\x^{\typeacq, *}} \newcommand{\Xeval}{\X^\typeeval} \newcommand{\Xtest}{\X^\typetest} \newcommand{\Xtrain}{\X^\typetrain} \newcommand{\Xacq}{\X^\typeacq} \newcommand{\Xpool}{\X^\typepool} \newcommand{\yeval}{\y^\typeeval} \newcommand{\ytest}{\y^\typetest} \newcommand{\ytrain}{\y^\typetrain} \newcommand{\yacq}{\y^\typeacq} \newcommand{\ypool}{\y^\typepool} \newcommand{\Ytest}{\Y^\typetest} \newcommand{\Ytrain}{\Y^\typetrain} \newcommand{\Yeval}{\Y^\typeeval} \newcommand{\Yacq}{\Y^\typeacq} \newcommand{\Ypool}{\Y^\typepool} \newcommand{\Xrate}{\boldsymbol{\mathcal{\X}}} \newcommand{\Yrate}{\mathcal{\Y}} \newcommand{\batchvar}{\mathtt{K}} \newcommand{\poolsize}{\mathtt{M}} \newcommand{\trainsize}{\mathtt{N}} 
\newcommand{\evalsize}{\mathtt{E}} \newcommand{\testsize}{\mathtt{T}} \newcommand{\numclasses}{\mathtt{C}} \newcommand{\xevalset}{{\x^\typeeval_{1..\evalsize}}} \newcommand{\xtestset}{{\x^\typetest_{1..\testsize}}} \newcommand{\xtrainset}{{\x^\typetrain_{1..\trainsize}}} \newcommand{\xacqset}{{\x^\typeacq_{1..\batchvar}}} \newcommand{\xpoolset}{{\x^\typepool_{1..\poolsize}}} \newcommand{\xacqsetstar}{{\x^{\typeacq, *}_{1..\batchvar}}} \newcommand{\Xevalset}{{\X^\typeeval_{1..\evalsize}}} \newcommand{\Xtestset}{{\X^\typetest_{1..\testsize}}} \newcommand{\Xtrainset}{{\X^\typetrain_{1..\trainsize}}} \newcommand{\Xacqset}{{\X^\typeacq_{1..\batchvar}}} \newcommand{\Xpoolset}{{\X^\typepool_{1..\poolsize}}} \newcommand{\yevalset}{{\y^\typeeval_{1..\evalsize}}} \newcommand{\ytestset}{{\y^\typetest_{1..\testsize}}} \newcommand{\ytrainset}{{\y^\typetrain_{1..\trainsize}}} \newcommand{\yacqset}{{\y^\typeacq_{1..\batchvar}}} \newcommand{\ypoolset}{{\y^\typepool_{1..\poolsize}}} \newcommand{\Ytestset}{{\Y^\typetest_{1..\testsize}}} \newcommand{\Ytrainset}{{\Y^\typetrain_{1..\trainsize}}} \newcommand{\Yevalset}{{\Y^\typeeval_{1..\evalsize}}} \newcommand{\Yacqset}{{\Y^\typeacq_{1..\batchvar}}} \newcommand{\Ypoolset}{{\Y^\typepool_{1..\poolsize}}} \newcommand{\yset}{{\y_{1..n}}} \newcommand{\Yset}{{\Y_{1..n}}} \newcommand{\xset}{{\x_{1..n}}} \newcommand{\Xset}{{\X_{1..n}}} \newcommand{\pdataof}[1]{\hpcof{\mathrm{true}}{#1}} \newcommand{\ptrainof}[1]{\hpcof{\mathrm{train}}{#1}} \newcommand{\similarityMatrix}[2]{{S_{#1}[#2]}} \newcommand{\HofJacobianData}[1]{{\hat{\opEntropy}'[#1]}} \newcommand{\HofJacobianDataShort}{\hat{\opEntropy}'} \newcommand{\Dacq}{\D^{\typeacq}} \newcommand{\Dtest}{\D^{\typetest}} \newcommand{\Dpool}{\D^{\typepool}} \newcommand{\Deval}{\D^{\typeeval}} \]
By the end of this lecture, when you see methods like BADGE, BAIT, or PRISM, you’ll immediately recognize which information-theoretic quantities they approximate and which approximations they make along the way.
You’ll move from memorizing formulas to understanding the deeper structure.
Model: Consider a linear model: \(y = \x^T \w + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, \sigma_n^2)\) is Gaussian noise with known variance \(\sigma_n^2.\) The parameters are \(\w \in \mathbb{R}^D\).
Bayesian Inference: The current posterior, based on observed data \(\D\) (which we drop from the notation for simplicity), is:
\[ \pof{\w} = \mathcal{N}(\w | \mu, \Sigma) \]
Goal: Select unlabeled data point \(\x^{acq}\) maximizing: \[ \text{EIG}(\x^{acq}) = \Hof{\W} - \E{\pof{\yacq \given \x^{acq}}}{\Hof{\W \given \yacq, \x^{acq}}} \]
Entropy of a Gaussian Distribution: For a multivariate Gaussian \(\pof{\W} = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), the differential entropy is \(\Hof{\W} = \frac{1}{2} \log \det(2\pi e \boldsymbol{\Sigma})\).
Current Parameter Entropy: The entropy of our current posterior \(\pof{\W}\) is:
\[ \Hof{\W} = \frac{1}{2} \log \det(2\pi e \Sigma) \]
If we observe \(\yacq\) for \(\x^{acq}\), the new posterior \(\pof{\w \given \yacq, \x^{acq}}\) is also Gaussian \(\mathcal{N}(\w | \mu_{new}, \Sigma_{new})\) with:
\[ \Sigma_{new}^{-1} = \Sigma^{-1} + \frac{1}{\sigma_n^2} \x^{acq} (\x^{acq})^T \]
Important
This updated covariance \(\Sigma_{new}\) does not depend on the observed value \(\yacq\). Only the mean \(\mu_{new}\) depends on \(\yacq\).
The entropy of the parameters given \(\yacq\) and \(\x^{acq}\) is \[ \Hof{\W \given \yacq, \x^{acq}} = \frac{1}{2} \log \det(2\pi e \Sigma_{new}) \]
Since \(\Sigma_{new}\) (and thus this entropy term) does not depend on \(\yacq\), the expectation over \(p(\yacq | \x^{acq})\) is trivial:
\[ \E{\pof{\yacq \given \x^{acq}}}{\Hof{\W \given \yacq, \x^{acq}}} = \frac{1}{2} \log \det(2\pi e \Sigma_{new}) \]
\[ \begin{aligned} \text{EIG}(\x^{acq}) &= \Hof{\W} - \E{\pof{\Yacq \given \x^{acq}}}{\Hof{\W \given \Yacq, \x^{acq}}} \\ &= \frac{1}{2} \log \det(2\pi e \Sigma) - \frac{1}{2} \log \det(2\pi e \Sigma_{new}) \\ &= \frac{1}{2} \log \left( \frac{\det(\Sigma)}{\det(\Sigma_{new})} \right) = \frac{1}{2} \log \det(\Sigma \Sigma_{new}^{-1}) \end{aligned} \]
Substituting \(\Sigma_{new}^{-1} = \Sigma^{-1} + \frac{1}{\sigma_n^2} \x^{acq} (\x^{acq})^T\):
\[ \begin{aligned} \text{EIG}(\x^{acq}) &= \frac{1}{2} \log \det\left(\Sigma \left(\Sigma^{-1} + \frac{1}{\sigma_n^2} \x^{acq} (\x^{acq})^T\right)\right) \\ &= \frac{1}{2} \log \det\left(I_D + \frac{1}{\sigma_n^2} \Sigma \x^{acq} (\x^{acq})^T\right) \\ &\le \frac{1}{2 \sigma_n^2} \tr \left( \Sigma \x^{acq} (\x^{acq})^T \right), \end{aligned} \] where \(I_D\) is the \(D \times D\) identity matrix and the last step uses \(\log \det(I + A) \le \tr(A)\) for positive semi-definite \(A\).
To evaluate this determinant exactly, we apply Sylvester’s determinant identity \[ \det(I_M + AB) = \det(I_N + BA) \] with:
\[ \begin{aligned} A &:= \frac{1}{\sigma_n^2} \Sigma \x^{acq} \text{ (a $D \times 1$ column vector) and } \\ B &:= (\x^{acq})^T \text{ (a $1 \times D$ row vector)} \\ BA &= \frac{1}{\sigma_n^2} (\x^{acq})^T \Sigma \x^{acq} \text{ (a scalar)} \end{aligned} \]
\[ \implies \text{EIG}(\x^{acq}) = \frac{1}{2} \log \left(1 + \frac{1}{\sigma_n^2} (\x^{acq})^T \Sigma \x^{acq}\right) \]
Writing \(v_f(\x^{acq}) := (\x^{acq})^T \Sigma \x^{acq}\) for the posterior variance of the noise-free output \(f(\x^{acq}) = (\x^{acq})^T \w\), we obtain:
\[ \begin{aligned} \text{EIG}(\x^{acq}) &= \frac{1}{2} \log \left(1 + \frac{v_f(\x^{acq})}{\sigma_n^2}\right) \\ &= \frac{1}{2} \log \left(\frac{\sigma_n^2 + v_f(\x^{acq})}{\sigma_n^2}\right) \\ &= \Hof{\Yacq \given \x^{acq}} - \Hof{\Yacq \given \x^{acq}, \W} \\ &= \MIof{\Yacq; \W \given \x^{acq}}. \end{aligned} \]
Hence, for this Gaussian model, EIG coincides with BALD (Bayesian Active Learning by Disagreement) (Houlsby et al. 2011): \(\text{EIG}(\x^{acq}) = \MIof{\Yacq; \W \given \x^{acq}}\).
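Before moving on, here is a minimal NumPy sketch (all dimensions, values, and variable names are illustrative choices of mine) verifying that the closed-form EIG matches the entropy difference \(\Hof{\W} - \Hof{\W \given \yacq, \x^{acq}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
D, sigma_n2 = 5, 0.1                      # parameter dimension, noise variance (illustrative)

# Current Gaussian posterior N(mu, Sigma) over the weights.
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)               # any symmetric positive-definite covariance
x_acq = rng.normal(size=D)                # candidate acquisition point

# Entropy difference: H[W] - H[W | y_acq, x_acq] (Sigma_new is independent of y_acq).
Sigma_new = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(x_acq, x_acq) / sigma_n2)
eig_entropy_diff = 0.5 * (np.linalg.slogdet(Sigma)[1] - np.linalg.slogdet(Sigma_new)[1])

# Closed form via Sylvester's identity: 1/2 log(1 + x^T Sigma x / sigma_n^2).
eig_closed_form = 0.5 * np.log1p(x_acq @ Sigma @ x_acq / sigma_n2)

print(eig_entropy_diff, eig_closed_form)  # agree up to numerical error
```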
Next: Laplace as Gaussian approximation.
Using a second-order Taylor expansion of the information content \(\Hof{\w} = -\log \pof{\w}\) around \(\wstar\):
\[ \begin{aligned} \Hof{\w} \approx \Hof{\wstar} &+ \nabla_{\w} \Hof{\wstar} (\w - \wstar) \\ &+ \frac{1}{2}(\w - \wstar)^T \nabla_{\w}^2 \Hof{\wstar} (\w - \wstar) \end{aligned} \]
Simplifying notation with \(\HofJacobian{\wstar} = \nabla_{\w} \Hof{\wstar}\) and \(\HofHessian{\wstar} = \nabla_{\w}^2 \Hof{\wstar}\):
\[ \begin{aligned} \Hof{\w} \approx \Hof{\wstar} &+ \HofJacobian{\wstar} (\w - \wstar) \\ &+ \frac{1}{2}(\w - \wstar)^T \HofHessian{\wstar} (\w - \wstar) \end{aligned} \]
We can rewrite this by completing the square:
\[ \begin{aligned} \Hof{\w} &\approx \frac{1}{2}\left(\w - (\wstar - \HofHessian{\wstar}^{-1}\HofJacobian{\wstar}^T)\right)^T \HofHessian{\wstar} \\ &\quad \quad \left(\w - (\wstar - \HofHessian{\wstar}^{-1}\HofJacobian{\wstar}^T)\right) + \text{const.} \end{aligned} \]
This resembles the information content of a Gaussian distribution:
\[-\log \mathcal{N}(\w | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{2}(\w - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\w - \boldsymbol{\mu}) + \text{const.}\]
This gives us the Gaussian approximation:
\[ \w \overset{\approx}{\sim} \mathcal{N}(\wstar - \HofHessian{\wstar}^{-1}\HofJacobian{\wstar}^T, \HofHessian{\wstar}^{-1}) \]
If \(\wstar\) is a (local) minimizer of \(\Hof{\w}\), then \(\HofJacobian{\wstar} = 0\), yielding the Laplace approximation:
\[ \w \overset{\approx}{\sim} \mathcal{N}(\wstar, \HofHessian{\wstar}^{-1}) \]
For the Gaussian approximation, the Hessian matrix \(\HofHessian{\wstar}\) is the precision matrix (inverse covariance) of the approximate posterior:
\[\Sigma^{-1} = \HofHessian{\wstar}\]
Therefore, the entropy of our Gaussian approximation is:
\[\Hof{\W} \approx -\frac{1}{2}\log\det\HofHessian{\wstar} + \frac{D}{2}\log(2\pi e)\]
where \(D\) is the dimensionality of \(\w\).
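As a sanity check, here is a small NumPy sketch of the Laplace approximation on a toy logistic-regression posterior. The data, labels, and prior precision are arbitrary placeholders of mine, and the MAP estimate is found with a few Newton steps:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = (rng.random(N) < 0.5).astype(float)     # toy binary labels
prior_prec = 1.0                            # isotropic Gaussian prior with precision prior_prec

def neg_log_posterior_grad_hess(w):
    """Gradient and Hessian of H[w] = -log p(w | D) (up to a constant) for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y) + prior_prec * w
    hess = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec * np.eye(D)
    return grad, hess

# A few Newton steps to find the MAP estimate w*.
w = np.zeros(D)
for _ in range(25):
    g, H = neg_log_posterior_grad_hess(w)
    w -= np.linalg.solve(H, g)

# Laplace approximation: p(w | D) ≈ N(w*, H''[w*]^{-1}), with the entropy formula above.
_, H_star = neg_log_posterior_grad_hess(w)
Sigma_laplace = np.linalg.inv(H_star)
entropy = 0.5 * np.linalg.slogdet(2 * np.pi * np.e * Sigma_laplace)[1]
print(w, entropy)
```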
When we apply Bayes’ rule to parameter inference:
\[ \pof{\w\given\D} = \frac{\pof{\D\given\w}\pof{\w}}{\pof{\D}} \]
Taking the negative log and differentiating twice:
\[ \begin{aligned} \HofHessian{\w \given \D} &= -\nabla^2_\w \log \pof{\w\given\D} \\ &= -\nabla^2_\w \log \left[\frac{\pof{\D\given\w}\pof{\w}}{\pof{\D}}\right] \end{aligned} \]
Using the additivity of logarithms:
\[ \begin{aligned} \HofHessian{\w \given \D} &= -\nabla^2_\w \left[\log \pof{\D\given\w} + \log \pof{\w} - \log \pof{\D}\right] \end{aligned} \]
Since \(\pof{\D}\) is constant with respect to \(\w\), its Hessian is zero:
\[ \HofHessian{\w \given \D} = -\nabla^2_\w \log \pof{\D\given\w} - \nabla^2_\w \log \pof{\w} \]
This gives us Bayes’ rule for Hessians:
\[ \HofHessian{\w \given \D} = \HofHessian{\D \given \w} + \HofHessian{\w} = \HofHessian{\w, \D}, \]
where the last equality holds because \(\log \pof{\D}\) does not depend on \(\w\), so the Hessian of the joint information content coincides with that of the posterior.
In terms of precision matrices for our Gaussian approximation:
\[ \Sigma_{\text{posterior}}^{-1} = \Sigma_{\text{likelihood}}^{-1} + \Sigma_{\text{prior}}^{-1} \]
(treating the likelihood as an unnormalized Gaussian in \(\w\))
\(\HofHessian{\D \given \w}\) is also called the observed information. It is the Hessian of the negative log-likelihood and is additive across data points:
\[ \HofHessian{\D \given \w} = \sum_{i=1}^N \HofHessian{y_i \given \w, \x_i}. \]
The Fisher Information Matrix (FIM) is the expected observed information, i.e., the Hessian of the negative log-likelihood averaged over the model’s predictive distribution:
\[ \HofHessian{Y \given \x, \w }= \E{\pof{y \given \x, \w}}{ \HofHessian{y \given \x, \w}}. \]
We have the following identities:
\[ \begin{aligned} \HofHessian{Y \given \x, \w} &= \E{\pof{Y \given \x, \w}}{ \HofJacobian{y \given \x, \w}^T \HofJacobian{y \given \x, \w} } \\ &= \implicitCov{\HofJacobian{Y \given \x, \w}^T} \\ &\succeq 0 \quad \text{(positive semi-definite)} \end{aligned} \]
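The covariance identity above can be checked numerically. A minimal sketch for a categorical likelihood with the logits as natural parameters (the logit values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=4)                      # logits (natural parameters)
pi = np.exp(z - z.max()); pi /= pi.sum()    # softmax probabilities

# Observed information -∇²_z log p(y|z) = diag(pi) - pi pi^T (independent of y here).
hess = np.diag(pi) - np.outer(pi, pi)

# Covariance of the score ∇_z log p(y|z) = onehot(y) - pi under y ~ p(y|z).
scores = np.eye(len(pi)) - pi               # row c is the score for y = c
fisher = sum(p * np.outer(s, s) for p, s in zip(pi, scores))

print(np.allclose(hess, fisher))            # True: expected Hessian = covariance of the score
```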
Now consider \(\pof{y \given \prelogits=f(x;\w)}\), with \(f(x;\w)\) the logits (natural parameters) and \(\pof{y \given \prelogits}\) a member of the exponential family:
\[ \log \pof{y \given \prelogits} = \prelogits^T T(y) - A(\prelogits) + \log h(y) \]
Then the Fisher information has a closed form that is independent of \(y\):
\[\FisherInfo{Y \given x, \w} = \nabla_\w f(x; \w)^T \, {\nabla_\prelogits^2 A(\prelogits) |_{\prelogits=f(x ; \w)}} \, \nabla_\w f(x; \w)\]
This simplifies computation: no \(\opExpectation\) over \(y\) is needed!
For regression with \(\pof{y \given \prelogits} = \mathcal{N}(y \mid \prelogits, 1)\), we have \(\nabla_\prelogits^2 A(\prelogits) = I\), so the Fisher information simplifies to:
\[\FisherInfo{Y \given x, \w} = \nabla_\w f(x; \w)^T \, \nabla_\w f(x; \w)\]
This is simply the outer product of the gradients of the network outputs with respect to the parameters
For classification with \(\pof{y \given \prelogits} = \text{softmax}(\prelogits)_y\), we have:
\(\nabla_\prelogits^2 A(\prelogits) = \text{diag}(\pi) - \pi\pi^T\), where \(\pi_y=\pof{y \given \prelogits}\)
Thus:
\[ \FisherInfo{Y \given x, \w} = \nabla_\w f(x; \w)^T \, (\text{diag}(\pi) - \pi\pi^T) \, \nabla_\w f(x; \w) \]
This captures the geometry of parameter space for classification problems
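As a concrete illustration of this formula, a short NumPy sketch for a last-layer softmax model; the weights and input are arbitrary, and the Jacobian of the logits is computed analytically rather than with autodiff:

```python
import numpy as np

rng = np.random.default_rng(3)
C, D = 3, 4                                  # classes, feature dimension
W = rng.normal(size=(C, D))                  # last-layer weights (the parameters here)
x = rng.normal(size=D)

z = W @ x                                    # logits f(x; w)
pi = np.exp(z - z.max()); pi /= pi.sum()

# Jacobian of the logits w.r.t. vec(W): dz_c / dW[c', d] = delta_{c c'} x_d.
J = np.kron(np.eye(C), x[None, :])           # shape (C, C*D)

# Fisher information: J^T (diag(pi) - pi pi^T) J -- no label y needed.
F = J.T @ (np.diag(pi) - np.outer(pi, pi)) @ J
print(F.shape)                               # (C*D, C*D), positive semi-definite
```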
Definition: A GLM combines an exponential-family likelihood \(\pof{y \given \prelogits}\) with natural parameters that are linear in the inputs (or in fixed features), \(\prelogits = \w^T \x\).
Key insight: Not only is Fisher information independent of \(y\) for a GLM, but the observed information is too!
\[ \HofHessian{y \given \x, \w} = \nabla_\w \prelogits^T \, \nabla_\prelogits^2 A(\prelogits) \, \nabla_\w \prelogits, \qquad \prelogits = \w^T \x, \] where \(\nabla_\w \prelogits\) depends only on \(\x\), not on \(\w\) or \(y\).
For GLMs, we have the remarkable equality:
\[\FisherInfo{Y \given x, \wstar} = \HofHessian{y \given x, \wstar}\]
for any \(y\) (the observed information is independent of \(y\))
This means that averaging the observed information over the true label distribution gives the same result as averaging over the model’s predictive distribution.
Why this matters: We would often like an expectation over the true data distribution, but we only have the model’s predictive distribution!
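A quick numerical check of this equality for a softmax GLM on a tiny toy problem (sizes and values are mine; the finite-difference Hessian is for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
C, D = 3, 2
W = rng.normal(size=(C, D)); x = rng.normal(size=D)

def nll(w_flat, y):
    """Negative log-likelihood -log softmax(W x)_y as a function of vec(W)."""
    z = w_flat.reshape(C, D) @ x
    return np.logaddexp.reduce(z) - z[y]

def fd_hessian(f, w0, eps=1e-4):
    """Central finite-difference Hessian (illustration only)."""
    n = w0.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(w0 + ei + ej) - f(w0 + ei - ej)
                       - f(w0 - ei + ej) + f(w0 - ei - ej)) / (4 * eps**2)
    return H

# Fisher information of the softmax GLM: J^T (diag(pi) - pi pi^T) J, no label needed.
z = W @ x
pi = np.exp(z - z.max()); pi /= pi.sum()
J = np.kron(np.eye(C), x[None, :])
fisher = J.T @ (np.diag(pi) - np.outer(pi, pi)) @ J

# The observed information H''[y | x, w] is the same matrix for every label y.
for y in range(C):
    print(y, np.allclose(fd_hessian(lambda w: nll(w, y), W.ravel()), fisher, atol=1e-5))
```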
When we have an exponential family but not a GLM, we often approximate:
\[\HofHessian{y \given x, \wstar} \approx \FisherInfo{Y \given x, \wstar}\]
This is the Generalized Gauss-Newton (GGN) approximation
Treat only the last layer as parameters
Write the model as \(p(y|x,\w) = p(y|\prelogits = \w^T \phi(x))\), where \(\phi(x)\) is the fixed embedding produced by the earlier layers and \(\w\) are the last-layer weights.
This creates a GLM on top of fixed embeddings
This is a practical simplification that preserves many of the benefits.
Used by BADGE, BAIT, PRISM, SIMILAR in practice
Important
The special structure of GLMs and exponential families bridges the gap between active learning (which doesn’t know labels) and active sampling (which does).
For these models, acquisition functions that seem different are actually optimizing the same objectives!
Start with the EIG decomposition:
\[ \begin{aligned} \MIof{\W; \Yacq \given \xacq} &= \Hof{\W} - \Hof{\W \given \Yacq, \xacq} \\ &= \Hof{\W} - \E{\pof{\yacq \given \xacq}}{\Hof{\W \given \yacq, \xacq}} \end{aligned} \]
(I’ll drop the \(\pof{\yacq \given \xacq}\) from the expectation from now on.)
Applying the Gaussian approximation:
\[ \begin{aligned} &\approx -\tfrac{1}{2}\log \det \HofHessian{\wstar} - \E{}{-\tfrac{1}{2}\log \det \HofHessian{\wstar \given \Yacq, \xacq}} \\ &= \tfrac{1}{2}\E{}{\log \det \HofHessian{\wstar \given \Yacq, \xacq} \, \HofHessian{\wstar}^{-1}} \\ &= \tfrac{1}{2}\E{}{\log \det (\HofHessian{\Yacq \given \xacq, \wstar} + \HofHessian{\wstar}) \, \HofHessian{\wstar}^{-1}} \\ &= \tfrac{1}{2}\E{}{\log \det \left(\HofHessian{\Yacq \given \xacq, \wstar}\HofHessian{\wstar}^{-1} + I\right)} \end{aligned} \]
For GLMs (or with GGN approximation), we can simplify further (for any \(\yacq\)):
\[ \begin{aligned} &\MIof{\W; \Yacq \given \xacq} \\ &\quad \approx \tfrac{1}{2}\log \det\left(\FisherInfo{\yacq \given \xacq, \wstar}\HofHessian{\wstar}^{-1} + I\right) \end{aligned} \]
We can also use the trace approximation:
\[ \begin{aligned} \MIof{\W; \Yacq \given \xacq} &\leq \tfrac{1}{2}\tr\left(\FisherInfo{\Yacq \given \xacq, \wstar}\HofHessian{\wstar}^{-1}\right) \end{aligned} \]
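To see the two approximations side by side, a minimal NumPy sketch in which arbitrary PSD matrices stand in for the candidate’s Fisher information and the current posterior precision:

```python
import numpy as np

rng = np.random.default_rng(5)
P = 6                                                        # number of (last-layer) parameters
A = rng.normal(size=(P, P)); H_prior = A @ A.T + np.eye(P)   # H''[w*]: current posterior precision
g = rng.normal(size=(P, 3))                                  # low-rank factor for one candidate
F_acq = g @ g.T                                              # candidate Fisher information (PSD)

M = F_acq @ np.linalg.inv(H_prior)
eig_logdet = 0.5 * np.linalg.slogdet(np.eye(P) + M)[1]  # ≈ EIG under the Gaussian/GGN approximation
eig_trace = 0.5 * np.trace(M)                           # upper bound: log det(I + M) <= tr(M)
print(eig_logdet, eig_trace, eig_logdet <= eig_trace + 1e-12)
```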
Remember our solution for Bayesian linear regression:
\[ \text{EIG}(\x^{acq}) = \frac{1}{2} \log \det\left(I_D + \Sigma \x^{acq} \frac{1}{\sigma_n^2} (\x^{acq})^T\right) \]
Compare to our general approximation:
\[ \approx \tfrac{1}{2}\log \det\left(\FisherInfo{\Yacq \given \xacq, \wstar}\HofHessian{\wstar}^{-1} + I\right) \]
For linear regression, our approximation is exact (of course)!
Important
Because the trace is additive over candidates, the trace approximation ignores redundancies between batch samples (the same issue as “top-k” BALD).
EPIG is the mutual information between predictions at acquisition and evaluation points:
\[ \MIof{\Yeval; \Yacq \given \Xeval, \xacq} \]
First, we note that maximizing this is equivalent to minimizing the remaining information about \(\W\): since \(\Yeval \indep \Yacq \given \W\), we have \(\MIof{\Yeval; \Yacq \given \Xeval, \xacq} = \MIof{\W; \Yeval \given \Xeval} - \MIof{\W; \Yeval \given \Xeval, \Yacq, \xacq}\), and the first term does not depend on \(\xacq\):
\[ \begin{aligned} &\arg \max_{\xacq} \MIof{\Yeval; \Yacq \given \Xeval, \xacq} \\ &\quad = \arg \min_{\xacq} \MIof{\W; \Yeval \given \Xeval, \Yacq, \xacq} \end{aligned} \]
For GLMs, we can obtain:
\[ \begin{aligned} &\MIof{\W; \Yeval \given \Xeval, \Yacq, \xacq} \\ &\quad \approx \tfrac{1}{2}\opExpectation_{\pdataof{\xeval}} \left [ \log \det\left(\FisherInfo{\Yeval \given \xeval, \wstar} \right. \right . \\ &\quad \quad \quad \left. \left. (\FisherInfo{\Yacq \given \xacq, \wstar} + \HofHessian{\wstar})^{-1} + I\right) \right ] \end{aligned} \]
Using Jensen’s inequality:
\[ \begin{aligned} &\leq \tfrac{1}{2}\log \det\left(\opExpectation_{\pdataof{\xeval}} \left [ \FisherInfo{\Yeval \given \xeval, \wstar}\right ] \right. \\ &\quad \quad \left. (\FisherInfo{\Yacq \given \xacq, \wstar} + \HofHessian{\wstar})^{-1} + I\right) \end{aligned} \]
We can further approximate using the trace:
\[ \begin{aligned} &\MIof{\W; \Yeval \given \Xeval, \Yacq, \xacq} \\ &\leq \tfrac{1}{2}\tr\left(\opExpectation_{\pdataof{\xeval}} \left [ \FisherInfo{\Yeval \given \xeval, \wstar}\right ] \right. \\ &\quad \quad \left. (\FisherInfo{\Yacq \given \xacq, \wstar} + \HofHessian{\wstar})^{-1}\right) \end{aligned} \]
This is the core of the BAIT objective in “Gone Fishing” (Ash et al., 2021).
For GLMs or with GGN approximation, active learning and active sampling objectives are identical
\[ \FisherInfo{\Yacq \given \xacq, \wstar} = \HofHessian{y \given \xacq, \wstar} \]
for any \(\yacq\) (as observed information is independent of \(\yacq\))
This means label knowledge gives no advantage when using these approximations
Many methods use similarity matrices based on gradients of the loss (BADGE, SIMILAR, PRISM)
The similarity matrix is constructed from gradient embeddings: \[ \similarityMatrix{}{\D \given \w}_{ij} = \langle \HofJacobian{\y_i \given \x_i, \wstar}, \HofJacobian{\y_j \given \x_j, \wstar} \rangle \]
If we organize these gradients into a “data matrix”: \[ \HofJacobianData{\D \given \wstar} = \begin{pmatrix} \vdots \\ \HofJacobian{y_i \given x_i, \wstar} \\ \vdots \end{pmatrix} \]
\(\HofJacobianData{\D \given \wstar} \HofJacobianData{\D \given \wstar}^T\) yields the similarity matrix
\(\HofJacobianData{\D \given \wstar}^T \HofJacobianData{\D \given \wstar}\) gives a one-sample estimate of the FIM
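A tiny sketch of this duality, with random gradients standing in for real loss gradients: the two Gram matrices share their non-zero spectrum, which is why log-det objectives can be evaluated in either space.

```python
import numpy as np

rng = np.random.default_rng(7)
n, P = 5, 12
G = rng.normal(size=(n, P))        # rows: per-example loss gradients (random placeholders here)

S = G @ G.T                        # n x n similarity matrix of gradient embeddings
F_hat = G.T @ G                    # P x P one-sample estimate of the Fisher information

# The non-zero eigenvalues of G G^T and G^T G coincide.
print(np.sort(np.linalg.eigvalsh(S))[-n:])
print(np.sort(np.linalg.eigvalsh(F_hat))[-n:])
```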
Using the Matrix Determinant Lemma: \[ \det(AB + M) = \det(M)\det(I + BM^{-1}A) \]
We can rewrite our EIG approximation: \[ \begin{aligned} &\MIof{\W; \Yacqset \given \xacqset} \\ &\quad \overset{\approx}{\leq} \tfrac{1}{2}\log\det(\FisherInfo{\Yacqset \given \xacqset, \wstar}\HofHessian{\wstar}^{-1} + I) \\ &\quad = \tfrac{1}{2}\log (\det(\HofJacobianData{\Dacq \given \wstar}^T \HofJacobianData{\Dacq \given \wstar} + \HofHessian{\wstar}) \\ &\quad \quad \det(\HofHessian{\wstar}^{-1})) \end{aligned} \]
Into a similarity matrix form: \[ \MIof{\W; \Yacqset \given \xacqset} \overset{\approx}{\leq} \tfrac{1}{2}\log\det(\similarityMatrix{\HofHessian{\wstar}}{\Dacq \given \wstar} + I) \]
where \[ \begin{aligned} &\similarityMatrix{\HofHessian{\wstar}}{\Dacq \given \wstar} \\ &\quad = \HofJacobianData{\Dacq \given \wstar}\HofHessian{\wstar}^{-1}\HofJacobianData{\Dacq \given \wstar}^T \end{aligned} \]
This relies on a one-sample estimate of the FIM: \[\FisherInfo{\Yset \given \xset, \wstar} \approx \HofJacobianData{\D \given \wstar}^T \HofJacobianData{\D \given \wstar}\]
This is problematic because a single (pseudo-)label per data point stands in for the expectation over \(\pof{y \given \x, \wstar}\) in the FIM, so the estimate can be noisy or biased.
For an uninformative prior (\(\HofHessian{\wstar} = \lambda I\) as \(\lambda \to 0\)), we get: \[ \MIof{\W; \Yacqset \given \xacqset} \overset{\approx}{\leq} \tfrac{1}{2}\log\det(\similarityMatrix{}{\Dacq \given \wstar}) \]
Many methods (BADGE, LogDet in SIMILAR/PRISM) optimize this objective
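A minimal greedy sketch of such a log-det batch objective, assuming precomputed gradient embeddings (random placeholders here) and, for simplicity, an identity prior precision so that the batch score is \(\tfrac{1}{2}\log\det(S_{\mathrm{batch}} + I)\); this is an illustration, not any method’s reference implementation:

```python
import numpy as np

rng = np.random.default_rng(8)
n_pool, P, k = 30, 16, 5
G = rng.normal(size=(n_pool, P))                    # pool gradient embeddings (random placeholders)
S = G @ G.T                                         # full pool similarity matrix

selected = []
for _ in range(k):
    best_i, best_val = None, -np.inf
    for i in range(n_pool):
        if i in selected:
            continue
        idx = selected + [i]
        # Batch score: 1/2 log det(S_batch + I), the regularized log-det objective.
        val = 0.5 * np.linalg.slogdet(S[np.ix_(idx, idx)] + np.eye(len(idx)))[1]
        if val > best_val:
            best_i, best_val = i, val
    selected.append(best_i)
print(selected)
```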
With an uninformative prior: \[ \begin{aligned} &\MIof{\Yevalset; \Yacqset \given \xevalset, \xacqset} \\ &\quad \approx \tfrac{1}{2}\log\det(\similarityMatrix{}{\Deval \given \wstar}) \\ &\quad \quad - \tfrac{1}{2}\log\det(\similarityMatrix{}{\Dacq, \Deval \given \wstar}) \\ &\quad \quad + \tfrac{1}{2}\log\det(\similarityMatrix{}{\Dacq \given \wstar}) \end{aligned} \]
The first term is constant with respect to \(\xacqset\) and can be dropped
With an uninformative prior: \[ \begin{aligned} &\MIof{\Yeval; \Yacqset \given \Xeval, \xacqset} \\ &\quad \approx \E{\pdataof{\xeval}}{\tfrac{1}{2}\log\det(\similarityMatrix{}{\xeval \given \wstar})} \\ &\quad \quad - \E{\pdataof{\xeval}}{\tfrac{1}{2}\log\det(\similarityMatrix{}{\Dacq, \xeval \given \wstar})} \\ &\quad \quad + \tfrac{1}{2}\log\det(\similarityMatrix{}{\Dacq \given \wstar}) \end{aligned} \]
This connects directly to the LogDetMI objective in SIMILAR/PRISM
Non-transductive (comparing across all information quantities):
\[ \begin{aligned} \log \det &\left(\left\{\begin{aligned} \FisherInfo{\Yacqset \given \xacqset, \wstar} & \text{ (EIG, GGN/GLM)} \\ \HofHessian{\yacqset \given \xacqset, \wstar} & \text{ (IG)} \end{aligned}\right\}\right.\\ &\left.\left(\left\{\begin{aligned} \FisherInfo{\Yset \given \xset, \wstar} & \text{ GGN/GLM} \\ \HofHessian{\yset \given \xset, \wstar} & \end{aligned}\right\} + \HofHessian{\wstar}\right)^{-1} + I\right) \end{aligned} \]
Uses k-means++ on gradient embeddings computed from hard pseudo-labels
k-means++ approximately maximizes log-determinant (k-DPP)
\[ \log \det(\similarityMatrix{}{\Dacq \given \wstar}) \approx \MIof{\W; \Yacqset \given \xacqset} \]
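A sketch of k-means++ seeding on gradient embeddings in the spirit of BADGE; the embeddings below are random placeholders, whereas in BADGE they would be last-layer loss gradients under hard pseudo-labels:

```python
import numpy as np

def kmeans_pp_select(G, k, rng):
    """k-means++ seeding: rows of G are per-example gradient embeddings."""
    selected = [rng.integers(len(G))]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-selected centre.
        d2 = np.min(((G[:, None, :] - G[selected][None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()                  # D^2 sampling favours diverse, large gradients
        selected.append(rng.choice(len(G), p=probs))
    return selected

rng = np.random.default_rng(9)
G = rng.normal(size=(100, 8))                  # placeholder for (softmax(z) - onehot(pseudo_label)) ⊗ embedding
print(kmeans_pp_select(G, 5, rng))
```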
Directly implements trace approximation of (J)EPIG:
\[ \begin{aligned} \arg\min_{\xacqset} \tr &\left( (\FisherInfo{\Yacqset \given \xacqset, \wstar} + \FisherInfo{\Ytrain \given \xtrain, \wstar} \right.\\ &\left. + \lambda I)^{-1} \FisherInfo{\Yevalset \given \xevalset, \wstar} \right) \end{aligned} \]
Uses last-layer Fisher information
Transductive: uses evaluation set \(\xevalset\) (typically the pool set)
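A greedy forward-selection sketch of such a trace objective (BAIT itself uses a forward–backward procedure; the rank-one Fisher factors, dimensions, and regularizer below are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(6)
P, n_pool, n_eval, k, lam = 8, 40, 20, 5, 1e-2
G_pool = rng.normal(size=(n_pool, P))          # per-example gradient factors (random placeholders)
G_eval = rng.normal(size=(n_eval, P))
F_eval = G_eval.T @ G_eval / n_eval            # Monte Carlo estimate of E_{x_eval}[Fisher info]
F_train = np.zeros((P, P))                     # Fisher info of already-labelled data (empty here)

selected, F_batch = [], np.zeros((P, P))
for _ in range(k):
    best_i, best_val = None, np.inf
    for i in range(n_pool):
        if i in selected:
            continue
        F_cand = F_batch + np.outer(G_pool[i], G_pool[i])
        # Objective: tr((F_batch + F_train + lam I)^{-1} F_eval), to be minimized.
        val = np.trace(np.linalg.solve(F_cand + F_train + lam * np.eye(P), F_eval))
        if val < best_val:
            best_i, best_val = i, val
    selected.append(best_i)
    F_batch += np.outer(G_pool[best_i], G_pool[best_i])
print(selected)
```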
Revealed: transductive active learning
For transductive objectives (LogDetMI): \[ \begin{aligned} &\log \det \similarityMatrix{}{\Dacq \given \wstar} \\ &\quad - \log \det \left(\similarityMatrix{}{\Dacq \given \wstar} - \similarityMatrix{}{\Dacq ; \Deval \given \wstar} \similarityMatrix{}{\Deval \given \wstar}^{-1} \similarityMatrix{}{\Deval ; \Dacq \given \wstar}\right) \end{aligned} \]
This approximates JEPIG!
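A numerical check of this equivalence with random gradient embeddings (the \(\tfrac{1}{2}\) factors are dropped on both sides): the Schur-complement form above equals the three-term log-det expression from the JEPIG slide.

```python
import numpy as np

rng = np.random.default_rng(10)
P, n_acq, n_eval = 20, 4, 6
G_acq = rng.normal(size=(n_acq, P))
G_eval = rng.normal(size=(n_eval, P))
G_joint = np.vstack([G_acq, G_eval])

S_a, S_e = G_acq @ G_acq.T, G_eval @ G_eval.T
S_ae = G_acq @ G_eval.T
S_joint = G_joint @ G_joint.T

ld = lambda M: np.linalg.slogdet(M)[1]

# LogDetMI-style form: log det S_acq - log det(Schur complement of S_eval in the joint matrix).
lhs = ld(S_a) - ld(S_a - S_ae @ np.linalg.solve(S_e, S_ae.T))
# JEPIG-style form: log det S_eval - log det S_joint + log det S_acq.
rhs = ld(S_e) - ld(S_joint) + ld(S_a)
print(np.allclose(lhs, rhs))   # True
```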
Important
Both connect to information gain through diagonal FIM approximations!
All are approximating the same few information-theoretic quantities!
Example: Trace approximations ignore batch redundancies → pathologies
High rank correlation between methods:
Weight-space methods preserve relative ranking!
Figure 2: EIG Approximations. Trace and log det approximations match for small scores. They diverge for large scores. Qualitatively, the order matches the prediction-space approximation using BALD with MC dropout.
Figure 3: (J)EPIG Approximations (Normalized). The scores match qualitatively. Note we have reversed the ordering for the proxy objectives for JEPIG and EPIG as they are minimized while EPIG is maximized.
Figure 4: Average Logarithmic RMSE by regression datasets for DNNs: Black-box/prediction-space ■ vs white-box/weight-space □ (vs Uniform). Improvement of the white-box □ method over the uniform baseline on the x-axis and the improvement of the black-box ■ method over the uniform baseline on the y-axis. The average over all datasets is marked with a star ⋆.
Important
The “informativeness” that various methods try to capture collapses to the same information-theoretic quantities known since (Lindley 1956; MacKay 1992).
Understanding these connections allows us to build better, more principled active learning methods.