(EPIG and RhoLoss)
\[ \require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\implicitE}[1]{\opExpectation \left [ #1 \right ]} \DeclareMathOperator{\opVar}{\mathrm{Var}} \newcommand{\Var}[2]{\opVar_{#1} \left [ #2 \right ]} \newcommand{\implicitVar}[1]{\opVar \left [ #1 \right ]} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \newcommand{\sicof}[1]{h(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\Y}{Y} \newcommand{\y}{y} \newcommand{\X}{\boldsymbol{X}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\wstar}{\boldsymbol{\theta^*}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\D}{\mathcal{D}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\indep}{\perp\!\!\!\!\perp} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} \newcommand{\specialHofHessian}[2]{\opEntropy''_{#1}[#2]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \newcommand{\specialHofJacobian}[2]{\opEntropy'_{#1}[#2]} \]
By the end of this lecture, you’ll understand:
In machine learning terms
Think of teaching:
Inductive Active Learning
Transductive Active Learning
EPIG (Smith et al. 2023) measures how much information acquiring a label for \(\x\) gives us about the prediction at an evaluation point \(\x_\text{eval}\):
\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \]
Note the \(\X_\text{eval}\) in the conditioning set!
We take an expectation over the evaluation distribution.
\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} = \E{\hpcof{}{\x_\text{eval}}}{\MIof{\Y_\text{eval}; \Y \given \x_\text{eval}, \x}} \]
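A minimal Monte Carlo sketch of this expectation, assuming an ensemble of \(K\) posterior samples that yields class probabilities (shapes and names are illustrative assumptions, not the paper's code):

```python
import torch

def epig_scores(probs_pool, probs_eval):
    """Monte Carlo EPIG estimate (sketch).

    probs_pool: [K, N, C]  p(y | x, theta_k) for N candidate points.
    probs_eval: [K, M, C]  p(y_eval | x_eval, theta_k) for M samples
                           from the evaluation distribution.
    Returns: [N] EPIG score per candidate point.
    """
    eps = 1e-12
    K = probs_pool.shape[0]
    # Joint predictive p(y, y_eval | x, x_eval) = E_theta[p(y|x,theta) p(y_eval|x_eval,theta)]
    joint = torch.einsum("knc,kmd->nmcd", probs_pool, probs_eval) / K
    # Product of the marginal predictives
    indep = probs_pool.mean(0)[:, None, :, None] * probs_eval.mean(0)[None, :, None, :]
    # I[Y_eval; Y | x_eval, x] for every (candidate, eval) pair ...
    mi = (joint * (joint.clamp_min(eps).log() - indep.clamp_min(eps).log())).sum((-2, -1))
    # ... then the expectation over the evaluation distribution
    return mi.mean(dim=1)
```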
What is the difference between EPIG and BALD?
Focus learning where it matters most.
BALD maximizes expected parameter information gain:
\[ \MIof{\Y ; \W \given \x} \]
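For comparison, a sketch of BALD under the same ensemble setup, using the standard decomposition \(\Hof{\Y \given \x} - \E{\pof{\w}}{\Hof{\Y \given \x, \w}}\):

```python
import torch

def bald_scores(probs_pool):
    """BALD sketch: H[E_theta p(y|x,theta)] - E_theta[H[p(y|x,theta)]].

    probs_pool: [K, N, C] class probabilities for K posterior samples.
    Returns: [N] BALD score per candidate point.
    """
    eps = 1e-12
    mean_probs = probs_pool.mean(dim=0)  # predictive distribution, [N, C]
    entropy_of_mean = -(mean_probs * mean_probs.clamp_min(eps).log()).sum(-1)
    mean_of_entropy = -(probs_pool * probs_pool.clamp_min(eps).log()).sum(-1).mean(0)
    return entropy_of_mean - mean_of_entropy
```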
But there’s a fundamental issue:
Not all parameter information is equally valuable for prediction!
Parameter Information
Practical Impact
As the pool size increases, BALD's performance degrades: it targets obscure inputs that have low density under the data distribution and are thus less relevant for prediction.
In contrast to BALD, EPIG deals effectively with a large pool (\(10^5\) unlabeled inputs), where BALD is overwhelmingly counterproductive, even relative to random acquisition.
BALD has no mechanism to ensure acquired labels are relevant to the prediction task we care about.
Hence EPIG, which directly targets the information that matters for the prediction task.
Active sampling: when we already have labels but want to prioritize which examples to train on.
Key insight: Not all labeled examples are equally valuable for training
So far, we have always taken expectations over the labels because we didn't have them. But what if we do?!
\[ \MIof{\W ; \Y \given \x} \implies \MIof{\W ; \y \given \x} \]
\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \implies \MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} \]
\[ \begin{aligned} &\MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} \\ &\quad= \Hof{\y_\text{eval} \given \x_\text{eval}} + \Hof{\y \given \x} - \Hof{\y_\text{eval}, \y \given \x_\text{eval}, \x} \end{aligned} \]
Same same!
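As a sketch, this pointwise quantity can be estimated from the same kind of posterior samples, reading \(\Hof{\y \given \x}\) at a realized label as the information content \(-\log \pof{\y \given \x}\) (names and shapes are illustrative):

```python
import math
import torch

def pointwise_mi(logp_eval, logp_y):
    """Pointwise MI i(y_eval; y | x_eval, x) from K posterior samples (sketch).

    logp_eval: [K] log p(y_eval | x_eval, theta_k) at the realized label.
    logp_y:    [K] log p(y | x, theta_k) at the realized label.
    """
    log_K = math.log(logp_eval.shape[0])
    # -H[y_eval, y | ...]: log of the joint predictive at the realized labels
    log_joint = torch.logsumexp(logp_eval + logp_y, dim=0) - log_K
    # -H[y_eval | ...] and -H[y | ...]: logs of the marginal predictives
    log_marg_eval = torch.logsumexp(logp_eval, dim=0) - log_K
    log_marg_y = torch.logsumexp(logp_y, dim=0) - log_K
    # H[y_eval] + H[y] - H[y_eval, y]
    return log_joint - log_marg_eval - log_marg_y
```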
Obviously, we don’t have only a single \(\x_\text{eval}\).
We could take an expectation over the evaluation distribution.
Alternatively, we can look at the joint distribution:
\[ \MIof{\y_\text{eval, 1}, \y_\text{eval, 2}, \ldots, \y_\text{eval, n}; \y \given \x_\text{eval, 1}, \x_\text{eval, 2}, \ldots, \x_\text{eval, n}, \x} \]
The advantage is that we can use the symmetry of mutual information to decompose it the other way:
\[ \begin{aligned} &\MIof{\y_\text{eval, 1}, \ldots, \y_\text{eval, n}; \y \given \x_\text{eval, 1}, \ldots, \x_\text{eval, n}, \x} \\ &\quad= \Hof{\y \given \x} - \Hof{\y \given \y_\text{eval, 1}, \x_\text{eval, 1}, \ldots, \y_\text{eval, n}, \x_\text{eval, n}, \x} \end{aligned} \]
That is, we can condition on the evaluation set. If we have enough evaluation points, we can simply train on all of that data, assuming that the additional training data won't make a big difference.
Enter RhoLoss: Prioritize examples that reduce the holdout loss
Train a reference model on the evaluation data (= holdout set).
Sample candidate training points according to their potential loss reduction: `current_loss - holdout_loss =: RhoLoss`, where `holdout_loss` is the loss under the reference model (which was trained on the holdout set).
```python
# Simplified RhoLoss implementation
import torch
import torch.nn.functional as F

def rho_loss_score(x, y, model, reference_model):
    """Per-example RhoLoss: current training loss minus irreducible holdout loss."""
    with torch.no_grad():
        current_loss = F.cross_entropy(model(x), y, reduction="none")
        # The reference model is a copy of the model trained on the holdout set.
        reference_loss = F.cross_entropy(reference_model(x), y, reduction="none")
    return current_loss - reference_loss

def train_with_rho_loss(model, reference_model, train_data,
                        candidate_batch_size=1000, training_batch_size=32):
    """Train model using the RhoLoss sampling strategy.

    Args:
        model: Model to train.
        reference_model: Model trained on the holdout (evaluation) set.
        train_data: (X, y) tuple of training data.
        candidate_batch_size: Size of the candidate batch to score with RhoLoss.
        training_batch_size: Number of points to actually train on per step.
    """
    X, y = train_data
    # Data loader over large candidate batches.
    dataset = torch.utils.data.TensorDataset(X, y)
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=candidate_batch_size,
        shuffle=True,
        drop_last=True,  # avoid a final batch smaller than training_batch_size
    )
    for batch_x, batch_y in loader:
        # Compute RhoLoss for the entire candidate batch at once.
        rho_losses = rho_loss_score(batch_x, batch_y, model, reference_model)
        # Sample points with probability proportional to softmax(RhoLoss),
        # or, as in the paper, take the top-k (!).
        selection_probs = F.softmax(rho_losses, dim=0)
        indices = torch.multinomial(selection_probs, training_batch_size,
                                    replacement=False)
        # Train on the sampled points; partial_fit stands in for one
        # optimizer step on the selected batch.
        model.partial_fit(batch_x[indices], batch_y[indices])
    return model
```
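Hypothetical usage sketch; `fit_model` and the tensors `X_holdout`, `y_holdout`, `X_train`, `y_train` are illustrative stand-ins, not part of the recipe above:

```python
import copy

# 1. Train the reference model on the holdout / evaluation data.
reference_model = copy.deepcopy(model)
fit_model(reference_model, X_holdout, y_holdout)  # hypothetical standard training loop

# 2. Prioritize training points by RhoLoss.
model = train_with_rho_loss(model, reference_model, (X_train, y_train))
```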
Key Takeaway
Transductive approaches can significantly improve efficiency when we know our target distribution!