Transductive Active Learning & Sampling

(EPIG and RhoLoss)

Today’s Goals

\[ \require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\implicitE}[1]{\opExpectation \left [ #1 \right ]} \DeclareMathOperator{\opVar}{\mathrm{Var}} \newcommand{\Var}[2]{\opVar_{#1} \left [ #2 \right ]} \newcommand{\implicitVar}[1]{\opVar \left [ #1 \right ]} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \newcommand{\sicof}[1]{h(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\Y}{Y} \newcommand{\y}{y} \newcommand{\X}{\boldsymbol{X}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\wstar}{\boldsymbol{\theta^*}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\D}{\mathcal{D}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\indep}{\perp\!\!\!\!\perp} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} \newcommand{\specialHofHessian}[2]{\opEntropy''_{#1}[#2]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \newcommand{\specialHofJacobian}[2]{\opEntropy'_{#1}[#2]} \]

By the end of this lecture, you’ll understand:

  1. The difference between transductive and inductive active learning
  2. Expected Predictive Information Gain (EPIG) and its motivation
  3. RhoLoss for active sampling
  4. Practical implications for real-world applications

Inductive vs Transductive

Transductive
From Latin “trans-ducere” (to lead across): Moves directly from specific cases to other specific cases. Like building a bridge between two points.
Inductive
From Latin “in-ducere” (to lead in/into): Moves from specific observations to general principles. Like climbing a mountain to see the entire landscape.

In machine learning terms

  • Inductive: Learn general rules that extend beyond training examples
  • Transductive: Transfer knowledge directly from known cases to specific target cases

Think of teaching:

  • Inductive teaching: Explaining grammar rules that work for any sentence
  • Transductive teaching: Teaching with a focus on what is needed for Friday’s examination

Naive Transductive AL

Inductive I-Diagram

Transductive I-Diagram

Inductive vs Transductive Learning

Inductive Active Learning

  • Focuses on learning model parameters
  • Acquisition maximizes parameter information
  • Example: BALD maximizes \[ \MIof{\Y ; \W \given \x} \]
  • Generalizes to any test distribution

Transductive Active Learning

  • Focuses on specific test/evaluation set
  • Acquisition maximizes predictive power
  • Example: EPIG maximizes \[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \]
  • Optimized for data distribution

Why Transductive Active Learning?

  1. Often we know our target distribution
  2. Not all parameter uncertainty matters equally
  3. Can ignore irrelevant parameter uncertainty
  4. More efficient use of labeling budget

Expected Predictive Information Gain (EPIG)

EPIG (Smith et al. 2023) measures how much information acquiring a label for \(\x\) gives us about predictions at an evaluation point \(\x_\text{eval}\):

\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \]

Note the \(\X_\text{eval}\) in the conditioning set!

We take an expectation over the evaluation distribution.

\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} = \E{\hpcof{}{\x_\text{eval}}}{\MIof{\Y_\text{eval}; \Y \given \x_\text{eval}, \x}} \]
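As a concrete illustration, here is a minimal Monte Carlo sketch of this score in PyTorch. It is a sketch under assumptions, not the authors' implementation: the names and shapes are illustrative, with probs_pool holding the predictive probabilities of the N candidate pool points under K posterior samples, and probs_eval holding the same for M samples from the evaluation distribution.

# Minimal Monte Carlo sketch of EPIG (illustrative names and shapes; not the paper's code)
import torch

def epig_scores(probs_pool, probs_eval, eps=1e-12):
    """probs_pool: [K, N, C] predictive probs for N pool points under K posterior samples.
    probs_eval: [K, M, C] predictive probs for M points from the evaluation distribution.
    Returns a [N] tensor of EPIG scores."""
    K = probs_pool.shape[0]
    # Joint predictive p(y, y_eval | x, x_eval): average the outer product over posterior samples
    joint = torch.einsum("knc,kmd->nmcd", probs_pool, probs_eval) / K    # [N, M, C, C]
    # Marginal predictives p(y | x) and p(y_eval | x_eval)
    marg_pool = probs_pool.mean(dim=0)                                   # [N, C]
    marg_eval = probs_eval.mean(dim=0)                                   # [M, C]
    # I[Y_eval; Y | x_eval, x] = H[Y | x] + H[Y_eval | x_eval] - H[Y, Y_eval | x, x_eval]
    h_pool = -(marg_pool * (marg_pool + eps).log()).sum(dim=-1)          # [N]
    h_eval = -(marg_eval * (marg_eval + eps).log()).sum(dim=-1)          # [M]
    h_joint = -(joint * (joint + eps).log()).sum(dim=(-2, -1))           # [N, M]
    mi = h_pool[:, None] + h_eval[None, :] - h_joint                     # [N, M]
    # Expectation over the evaluation distribution
    return mi.mean(dim=1)                                                # [N]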

Key Properties

  1. Focuses on predictive uncertainty
  2. Accounts for evaluation distribution
  3. Natural batch extension à la BatchBALD

EPIG vs BALD

What is the difference between EPIG and BALD?

Focus learning where it matters most.

EPIG vs BALD


BALD increases (darker shading) as we move away from the existing data, yielding a distant acquisition (★) when maximized. It seeks a global reduction in parameter uncertainty, regardless of any input distribution. In contrast, EPIG is maximized only in regions of relatively high density under the target input distribution, \(\hpcof{}{\x_\text{eval}}\). It seeks a reduction in parameter uncertainty only insofar as it reduces predictive uncertainty on samples from \(\hpcof{}{\x_\text{eval}}\).

Problem with BALD

BALD maximizes expected parameter information gain:

\[ \MIof{\Y ; \W \given \x} \]
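For comparison with the EPIG sketch above, here is a minimal Monte Carlo sketch of this score under the same assumed setup (a probs_pool array of shape [K, N, C]; illustrative, not the paper's code):

# Minimal Monte Carlo sketch of BALD (same illustrative setup as the EPIG sketch above)
import torch

def bald_scores(probs_pool, eps=1e-12):
    """probs_pool: [K, N, C] predictive probs for N pool points under K posterior samples."""
    # I[Y; Theta | x] = H[E_theta p(y | x, theta)] - E_theta H[p(y | x, theta)]
    marginal = probs_pool.mean(dim=0)                                              # [N, C]
    h_marginal = -(marginal * (marginal + eps).log()).sum(dim=-1)                  # [N]
    h_conditional = -(probs_pool * (probs_pool + eps).log()).sum(dim=-1).mean(dim=0)  # [N]
    return h_marginal - h_conditional                                              # [N]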

But there’s a fundamental issue:

Not all parameter information is equally valuable for prediction!

Key Issues with BALD

  1. No notion of target distribution
  2. Can be fooled by outliers
  3. Gets worse with larger pools
  4. Fails even without distribution shift

Why BALD Fails

Parameter Information

  • Focuses on reducing overall parameter uncertainty
  • Treats all regions of input space equally
  • No connection to prediction task
  • Can be infinite for non-parametric models

Practical Impact

  • Favors extreme/outlier points
  • Ignores data density
  • Wastes labels on irrelevant inputs
  • Performance degrades with pool size

Empirical Evidence

As pool size increases, BALD’s performance degrades because it targets obscure inputs that have low density under the data distribution and are less relevant for prediction.

Comparison

In contrast to BALD, EPIG deals effectively with a large pool (\(10^5\) unlabeled inputs). BALD is overwhelmingly counterproductive, even relative to random acquisition.

Real-World Implications

  1. Web-scraped datasets
    • Highly variable relevance
    • Many outliers
    • BALD targets obscure cases
  2. Scientific data
    • Mixed quality sources
    • Varying fidelity
    • BALD can’t prioritize relevance
  3. Production systems
    • Limited labeling budget
    • Need focused learning
    • BALD wastes resources

The Core Problem

BALD has no mechanism to ensure acquired labels are relevant to the prediction task we care about.

This motivates EPIG, which:

  1. Considers the target distribution
  2. Focuses on predictive uncertainty
  3. Ignores irrelevant parameter uncertainty

Results

From Active Learning to Active Sampling

Active sampling: When we have labels but want to prioritize training

Key insight: Not all labeled examples are equally valuable for training

Information Theory to the Rescue

So far, we have always taken expectations over the labels because we didn’t have them. But what if we do?!

(Predictive) Information Gain

Information Gain

\[ \MIof{\W ; \Y \given \x} \implies \MIof{\W ; \y \given \x} \]

Predictive Information Gain

\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \implies \MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} \]

Pointwise Mutual Information

\[ \begin{aligned} &\MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} \\ &\quad= \Hof{\y_\text{eval} \given \x_\text{eval}} + \Hof{\y \given \x} - \Hof{\y_\text{eval}, \y \given \x_\text{eval}, \x} \end{aligned} \]

Same same!
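Since each \(\Hof{\cdot}\) above is now the information content of a realized outcome, \(\Hof{\y \given \x} = -\log \pof{\y \given \x}\), the pointwise mutual information is simply a log-ratio:

\[ \MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} = \log \frac{\pof{\y_\text{eval}, \y \given \x_\text{eval}, \x}}{\pof{\y_\text{eval} \given \x_\text{eval}} \, \pof{\y \given \x}}. \]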

Joint Pointwise Mutual Information

Obviously, we don’t have only a single \(\x_\text{eval}\).

  1. We could take an expectation over the evaluation distribution.

  2. We can look at the joint distribution:

    \[ \MIof{\y_\text{eval, 1}, \y_\text{eval, 2}, \ldots, \y_\text{eval, n}; \y \given \x_\text{eval, 1}, \x_\text{eval, 2}, \ldots, \x_\text{eval, n}, \x} \]

Symmetric Decomposition

The advantage is that we can use the symmetric decomposition of mutual information:

\[ \begin{aligned} &\MIof{\y_\text{eval, 1}, \ldots, \y_\text{eval, n}; \y \given \x_\text{eval, 1}, \ldots, \x_\text{eval, n}, \x} \\ &\quad= \Hof{\y \given \x} - \Hof{\y \given \y_\text{eval, 1}, \x_\text{eval, 1}, \ldots, \y_\text{eval, n}, \x_\text{eval, n}, \x} \end{aligned} \]

That is, we can condition on the evaluation set. If we have enough evaluation points, we can simply train a reference model on that data, assuming that additionally conditioning on the training data won’t make a big difference.
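In other words (a heuristic identification, assuming a reference model trained on the evaluation data stands in for the conditional term):

\[ \underbrace{\Hof{\y \given \x}}_{\text{current model's loss on } (\x, \y)} - \underbrace{\Hof{\y \given \y_\text{eval, 1}, \x_\text{eval, 1}, \ldots, \y_\text{eval, n}, \x_\text{eval, n}, \x}}_{\approx\ \text{loss of a reference model trained on the evaluation data}} \]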

Enter RhoLoss: Prioritize examples that reduce the holdout loss

RhoLoss (Mindermann et al. 2022)

  1. Train reference model on evaluation data (=holdout set).

  2. Sample candidate training points according to their potential to reduce the loss on the holdout set:

    • current_loss - holdout_loss,
    • where holdout_loss is the loss on the reference model.
    • =: RhoLoss
    • accounts for irreducible/aleatoric uncertainty: loss on reference model is proxy for irreducible loss (lower bound on loss!)
    • prioritizes samples with high RhoLoss that:
      • Are learnable (low noise)
      • Worth learning (high impact)
      • Not yet learned (high potential gain)

RhoLoss in Practice

# Simplified RhoLoss implementation (PyTorch sketch)
import torch
import torch.nn.functional as F

# Assumes `reference_model` has already been trained on the holdout (evaluation) set,
# e.g. starting from a copy of the untrained model.

def rho_loss_score(x, y, model, reference_model):
    """Per-example RhoLoss: current training loss minus reference (holdout) loss."""
    with torch.no_grad():
        current_loss = F.cross_entropy(model(x), y, reduction="none")
        reference_loss = F.cross_entropy(reference_model(x), y, reduction="none")
    return current_loss - reference_loss

# Train on batches of data
def train_with_rho_loss(model, reference_model, train_data, optimizer,
                        candidate_batch_size=1000, training_batch_size=32):
    """Train model using the RhoLoss sampling strategy.

    Args:
        model: Model to train
        reference_model: Model trained on the holdout set
        train_data: (X, y) tuple of training data
        optimizer: Optimizer for `model`
        candidate_batch_size: Size of candidate batch to compute RhoLoss over
        training_batch_size: Size of batch to train on after sampling
    """
    X, y = train_data

    # Create data loader for large candidate batches
    dataset = torch.utils.data.TensorDataset(X, y)
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=candidate_batch_size,
        shuffle=True,
    )

    for batch_x, batch_y in loader:
        # Compute RhoLoss for the entire candidate batch at once
        rho_losses = rho_loss_score(batch_x, batch_y, model, reference_model)

        # Sample points with probability proportional to softmax(RhoLoss),
        # or like in the paper using top-K (!)
        selection_probs = F.softmax(rho_losses, dim=0)
        num_selected = min(training_batch_size, rho_losses.numel())
        indices = torch.multinomial(selection_probs, num_selected, replacement=False)

        # Train on the sampled points
        training_batch_x = batch_x[indices]
        training_batch_y = batch_y[indices]

        optimizer.zero_grad()
        loss = F.cross_entropy(model(training_batch_x), training_batch_y)
        loss.backward()
        optimizer.step()

    return model
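
A hypothetical usage sketch (model, a holdout-trained reference_model, and the tensors X_train, y_train are assumed to exist; none of these names come from the paper):

# Hypothetical usage (all names are assumptions)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
    model = train_with_rho_loss(model, reference_model, (X_train, y_train), optimizer)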

Results on Large-Scale Data

  • 18x fewer training steps
  • 2% higher accuracy
  • Works across architectures:
    • MLPs
    • CNNs
    • BERT

Practical Considerations

  1. Computational overhead vs benefits
  2. Batch processing for efficiency
  3. Adaptation to different domains
  4. Integration with existing pipelines

Summary

  1. Transductive active learning optimizes for known test distribution
  2. EPIG provides principled information-theoretic framework
  3. RhoLoss extends ideas to active sampling
  4. Significant practical benefits in real-world applications

Questions?

Key Takeaway

Transductive approaches can significantly improve efficiency when we know our target distribution!

Appendix

Mindermann, Sören, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, et al. 2022. “Prioritized Training on Points That Are Learnable, Worth Learning, and Not yet Learnt.” In International Conference on Machine Learning, 15630–49. PMLR.
Smith, Freddie Bickford, Andreas Kirsch, Sebastian Farquhar, Yarin Gal, Adam Foster, and Tom Rainforth. 2023. “Prediction-Oriented Bayesian Active Learning.” In International Conference on Artificial Intelligence and Statistics, 7331–48. PMLR.