Transductive Active Learning & Sampling

(EPIG and RhoLoss)

Today’s Goals

\[ \require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\implicitE}[1]{\opExpectation \left [ #1 \right ]} \DeclareMathOperator{\opVar}{\mathrm{Var}} \newcommand{\Var}[2]{\opVar_{#1} \left [ #2 \right ]} \newcommand{\implicitVar}[1]{\opVar \left [ #1 \right ]} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \newcommand{\sicof}[1]{h(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\Y}{Y} \newcommand{\y}{y} \newcommand{\X}{\boldsymbol{X}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\wstar}{\boldsymbol{\theta^*}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\D}{\mathcal{D}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\indep}{\perp\!\!\!\!\perp} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} \newcommand{\specialHofHessian}[2]{\opEntropy''_{#1}[#2]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \newcommand{\specialHofJacobian}[2]{\opEntropy'_{#1}[#2]} \]

By the end of this lecture, you’ll understand:

  1. The difference between transductive and inductive active learning
  2. Expected Predictive Information Gain (EPIG) and its motivation
  3. RhoLoss for active sampling
  4. Practical implications for real-world applications

Inductive vs Transductive

Transductive
From Latin “trans-ducere” (to lead across): Moves directly from specific cases to other specific cases. Like building a bridge between two points.
Inductive
From Latin “in-ducere” (to lead in/into): Moves from specific observations to general principles. Like climbing a mountain to see the entire landscape.

In machine learning terms

  • Inductive: Learn general rules that extend beyond training examples
  • Transductive: Transfer knowledge directly from known cases to specific target cases

Think of teaching:

  • Inductive teaching: Explaining grammar rules that work for any sentence
  • Transductive teaching: Teaching with a focus on what is needed for Friday’s examination

Naive Transductive AL

Inductive I-Diagram

Transductive I-Diagram

Inductive vs Transductive Learning

Inductive Active Learning

  • Focuses on learning model parameters
  • Acquisition maximizes parameter information
  • Example: BALD maximizes \[ \MIof{\Y ; \W \given \x} \]
  • Generalizes to any test distribution

Transductive Active Learning

  • Focuses on specific test/evaluation set
  • Acquisition maximizes predictive power
  • Example: EPIG maximizes \[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \]
  • Optimized for data distribution

Why Transductive Active Learning?

  1. Often we know our target distribution
  2. Not all parameter uncertainty matters equally
  3. Can ignore irrelevant parameter uncertainty
  4. More efficient use of labeling budget

Expected Predictive Information Gain (EPIG)

EPIG (Smith et al. 2023) measures how much information acquiring a label for \(\x\) gives us about predictions at an evaluation point \(\x_\text{eval}\):

\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \]

Note the \(\X_\text{eval}\) in the conditioning set!

We take an expectation over the evaluation distribution.

\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} = \E{\hpcof{}{\x_\text{eval}}}{\MIof{\Y_\text{eval}; \Y \given \x_\text{eval}, \x}} \]
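As a concrete illustration, here is a minimal Monte Carlo sketch of this score in PyTorch. It is a sketch under assumptions, not the authors' implementation: the names and shapes are illustrative, with probs_pool holding the predictive probabilities of the N candidate pool points under K posterior samples, and probs_eval holding the same for M samples from the evaluation distribution.

# Minimal Monte Carlo sketch of EPIG (illustrative names and shapes; not the paper's code)
import torch

def epig_scores(probs_pool, probs_eval, eps=1e-12):
    """probs_pool: [K, N, C] predictive probs for N pool points under K posterior samples.
    probs_eval: [K, M, C] predictive probs for M points from the evaluation distribution.
    Returns a [N] tensor of EPIG scores."""
    K = probs_pool.shape[0]
    # Joint predictive p(y, y_eval | x, x_eval): average the outer product over posterior samples
    joint = torch.einsum("knc,kmd->nmcd", probs_pool, probs_eval) / K    # [N, M, C, C]
    # Marginal predictives p(y | x) and p(y_eval | x_eval)
    marg_pool = probs_pool.mean(dim=0)                                   # [N, C]
    marg_eval = probs_eval.mean(dim=0)                                   # [M, C]
    # I[Y_eval; Y | x_eval, x] = H[Y | x] + H[Y_eval | x_eval] - H[Y, Y_eval | x, x_eval]
    h_pool = -(marg_pool * (marg_pool + eps).log()).sum(dim=-1)          # [N]
    h_eval = -(marg_eval * (marg_eval + eps).log()).sum(dim=-1)          # [M]
    h_joint = -(joint * (joint + eps).log()).sum(dim=(-2, -1))           # [N, M]
    mi = h_pool[:, None] + h_eval[None, :] - h_joint                     # [N, M]
    # Expectation over the evaluation distribution
    return mi.mean(dim=1)                                                # [N]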

Key Properties

  1. Focuses on predictive uncertainty
  2. Accounts for evaluation distribution
  3. Natural batch extension à la BatchBALD

EPIG vs BALD

What is the difference between EPIG and BALD?

Focus learning where it matters most.

EPIG vs BALD


BALD increases (darker shading) as we move away from the existing data, yielding a distant acquisition (★) when maximized. It seeks a global reduction in parameter uncertainty, regardless of any input distribution. In contrast, EPIG is maximized only in regions of relatively high density under the target input distribution, \(\hpcof{}{\x_\text{eval}}\). It seeks a reduction in parameter uncertainty only insofar as it reduces predictive uncertainty on samples from \(\hpcof{}{\x_\text{eval}}\).

Problem with BALD

BALD maximizes expected parameter information gain:

\[ \MIof{\Y ; \W \given \x} \]
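For comparison with the EPIG sketch above, here is a minimal Monte Carlo sketch of this score under the same assumed setup (a probs_pool array of shape [K, N, C]; illustrative, not the paper's code):

# Minimal Monte Carlo sketch of BALD (same illustrative setup as the EPIG sketch above)
import torch

def bald_scores(probs_pool, eps=1e-12):
    """probs_pool: [K, N, C] predictive probs for N pool points under K posterior samples."""
    # I[Y; Theta | x] = H[E_theta p(y | x, theta)] - E_theta H[p(y | x, theta)]
    marginal = probs_pool.mean(dim=0)                                              # [N, C]
    h_marginal = -(marginal * (marginal + eps).log()).sum(dim=-1)                  # [N]
    h_conditional = -(probs_pool * (probs_pool + eps).log()).sum(dim=-1).mean(dim=0)  # [N]
    return h_marginal - h_conditional                                              # [N]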

But there’s a fundamental issue:

Not all parameter information is equally valuable for prediction!

Key Issues with BALD

  1. No notion of target distribution
  2. Can be fooled by outliers
  3. Gets worse with larger pools
  4. Fails even without distribution shift

Why BALD Fails

Parameter Information

  • Focuses on reducing overall parameter uncertainty
  • Treats all regions of input space equally
  • No connection to prediction task
  • Can be infinite for non-parametric models

Practical Impact

  • Favors extreme/outlier points
  • Ignores data density
  • Wastes labels on irrelevant inputs
  • Performance degrades with pool size

Empirical Evidence

As pool size increases, BALD’s performance degrades because it targets obscure inputs that have low density under the data distribution and are less relevant for prediction.

Comparison

In contrast to BALD, EPIG deals effectively with a large pool (\(10^5\) unlabeled inputs). BALD is overwhelmingly counterproductive, even relative to random acquisition.

Real-World Implications

  1. Web-scraped datasets
    • Highly variable relevance
    • Many outliers
    • BALD targets obscure cases
  2. Scientific data
    • Mixed quality sources
    • Varying fidelity
    • BALD can’t prioritize relevance
  3. Production systems
    • Limited labeling budget
    • Need focused learning
    • BALD wastes resources

The Core Problem

BALD has no mechanism to ensure acquired labels are relevant to the prediction task we care about.

This motivates EPIG, which:

  1. Considers the target distribution
  2. Focuses on predictive uncertainty
  3. Ignores irrelevant parameter uncertainty

Results

From Active Learning to Active Sampling

Active sampling: When we have labels but want to prioritize training

Key insight: Not all labeled examples are equally valuable for training

Information Theory to the Rescue

So far, we have always taken expectations over the labels because we didn’t have them. But what if we do?!

(Predictive) Information Gain

Information Gain

\[ \MIof{\W ; \Y \given \x} \implies \MIof{\W ; \y \given \x} \]

Predictive Information Gain

\[ \MIof{\Y_\text{eval}; \Y \given \X_\text{eval}, \x} \implies \MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} \]

Pointwise Mutual Information

\[ \begin{aligned} &\MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} \\ &\quad= \Hof{\y_\text{eval} \given \x_\text{eval}} + \Hof{\y \given \x} - \Hof{\y_\text{eval}, \y \given \x_\text{eval}, \x} \end{aligned} \]

Same same!
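Since each \(\Hof{\cdot}\) above is now the information content of a realized outcome, \(\Hof{\y \given \x} = -\log \pof{\y \given \x}\), the pointwise mutual information is simply a log-ratio:

\[ \MIof{\y_\text{eval}; \y \given \x_\text{eval}, \x} = \log \frac{\pof{\y_\text{eval}, \y \given \x_\text{eval}, \x}}{\pof{\y_\text{eval} \given \x_\text{eval}} \, \pof{\y \given \x}}. \]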

Joint Pointwise Mutual Information

Obviously, we don’t have only a single \(\x_\text{eval}\).

  1. We could take an expectation over the evaluation distribution.

  2. We can look at the joint distribution:

    \[ \MIof{\y_\text{eval, 1}, \y_\text{eval, 2}, \ldots, \y_\text{eval, n}; \y \given \x_\text{eval, 1}, \x_\text{eval, 2}, \ldots, \x_\text{eval, n}, \x} \]

Symmetric Decomposition

The advantage is that we can use the symmetric decomposition of mutual information:

\[ \begin{aligned} &\MIof{\y_\text{eval, 1}, \ldots, \y_\text{eval, n}; \y \given \x_\text{eval, 1}, \ldots, \x_\text{eval, n}, \x} \\ &\quad= \Hof{\y \given \x} - \Hof{\y \given \y_\text{eval, 1}, \x_\text{eval, 1}, \ldots, \y_\text{eval, n}, \x_\text{eval, n}, \x} \end{aligned} \]

That is, we can condition on the evaluation set. If we have enough evaluation points, we can simply train a reference model on that data, assuming that additionally conditioning on the training data won’t make a big difference.
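In other words (a heuristic identification, assuming a reference model trained on the evaluation data stands in for the conditional term):

\[ \underbrace{\Hof{\y \given \x}}_{\text{current model's loss on } (\x, \y)} - \underbrace{\Hof{\y \given \y_\text{eval, 1}, \x_\text{eval, 1}, \ldots, \y_\text{eval, n}, \x_\text{eval, n}, \x}}_{\approx\ \text{loss of a reference model trained on the evaluation data}} \]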

Enter RhoLoss: Prioritize examples that reduce the holdout loss

RhoLoss (Mindermann et al. 2022)

  1. Train reference model on evaluation data (=holdout set).

  2. Sample candidate training points according to their potential to reduce the loss on the holdout set:

    • current_loss - holdout_loss,
    • where holdout_loss is the loss on the reference model.
    • =: RhoLoss
    • accounts for irreducible/aleatoric uncertainty: loss on reference model is proxy for irreducible loss (lower bound on loss!)
    • prioritizes samples with high RhoLoss that:
      • Are learnable (low noise)
      • Worth learning (high impact)
      • Not yet learned (high potential gain)

RhoLoss in Practice

# Simplified RhoLoss implementation (PyTorch sketch)
import torch
import torch.nn.functional as F

# Assumes `reference_model` has already been trained on the holdout (evaluation) set,
# e.g. starting from a copy of the untrained model.

def rho_loss_score(x, y, model, reference_model):
    """Per-example RhoLoss: current training loss minus reference (holdout) loss."""
    with torch.no_grad():
        current_loss = F.cross_entropy(model(x), y, reduction="none")
        reference_loss = F.cross_entropy(reference_model(x), y, reduction="none")
    return current_loss - reference_loss

# Train on batches of data
def train_with_rho_loss(model, reference_model, train_data, optimizer,
                        candidate_batch_size=1000, training_batch_size=32):
    """Train model using the RhoLoss sampling strategy.

    Args:
        model: Model to train
        reference_model: Model trained on the holdout set
        train_data: (X, y) tuple of training data
        optimizer: Optimizer for `model`
        candidate_batch_size: Size of candidate batch to compute RhoLoss over
        training_batch_size: Size of batch to train on after sampling
    """
    X, y = train_data

    # Create data loader for large candidate batches
    dataset = torch.utils.data.TensorDataset(X, y)
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=candidate_batch_size,
        shuffle=True,
    )

    for batch_x, batch_y in loader:
        # Compute RhoLoss for the entire candidate batch at once
        rho_losses = rho_loss_score(batch_x, batch_y, model, reference_model)

        # Sample points with probability proportional to softmax(RhoLoss),
        # or like in the paper using top-K (!)
        selection_probs = F.softmax(rho_losses, dim=0)
        num_selected = min(training_batch_size, rho_losses.numel())
        indices = torch.multinomial(selection_probs, num_selected, replacement=False)

        # Train on the sampled points
        training_batch_x = batch_x[indices]
        training_batch_y = batch_y[indices]

        optimizer.zero_grad()
        loss = F.cross_entropy(model(training_batch_x), training_batch_y)
        loss.backward()
        optimizer.step()

    return model
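
A hypothetical usage sketch (model, a holdout-trained reference_model, and the tensors X_train, y_train are assumed to exist; none of these names come from the paper):

# Hypothetical usage (all names are assumptions)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
    model = train_with_rho_loss(model, reference_model, (X_train, y_train), optimizer)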

Results on Large-Scale Data

  • 18x fewer training steps
  • 2% higher accuracy
  • Works across architectures:
    • MLPs
    • CNNs
    • BERT

Practical Considerations

  1. Computational overhead vs benefits
  2. Batch processing for efficiency
  3. Adaptation to different domains
  4. Integration with existing pipelines

Summary

  1. Transductive active learning optimizes for known test distribution
  2. EPIG provides principled information-theoretic framework
  3. RhoLoss extends ideas to active sampling
  4. Significant practical benefits in real-world applications

Questions?

Key Takeaway

Transductive approaches can significantly improve efficiency when we know our target distribution!

Appendix

Mindermann, Sören, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, et al. 2022. “Prioritized Training on Points That Are Learnable, Worth Learning, and Not yet Learnt.” In International Conference on Machine Learning, 15630–49. PMLR.
Smith, Freddie Bickford, Andreas Kirsch, Sebastian Farquhar, Yarin Gal, Adam Foster, and Tom Rainforth. 2023. “Prediction-Oriented Bayesian Active Learning.” In International Conference on Artificial Intelligence and Statistics, 7331–48. PMLR.