Ready to go.

You can find the paper here: https://arxiv.org/abs/2102.11582.

Please cite us using:

@article{mukhoti2021deterministic,
  title={Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty},
  author={Mukhoti, Jishnu and Kirsch, Andreas and van Amersfoort, Joost and Torr, Philip HS and Gal, Yarin},
  journal={arXiv preprint arXiv:2102.11582},
  year={2021}
}

We create a new PyTorch VisionDataset for Ambiguous-MNIST and then concatenate it with MNIST (using Joost van Amersfoort's FastMNIST, https://tinyurl.com/pytorch-fast-mnist) to build DirtyMNIST, as sketched below.
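Conceptually, the concatenation is just a torch.utils.data.ConcatDataset over the two parts. A minimal sketch of the idea (the actual DirtyMNIST implementation below handles additional details, such as adding Gaussian noise to the MNIST samples via noise_stddev):

from torch.utils.data import ConcatDataset

def build_dirty_mnist_sketch(root, train=True, download=False, device=None):
    # Regular MNIST (fast, tensor-based) plus Ambiguous-MNIST.
    mnist = FastMNIST(root, train=train, download=download, device=device)
    amnist = AmbiguousMNIST(root, train=train, download=download, device=device)
    return ConcatDataset([mnist, amnist])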

We export a constant MNIST_NORMALIZATION in case you want to combine it with other transforms. All datasets take a normalize parameter (True by default) that normalizes the samples directly, without requiring a transform, so transform and target_transform can be left empty.

MNIST_NORMALIZATION = Normalize((0.1307,), (0.3081,))
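If you prefer your own transform pipeline, disable the built-in normalization and include MNIST_NORMALIZATION yourself. A sketch, assuming the samples are already tensors (as with FastMNIST), so no ToTensor is needed; RandomCrop is just an arbitrary extra transform for illustration:

from torchvision.transforms import Compose, RandomCrop

dirty_mnist_train = DirtyMNIST(
    ".",
    train=True,
    download=True,
    normalize=False,  # avoid normalizing twice
    transform=Compose([RandomCrop(28, padding=2), MNIST_NORMALIZATION]),
)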

class AmbiguousMNIST[source]

AmbiguousMNIST(*args, **kwds) :: VisionDataset

Ambiguous-MNIST Dataset

Please cite:

    @article{mukhoti2021deterministic,
      title={Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty},
      author={Mukhoti, Jishnu and Kirsch, Andreas and van Amersfoort, Joost and Torr, Philip HS and Gal, Yarin},
      journal={arXiv preprint arXiv:2102.11582},
      year={2021}
    }


Args:
    root (string): Root directory of dataset where ``MNIST/processed/training.pt``
        and  ``MNIST/processed/test.pt`` exist.
    train (bool, optional): If True, creates dataset from ``training.pt``,
        otherwise from ``test.pt``.
    download (bool, optional): If true, downloads the dataset from the internet and
        puts it in root directory. If dataset is already downloaded, it is not
        downloaded again.
    transform (callable, optional): A function/transform that takes in a PIL image
        and returns a transformed version. E.g., ``transforms.RandomCrop``
    target_transform (callable, optional): A function/transform that takes in the
        target and transforms it.
    normalize (bool, optional): Normalize the samples.
    device: Device to use (pass `num_workers=0, pin_memory=False` to the DataLoader for max throughput)
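For instance, a quick usage sketch:

ambiguous_mnist_test = AmbiguousMNIST(".", train=False, download=True)

# Indexing yields (image, label) pairs; images are normalized tensors by default.
image, label = ambiguous_mnist_test[0]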

class FastMNIST[source]

FastMNIST(*args, **kwds) :: MNIST

FastMNIST, based on https://tinyurl.com/pytorch-fast-mnist. It's like MNIST (<http://yann.lecun.com/exdb/mnist/>) but faster.

Args:
    root (string): Root directory of dataset where ``MNIST/processed/training.pt``
        and  ``MNIST/processed/test.pt`` exist.
    train (bool, optional): If True, creates dataset from ``training.pt``,
        otherwise from ``test.pt``.
    download (bool, optional): If true, downloads the dataset from the internet and
        puts it in root directory. If dataset is already downloaded, it is not
        downloaded again.
    transform (callable, optional): A function/transform that takes in a PIL image
        and returns a transformed version. E.g., ``transforms.RandomCrop``
    target_transform (callable, optional): A function/transform that takes in the
        target and transforms it.
    normalize (bool, optional): Normalize the samples.
    device: Device to use (pass `num_workers=0, pin_memory=False` to the DataLoader for
        max throughput).
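For example, to keep the whole dataset on the GPU (a sketch; note the num_workers=0, pin_memory=False advice from above):

import torch

fast_mnist_train = FastMNIST(".", train=True, download=True, device="cuda")

fast_mnist_dataloader = torch.utils.data.DataLoader(
    fast_mnist_train, batch_size=128, shuffle=True, num_workers=0, pin_memory=False
)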

DirtyMNIST[source]

DirtyMNIST(root:str, train:bool=True, transform:Optional[Callable]=None, target_transform:Optional[Callable]=None, download:bool=False, normalize=True, noise_stddev=0.05, device=None)

DirtyMNIST

Please cite:

    @article{mukhoti2021deterministic,
      title={Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty},
      author={Mukhoti, Jishnu and Kirsch, Andreas and van Amersfoort, Joost and Torr, Philip HS and Gal, Yarin},
      journal={arXiv preprint arXiv:2102.11582},
      year={2021}
    }

Args:
    root (string): Root directory of dataset where ``MNIST/processed/training.pt``
        and  ``MNIST/processed/test.pt`` exist.
    train (bool, optional): If True, creates dataset from ``training.pt``,
        otherwise from ``test.pt``.
    download (bool, optional): If true, downloads the dataset from the internet and
        puts it in root directory. If dataset is already downloaded, it is not
        downloaded again.
    transform (callable, optional): A function/transform that takes in a PIL image
        and returns a transformed version. E.g., ``transforms.RandomCrop``
    target_transform (callable, optional): A function/transform that takes in the
        target and transforms it.
    normalize (bool, optional): Normalize the samples.
    device: Device to use (pass `num_workers=0, pin_memory=False` to the DataLoader for
        max throughput).

Example

Let's look at the dataset:

dirty_mnist_train = DirtyMNIST(".", train=True, download=True)

This initializes DirtyMNIST and, by default, also normalizes the dataset (equivalent to applying MNIST_NORMALIZATION = Normalize((0.1307,), (0.3081,)) as a transform, but faster). Use normalize=False if you don't want to normalize the dataset.

!pip install tqdm

from tqdm.auto import tqdm
import torch

# Keep the entire dataset on the GPU so the DataLoader returns CUDA tensors directly.
dirty_mnist_train = DirtyMNIST(".", train=True, download=True, device="cuda")

dirty_mnist_dataloader = torch.utils.data.DataLoader(
    dirty_mnist_train,
    batch_size=128,
    shuffle=True,
    num_workers=0,
    pin_memory=False,
)

# Iterate once to measure throughput; the .cuda() calls are no-ops here
# because the tensors already live on the GPU.
for image, label in tqdm(dirty_mnist_dataloader):
    image.cuda()
    label.cuda()

This achieves about 800 it/s on a workstation, compared to only about 40 it/s for the default MNIST dataset. This insight is from Joost's https://tinyurl.com/pytorch-fast-mnist.

from torchvision.datasets import MNIST
from torchvision.transforms import Compose, ToTensor

# Baseline: the default MNIST dataset with an equivalent transform pipeline.
mnist_train = MNIST(
    ".",
    train=True,
    download=True,
    transform=Compose([ToTensor(), MNIST_NORMALIZATION]),
)

mnist_dataloader = torch.utils.data.DataLoader(
    mnist_train, batch_size=128, shuffle=True, num_workers=0, pin_memory=True
)

for image, label in tqdm(mnist_dataloader):
    pass
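The speedup comes from doing all per-sample work once, up front: the images are converted to a float tensor, normalized in a single vectorized pass, and optionally moved to the GPU, so __getitem__ reduces to plain indexing. A minimal sketch of the idea, following Joost's gist (not the package's actual implementation):

from torchvision.datasets import MNIST

class TensorMNIST(MNIST):
    # Sketch: preprocess the whole dataset once, then just index tensors.

    def __init__(self, *args, device=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Scale to [0, 1], add a channel dimension, and normalize in one pass.
        self.data = self.data.unsqueeze(1).float().div(255)
        self.data = self.data.sub(0.1307).div(0.3081)
        if device is not None:
            self.data = self.data.to(device)
            self.targets = self.targets.to(device)

    def __getitem__(self, index):
        # No PIL round-trip, no per-item transform: just tensor indexing.
        return self.data[index], self.targets[index]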
Let's visualize the two parts of DirtyMNIST:

import matplotlib.pyplot as plt
from torchvision.utils import make_grid

# Every 10th sample from the Ambiguous-MNIST part (datasets[1] of the ConcatDataset):
plt.imshow(
    make_grid(dirty_mnist_train.datasets[1].data[: 64 * 10 : 10], nrow=8, normalize=True).permute(1, 2, 0).cpu().numpy()
)
plt.show()

# Every 10th sample from the regular MNIST part (datasets[0]):
plt.imshow(
    make_grid(dirty_mnist_train.datasets[0].data[: 64 * 10 : 10], nrow=8, normalize=True).permute(1, 2, 0).cpu().numpy()
)
plt.show()

"Distributional" Ambiguous MNIST

by Andreas Kirsch, April 2022

Finally, for future active learning research, we also make available the full label distribution of each sample in the AmbiguousMNIST dataset, as it was created.

class DistributionalAmbiguousMNIST[source]

DistributionalAmbiguousMNIST(*args, **kwds) :: VisionDataset

Ambiguous-MNIST Dataset (Distributional Version)

Please cite for the original dataset:

    @article{mukhoti2021deterministic,
      title={Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty},
      author={Mukhoti, Jishnu and Kirsch, Andreas and van Amersfoort, Joost and Torr, Philip HS and Gal, Yarin},
      journal={arXiv preprint arXiv:2102.11582},
      year={2021}
    }


Args:
    root (string): Root directory of dataset where ``MNIST/processed/training.pt``
        and  ``MNIST/processed/test.pt`` exist.
    train (bool, optional): If True, creates dataset from ``training.pt``,
        otherwise from ``test.pt``.
    download (bool, optional): If true, downloads the dataset from the internet and
        puts it in root directory. If dataset is already downloaded, it is not
        downloaded again.
    transform (callable, optional): A function/transform that takes in a PIL image
        and returns a transformed version. E.g., ``transforms.RandomCrop``
    target_transform (callable, optional): A function/transform that takes in the
        target and transforms it.
    normalize (bool, optional): Normalize the samples.
    device: Device to use (pass `num_workers=0, pin_memory=False` to the DataLoader for max throughput)

Example

Let's look at the dataset:

train_dataset = DistributionalAmbiguousMNIST(".", train=True, download=True, device="cuda")

# The target is now a full probability distribution over the 10 classes:
list(enumerate(train_dataset[3000][1].cpu().tolist()))
[(0, 0.5075895190238953),
 (1, 1.6385949493269436e-05),
 (2, 0.00026719641755335033),
 (3, 0.0001370514219161123),
 (4, 1.161579689323844e-06),
 (5, 0.04444393143057823),
 (6, 0.4432283043861389),
 (7, 4.886585429630941e-06),
 (8, 0.0037137300241738558),
 (9, 0.0005978558911010623)]
# Sample 50 hard labels from this distribution:
torch.multinomial(train_dataset[3000][1], 50, replacement=True)
tensor([5, 6, 0, 0, 0, 0, 0, 6, 0, 6, 0, 0, 0, 0, 6, 6, 0, 6, 0, 0, 0, 6, 0, 0,
        0, 0, 6, 0, 6, 0, 6, 6, 6, 0, 6, 5, 6, 5, 6, 6, 0, 0, 6, 0, 0, 6, 6, 6,
        0, 6], device='cuda:0')
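Since the full distributions are available, you can, for example, resample a fresh hard label for each ambiguous image every epoch instead of fixing one upfront. A sketch for illustration (sample_hard_labels is a hypothetical helper, not part of the package):

import torch

def sample_hard_labels(dataset, indices):
    # Draw one label per sample from its full label distribution.
    distributions = torch.stack([dataset[i][1] for i in indices])
    return torch.multinomial(distributions, num_samples=1).squeeze(-1)

labels = sample_hard_labels(train_dataset, range(3000, 3008))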