Data Loading

Datasets

Dataset

class Dataset

Abstract class representing a dataset: a mapping index -> sample, where a sample is a vector of af::arrays.

Can be extended to concatenate, split, batch, resample, etc., datasets.

A Dataset can either own its data directly, or through shared_ptr ownership of underlying Datasets.

Subclassed by fl::BatchDataset, fl::BlobDataset, fl::ConcatDataset, fl::MergeDataset, fl::PrefetchDataset, fl::ResampleDataset, fl::TensorDataset, fl::TransformDataset

Public Types

using PermutationFunction = std::function<int64_t(int64_t)>

A bijective mapping of dataset indices \([0, n) \to [0, n)\).

using TransformFunction = std::function<af::array(const af::array&)>

A function to transform an array.

using LoadFunction = std::function<af::array(const std::string&)>

A function to load data from a file into an array.

using BatchFunction = std::function<af::array(const std::vector<af::array>&)>

A function to pack arrays into a batched array.

using DataTransformFunction = std::function<af::array(void *, af::dim4, af::dtype)>

A function to transform data from host to array.

using iterator = detail::DatasetIterator<Dataset, std::vector<af::array>>

Public Functions

virtual int64_t size() const = 0

Return

The size of the dataset.

virtual std::vector<af::array> get(const int64_t idx) const = 0

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

virtual ~Dataset()
iterator begin()
iterator end()
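The index -> sample contract above can be sketched with a schematic analogue that replaces `af::array` fields with plain `int`s. `ToyDataset` is a hypothetical stand-in, not part of the library:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Schematic analogue of fl::Dataset: a mapping index -> sample.
// Here a "sample" is a std::vector<int> standing in for the
// std::vector<af::array> used by the real interface.
class ToyDataset {
 public:
  explicit ToyDataset(std::vector<std::vector<int>> samples)
      : samples_(std::move(samples)) {}

  // Number of samples in the dataset.
  int64_t size() const { return static_cast<int64_t>(samples_.size()); }

  // Fields of the idx-th sample; idx must be in [0, size()).
  std::vector<int> get(int64_t idx) const { return samples_[idx]; }

 private:
  std::vector<std::vector<int>> samples_;
};
```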

BatchDatasetPolicy

enum fl::BatchDatasetPolicy

Policy for handling corner cases when the dataset size is not exactly divisible by batchsize while performing batching.

Values:

INCLUDE_LAST = 0

The last samples not evenly divisible by batchsize are packed into a smaller-than-usual batch.

SKIP_LAST = 1

The last samples not evenly divisible by batchsize are skipped.

DIVISIBLE_ONLY = 2

Constructor raises an error if sizes are not divisible.
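The effect of each policy on the number of batches can be sketched as follows; `numBatches` is a hypothetical helper, not a library function:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

enum class Policy { INCLUDE_LAST, SKIP_LAST, DIVISIBLE_ONLY };

// Number of batches a BatchDataset-like view exposes for a dataset of
// n samples, under each corner-case policy.
int64_t numBatches(int64_t n, int64_t batchsize, Policy policy) {
  switch (policy) {
    case Policy::INCLUDE_LAST:
      return (n + batchsize - 1) / batchsize; // round up; last batch smaller
    case Policy::SKIP_LAST:
      return n / batchsize; // round down; remainder dropped
    case Policy::DIVISIBLE_ONLY:
      if (n % batchsize != 0) {
        throw std::invalid_argument("dataset size not divisible by batchsize");
      }
      return n / batchsize;
  }
  return 0; // unreachable
}
```

For example, 42 samples at batchsize 10 yield 5 batches under INCLUDE_LAST (the last holding 2 samples) but only 4 under SKIP_LAST, matching the BatchDataset example below.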

BatchDataset

class BatchDataset : public fl::Dataset

A view into a dataset where samples are packed into batches.

By default, for each field, the inputs must all have the same dimensions, and it batches along the first singleton dimension.

Example:

// Make a dataset containing 42 tensors of dims [5, 4]
auto tensor = af::randu(5, 4, 42);
std::vector<af::array> fields{{tensor}};
auto ds = std::make_shared<TensorDataset>(fields);

// Batch them with batchsize=10
BatchDataset batchds(ds, 10, BatchDatasetPolicy::INCLUDE_LAST);
std::cout << batchds.get(0)[0].dims() << "\n"; // 5 4 10 1
std::cout << batchds.get(4)[0].dims() << "\n"; // 5 4 2 1

Public Functions

BatchDataset(std::shared_ptr<const Dataset> dataset, int64_t batchsize, BatchDatasetPolicy policy = BatchDatasetPolicy::INCLUDE_LAST, const std::vector<BatchFunction> &batchfns = {})

Creates a BatchDataset.

Parameters
  • [in] dataset: The underlying dataset.

  • [in] batchsize: The desired batch size.

  • [in] policy: How to handle the last batch if sizes are indivisible.

  • [in] batchfns: Custom batch functions to use for different indices.

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

ConcatDataset

class ConcatDataset : public fl::Dataset

A view into two or more underlying datasets with the indexes concatenated in sequential order.

Example:

// Make two datasets with sizes 10 and 20
auto makeDataset = [](int size) {
  auto tensor = af::randu(5, 4, size);
  std::vector<af::array> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds1 = makeDataset(10);
auto ds2 = makeDataset(20);

// Concatenate them
ConcatDataset concatds({ds1, ds2});
std::cout << concatds.size() << "\n"; // 30
std::cout << allClose(concatds.get(15)[0], ds2->get(5)[0]) << "\n"; // 1

Public Functions

ConcatDataset(const std::vector<std::shared_ptr<const Dataset>> &datasets)

Creates a ConcatDataset.

Parameters
  • [in] datasets: The underlying datasets.

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

MergeDataset

class MergeDataset : public fl::Dataset

A view into two or more underlying datasets with the same indexes, but with fields combined from all the datasets.

The size of the MergeDataset is the max of the sizes of the input datasets.

We have MergeDataset({ds1, ds2}).get(i) == merge(ds1.get(i), ds2.get(i)) where merge concatenates the std::vector<af::array> from each dataset.

Example:

// Make two datasets
auto makeDataset = []() {
  auto tensor = af::randu(5, 4, 10);
  std::vector<af::array> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds1 = makeDataset();
auto ds2 = makeDataset();

// Merge them
MergeDataset mergeds({ds1, ds2});
std::cout << mergeds.size() << "\n"; // 10
std::cout << allClose(mergeds.get(5)[0], ds1->get(5)[0]) << "\n"; // 1
std::cout << allClose(mergeds.get(5)[1], ds2->get(5)[0]) << "\n"; // 1
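The merge described above — concatenating the field vectors of the per-dataset samples — can be sketched with plain vectors (`int`s standing in for `af::array`); `mergeFields` is a hypothetical helper:

```cpp
#include <cassert>
#include <vector>

// Sketch of merge(): the fields of one sample followed by the fields
// of another, so MergeDataset({ds1, ds2}).get(i) holds ds1's fields
// first and ds2's fields after them.
std::vector<int> mergeFields(const std::vector<int>& a,
                             const std::vector<int>& b) {
  std::vector<int> out(a);
  out.insert(out.end(), b.begin(), b.end());
  return out;
}
```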

Public Functions

MergeDataset(const std::vector<std::shared_ptr<const Dataset>> &datasets)

Creates a MergeDataset.

Parameters
  • [in] datasets: The underlying datasets.

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

ResampleDataset

class ResampleDataset : public fl::Dataset

A view into a dataset, with indices remapped.

Note: the mapping doesn’t have to be bijective.

Example:

// Make a dataset with 10 samples
auto tensor = af::randu(5, 4, 10);
std::vector<af::array> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Resample it by reversing it
auto permfn = [ds](int64_t x) { return ds->size() - 1 - x; };
ResampleDataset resampleds(ds, permfn);
std::cout << resampleds.size() << "\n"; // 10
std::cout << allClose(resampleds.get(9)[0], ds->get(0)[0]) << "\n"; // 1

Subclassed by fl::ShuffleDataset

Public Functions

ResampleDataset(std::shared_ptr<const Dataset> dataset)

Constructs a ResampleDataset with the identity mapping: ResampleDataset(ds)->get(i) == ds->get(i)

Parameters
  • [in] dataset: The underlying dataset.

ResampleDataset(std::shared_ptr<const Dataset> dataset, std::vector<int64_t> resamplevec)

Constructs a ResampleDataset with mapping specified by a vector: ResampleDataset(ds, v)->get(i) == ds->get(v[i])

Parameters
  • [in] dataset: The underlying dataset.

  • [in] resamplevec: The vector specifying the mapping.

ResampleDataset(std::shared_ptr<const Dataset> dataset, const PermutationFunction &resamplefn, int n = -1)

Constructs a ResampleDataset with mapping specified by a function: ResampleDataset(ds, fn)->get(i) == ds->get(fn(i)) The function should be deterministic.

Parameters
  • [in] dataset: The underlying dataset.

  • [in] resamplefn: The function specifying the mapping.

  • [in] n: The size of the new dataset (if -1, the size of the underlying dataset is used).

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

void resample(std::vector<int64_t> resamplevec)

Changes the mapping used to resample the dataset.

Parameters
  • [in] resamplevec: The vector specifying the new mapping.

ShuffleDataset

class ShuffleDataset : public fl::ResampleDataset

A view into a dataset, with indices permuted randomly.

Example:

// Make a dataset with 100 samples
auto tensor = af::randu(5, 4, 100);
std::vector<af::array> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Shuffle it
ShuffleDataset shuffleds(ds);
std::cout << shuffleds.size() << "\n"; // 100
af::print("first try", shuffleds.get(0)[0]);

// Reshuffle it
shuffleds.resample();
af::print("second try", shuffleds.get(0)[0]);

Public Functions

ShuffleDataset(std::shared_ptr<const Dataset> dataset)

Creates a ShuffleDataset.

Parameters
  • [in] dataset: The underlying dataset.

void resample()

Generates a new random permutation for the dataset.

void setSeed(int seed)

Sets the PRNG seed.

Parameters
  • [in] seed: The desired seed.
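The kind of seeded index permutation a ShuffleDataset applies can be sketched with the standard library; the RNG shown here is an assumption for illustration, not necessarily what fl uses internally:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sketch: a seeded random permutation of [0, n) — a bijective mapping
// like the one ShuffleDataset applies to its underlying indices.
std::vector<int64_t> randomPermutation(int64_t n, int seed) {
  std::vector<int64_t> perm(n);
  std::iota(perm.begin(), perm.end(), 0); // identity mapping 0..n-1
  std::mt19937 rng(seed);                 // fixed seed => reproducible shuffle
  std::shuffle(perm.begin(), perm.end(), rng);
  return perm;
}
```

Reseeding with the same value reproduces the same permutation, which is what makes setSeed useful for reproducible shuffling across runs.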

TensorDataset

class TensorDataset : public fl::Dataset

Dataset created by unpacking tensors along the last non-singleton dimension.

The size of the dataset is determined by the size along that dimension. Hence, it must be the same across all tensors in the input.

Example:

af::array tensor1 = af::randu(5, 4, 10);
af::array tensor2 = af::randu(7, 10);
TensorDataset ds({tensor1, tensor2});

std::cout << ds.size() << "\n"; // 10
std::cout << ds.get(0)[0].dims() << "\n"; // 5 4 1 1
std::cout << ds.get(0)[1].dims() << "\n"; // 7 1 1 1
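The "last non-singleton dimension" rule can be sketched over a 4-d shape (ArrayFire arrays always carry four dimensions); `lastNonSingletonDim` is a hypothetical helper:

```cpp
#include <array>
#include <cassert>

// Sketch of the unpacking rule: the dataset size is the extent of the
// last non-singleton dimension of a 4-d shape.
int lastNonSingletonDim(const std::array<int, 4>& dims) {
  for (int d = 3; d >= 0; --d) {
    if (dims[d] > 1) {
      return d;
    }
  }
  return 0; // all-singleton shape: treat dim 0 as the sample dimension
}
```

For the example above, tensor1 with dims [5, 4, 10, 1] is unpacked along dimension 2 and tensor2 with dims [7, 10, 1, 1] along dimension 1; both give 10 samples, so the sizes agree as required.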

Public Functions

TensorDataset(const std::vector<af::array> &datatensors)

Creates a TensorDataset by unpacking the input tensors.

Parameters
  • [in] datatensors: A vector of tensors, which will be unpacked along their last non-singleton dimensions.

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

TransformDataset

class TransformDataset : public fl::Dataset

A view into a dataset with values transformed via the specified function(s).

A different transformation may be specified for each field. A missing transformation for a field implies the identity transformation. The dataset size remains unchanged.

Example:

// Make a dataset with 10 samples
auto tensor = af::randu(5, 4, 10);
std::vector<af::array> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Transform it
auto negate = [](const af::array& arr) { return -arr; };
TransformDataset transformds(ds, {negate});
std::cout << transformds.size() << "\n"; // 10
std::cout << allClose(transformds.get(5)[0], -ds->get(5)[0]) << "\n"; // 1

Public Functions

TransformDataset(std::shared_ptr<const Dataset> dataset, const std::vector<TransformFunction> &transformfns)

Creates a TransformDataset.

Parameters
  • [in] dataset: The underlying dataset.

  • [in] transformfns: The mappings used to transform the values. If a transformation is missing for a field, the corresponding value is not transformed.

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

PrefetchDataset

class PrefetchDataset : public fl::Dataset

A view into a dataset, where a given number of samples are prefetched in advance in a ThreadPool.

PrefetchDataset should be used when the underlying dataset is accessed sequentially. Otherwise, there will be many cache misses, leading to degraded performance.

Example:

// Make a dataset with 100 samples
auto tensor = af::randu(5, 4, 100);
std::vector<af::array> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Iterate over the dataset using 4 background threads prefetching 2 samples
// in advance
for (auto& sample : PrefetchDataset(ds, 4, 2)) {
    // do something
}
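The prefetch-window mechanics can be sketched with std::async: keep up to `prefetchSize` samples in flight on background threads while consuming results in order. This is a simplified sketch (a fixed window of futures rather than a persistent thread pool), with `fetch` standing in for Dataset::get:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <future>
#include <vector>

// Sketch of sequential prefetching: up to `prefetchSize` fetches run
// in the background while earlier results are consumed in order.
std::vector<int> prefetchAll(int64_t n, int64_t prefetchSize,
                             int (*fetch)(int64_t)) {
  std::deque<std::future<int>> inflight;
  std::vector<int> out;
  int64_t next = 0;
  // Prime the window with the first prefetchSize fetches.
  while (next < n && static_cast<int64_t>(inflight.size()) < prefetchSize) {
    inflight.push_back(std::async(std::launch::async, fetch, next++));
  }
  // Consume the oldest result, then top the window back up.
  while (!inflight.empty()) {
    out.push_back(inflight.front().get());
    inflight.pop_front();
    if (next < n) {
      inflight.push_back(std::async(std::launch::async, fetch, next++));
    }
  }
  return out;
}
```

Because results are consumed strictly in order, this only pays off when access really is sequential, which is the usage constraint noted above.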

Public Functions

PrefetchDataset(std::shared_ptr<const Dataset> dataset, int64_t numThreads, int64_t prefetchSize)

Creates a PrefetchDataset.

Parameters
  • [in] dataset: The underlying dataset.

  • [in] numThreads: Number of threads in the thread pool.

  • [in] prefetchSize: Number of samples to prefetch.

int64_t size() const

Return

The size of the dataset.

std::vector<af::array> get(const int64_t idx) const

Return

The sample fields (a std::vector<af::array>).

Parameters
  • [in] idx: Index of the sample in the dataset. Must be in [0, size()).

Utils

namespace fl

Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.

This source code is licensed under the BSD-style license found in the LICENSE file in the root directory of this source tree.


Functions

std::vector<int64_t> partitionByRoundRobin(int64_t numSamples, int64_t partitionId, int64_t numPartitions, int64_t batchSz = 1)

Partitions the samples in a round-robin manner and returns the ids of the samples.

To handle end effects, the final samples are included if and only if at least one sample can fit in the last batch for all partitions.

Parameters
  • numSamples: total number of samples

  • partitionId: rank of the current partition [0, numPartitions)

  • numPartitions: total partitions

  • batchSz: batchsize to be used
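A simplified sketch of the round-robin dealing (ignoring the end-effect rule described above, so not fl's exact implementation): batches of `batchSz` consecutive ids are dealt to the partitions in turn.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified sketch of round-robin partitioning: batch k of batchSz
// consecutive ids goes to partition (k % numPartitions). The end-effect
// rule for the final partial batches is deliberately omitted here.
std::vector<int64_t> roundRobinIds(int64_t numSamples, int64_t partitionId,
                                   int64_t numPartitions, int64_t batchSz) {
  std::vector<int64_t> ids;
  for (int64_t start = partitionId * batchSz; start < numSamples;
       start += numPartitions * batchSz) {
    for (int64_t i = start; i < std::min(start + batchSz, numSamples); ++i) {
      ids.push_back(i);
    }
  }
  return ids;
}
```

For 10 samples, 2 partitions, and batchSz 2, partition 0 receives ids {0, 1, 4, 5, 8, 9} and partition 1 receives {2, 3, 6, 7}.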