Data Loading¶

Datasets¶

Dataset¶

class Dataset¶

Abstract class representing a dataset: a mapping index -> sample, where a sample is a vector of Tensors.

Can be extended to concatenate, split, batch, or resample datasets, among other operations.

A Dataset can either own its data directly, or through shared_ptr ownership of underlying Datasets.
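The two ownership styles can be illustrated with a minimal, library-free sketch. The names `ToyDataset`, `VectorDataset`, `HeadView`, and `Sample` are hypothetical and only mirror the index -> sample contract described above; fields are plain ints rather than Tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a sample: a vector of fields (ints, not Tensors).
using Sample = std::vector<int>;

// Hypothetical interface mirroring the index -> sample contract.
class ToyDataset {
 public:
  virtual ~ToyDataset() = default;
  virtual int64_t size() const = 0;
  virtual Sample get(int64_t idx) const = 0;
};

// Owns its data directly.
class VectorDataset : public ToyDataset {
 public:
  explicit VectorDataset(std::vector<Sample> data) : data_(std::move(data)) {}
  int64_t size() const override { return static_cast<int64_t>(data_.size()); }
  Sample get(int64_t idx) const override {
    assert(idx >= 0 && idx < size());
    return data_[static_cast<std::size_t>(idx)];
  }

 private:
  std::vector<Sample> data_;
};

// Shares ownership of an underlying dataset and exposes its first n samples.
class HeadView : public ToyDataset {
 public:
  HeadView(std::shared_ptr<const ToyDataset> base, int64_t n)
      : base_(std::move(base)), n_(n < base_->size() ? n : base_->size()) {}
  int64_t size() const override { return n_; }
  Sample get(int64_t idx) const override {
    assert(idx >= 0 && idx < n_);
    return base_->get(idx);
  }

 private:
  std::shared_ptr<const ToyDataset> base_; // keeps the underlying data alive
  int64_t n_;
};
```

Because the view holds a `shared_ptr`, the underlying dataset stays valid for as long as any view refers to it, even after the original handle goes out of scope.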

Subclassed by fl::BatchDataset, fl::BlobDataset, fl::ConcatDataset, fl::MergeDataset, fl::pkg::speech::ListFileDataset, fl::pkg::text::TextDataset, fl::pkg::vision::DistributedDataset, fl::pkg::vision::LoaderDataset< T >, fl::pkg::vision::TransformAllDataset, fl::PrefetchDataset, fl::ResampleDataset, fl::TensorDataset, fl::TransformDataset

Public Types

using PermutationFunction = std::function<int64_t(int64_t)>¶: A bijective mapping of dataset indices \([0, n) \to [0, n)\).

using TransformFunction = std::function<Tensor(const Tensor&)>¶: A function to transform an array.

using LoadFunction = std::function<Tensor(const std::string&)>¶: A function to load data from a file into an array.

using BatchFunction = std::function<Tensor(const std::vector<Tensor>&)>¶: A function to pack arrays into a batched array.

using DataTransformFunction = std::function<Tensor(void *, fl::Shape, fl::dtype)>¶: A function to transform data from host to array.

using iterator = detail::DatasetIterator<Dataset, std::vector<Tensor>>¶

Public Functions

virtual int64_t size() const = 0¶

Return: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const = 0¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

virtual ~Dataset()¶

iterator begin()¶

iterator end()¶

BatchDatasetPolicy¶

enum fl::BatchDatasetPolicy¶

Policy for handling corner cases when the dataset size is not exactly divisible by batchsize while performing batching.

Values:

INCLUDE_LAST = 0¶: The last samples not evenly divisible by batchsize are packed into a smaller-than-usual batch.

SKIP_LAST = 1¶: The last samples not evenly divisible by batchsize are skipped.

DIVISIBLE_ONLY = 2¶: Constructor raises an error if sizes are not divisible.
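As a rough illustration of how the three policies affect the number of batches, the following library-free sketch computes the batch count for a fixed batch size. The `Policy` enum and `numBatches` function are hypothetical names that mirror the descriptions above; they are not part of the library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical mirror of the three policies; not the library's enum.
enum class Policy { IncludeLast, SkipLast, DivisibleOnly };

// Number of batches produced from n samples with a fixed batch size.
int64_t numBatches(int64_t n, int64_t batchsize, Policy policy) {
  assert(n >= 0 && batchsize > 0);
  const int64_t full = n / batchsize;
  const bool hasRemainder = (n % batchsize) != 0;
  switch (policy) {
    case Policy::IncludeLast:
      // The leftover samples form one extra, smaller batch.
      return full + (hasRemainder ? 1 : 0);
    case Policy::SkipLast:
      // The leftover samples are dropped.
      return full;
    case Policy::DivisibleOnly:
      if (hasRemainder) {
        throw std::invalid_argument("size is not divisible by batchsize");
      }
      return full;
  }
  return full; // not reached; keeps compilers from warning
}
```

With 42 samples and a batch size of 10, this gives 5 batches for `IncludeLast`, 4 for `SkipLast`, and an exception for `DivisibleOnly`.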

BatchDataset¶

class BatchDataset : public fl::Dataset ¶

A view into a dataset where samples are packed into batches.

By default, for each field, the inputs must all have the same dimensions, and it batches along the first singleton dimension.

Example:

// Make a dataset containing 42 tensors of shape [5, 4]
auto tensor = fl::rand({5, 4, 42});
std::vector<Tensor> fields{{tensor}};
auto ds = std::make_shared<TensorDataset>(fields);

// Batch them with batchsize=10
BatchDataset batchds(ds, 10, BatchDatasetPolicy::INCLUDE_LAST);
std::cout << batchds.get(0)[0].shape() << "\n"; // 5 4 10 1
std::cout << batchds.get(4)[0].shape() << "\n"; // 5 4 2 1

// Create a vector specifying the size of each batch (dynamic batching)
std::vector<int64_t> batchSizes = {5, 10, 5, 10, 2, 10};

// Batch them with batchSizes
BatchDataset batchdsDynamic(ds, batchSizes);
std::cout << batchdsDynamic.get(0)[0].shape() << "\n"; // 5 4 5 1
std::cout << batchdsDynamic.get(5)[0].shape() << "\n"; // 5 4 10 1

Public Functions

BatchDataset(std::shared_ptr<const Dataset> dataset, int64_t batchsize, BatchDatasetPolicy policy = BatchDatasetPolicy::INCLUDE_LAST, const std::vector<BatchFunction> &batchfns = {})¶

Creates a BatchDataset.

Parameters

[in] dataset: The underlying dataset.
[in] batchsize: The desired batch size.
[in] policy: How to handle the last batch if sizes are indivisible.
[in] batchfns: Custom batch functions to use for different field indices.

BatchDataset(std::shared_ptr<const Dataset> dataset, const std::vector<int64_t> &batchSizes, const std::vector<BatchFunction> &batchfns = {})¶

Creates a BatchDataset.

Parameters

[in] dataset: The underlying dataset.
[in] batchSizes: The desired size of each batch (dynamic batching).
[in] batchfns: Custom batch functions to use for different field indices.
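With per-batch sizes, the samples covered by a given batch follow from a running sum of the preceding sizes. The sketch below shows that arithmetic under the assumption that batches are laid out consecutively; `batchRange` is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the half-open sample range [start, end) covered by batch `batchIdx`,
// assuming batches are laid out one after another in dataset order.
std::pair<int64_t, int64_t> batchRange(
    const std::vector<int64_t>& batchSizes,
    std::size_t batchIdx) {
  assert(batchIdx < batchSizes.size());
  int64_t start = 0;
  for (std::size_t i = 0; i < batchIdx; ++i) {
    start += batchSizes[i]; // skip over all earlier batches
  }
  return {start, start + batchSizes[batchIdx]};
}
```

For the sizes `{5, 10, 5, 10, 2, 10}`, batch 0 covers samples [0, 5) and batch 5 covers samples [32, 42).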

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

ConcatDataset¶

class ConcatDataset : public fl::Dataset ¶

A view into two or more underlying datasets with the indexes concatenated in sequential order.

Example:

// Make two datasets with sizes 10 and 20
auto makeDataset = [](int size) {
  auto tensor = fl::rand({5, 4, size});
  std::vector<Tensor> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds1 = makeDataset(10);
auto ds2 = makeDataset(20);

// Concatenate them
ConcatDataset concatds({ds1, ds2});
std::cout << concatds.size() << "\n"; // 30
std::cout << allClose(concatds.get(15)[0], ds2->get(5)[0]) << "\n"; // 1
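One way to resolve a concatenated index to an underlying dataset is a binary search over cumulative sizes. This is a library-free sketch of that idea, not necessarily how ConcatDataset is implemented; `locate` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Maps a global index to (dataset number, local index within that dataset).
std::pair<std::size_t, int64_t> locate(
    const std::vector<int64_t>& sizes,
    int64_t idx) {
  // cumulative[k] = total number of samples in datasets 0..k
  std::vector<int64_t> cumulative;
  int64_t total = 0;
  for (int64_t s : sizes) {
    total += s;
    cumulative.push_back(total);
  }
  assert(idx >= 0 && idx < total);
  // First dataset whose cumulative end is strictly greater than idx.
  const auto it = std::upper_bound(cumulative.begin(), cumulative.end(), idx);
  const std::size_t which = static_cast<std::size_t>(it - cumulative.begin());
  const int64_t before = (which == 0) ? 0 : cumulative[which - 1];
  return {which, idx - before};
}
```

With sizes 10 and 20, global index 15 resolves to local index 5 of the second dataset, matching the example above.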

Public Functions

ConcatDataset(const std::vector<std::shared_ptr<const Dataset>> &datasets)¶

Creates a ConcatDataset.

Parameters

[in] datasets: The underlying datasets.

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

MergeDataset¶

class MergeDataset : public fl::Dataset ¶

A view into two or more underlying datasets with the same indexes, but with fields combined from all the datasets.

The size of the MergeDataset is the max of the sizes of the input datasets.

We have MergeDataset({ds1, ds2}).get(i) == merge(ds1.get(i), ds2.get(i)) where merge concatenates the std::vector<Tensor> from each dataset.

Example:

// Make two datasets
auto makeDataset = []() {
  auto tensor = fl::rand({5, 4, 10});
  std::vector<Tensor> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds1 = makeDataset();
auto ds2 = makeDataset();

// Merge them
MergeDataset mergeds({ds1, ds2});
std::cout << mergeds.size() << "\n"; // 10
std::cout << allClose(mergeds.get(5)[0], ds1->get(5)[0]) << "\n"; // 1
std::cout << allClose(mergeds.get(5)[1], ds2->get(5)[0]) << "\n"; // 1
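The merge semantics can be sketched without the library as follows. `Fields`, `ToyData`, `mergedSize`, and `mergedGet` are hypothetical names. The handling of indices beyond the end of a shorter dataset is an assumption made for this sketch (such a dataset contributes no fields); the text above does not specify it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Fields = std::vector<int>;      // hypothetical stand-in for std::vector<Tensor>
using ToyData = std::vector<Fields>;  // a dataset: one Fields entry per sample

// Size of the merged view: the max of the input sizes.
std::size_t mergedSize(const std::vector<ToyData>& datasets) {
  std::size_t n = 0;
  for (const auto& ds : datasets) {
    n = std::max(n, ds.size());
  }
  return n;
}

// Concatenates the fields of sample idx from every dataset, in order.
Fields mergedGet(const std::vector<ToyData>& datasets, std::size_t idx) {
  assert(idx < mergedSize(datasets));
  Fields out;
  for (const auto& ds : datasets) {
    // Assumption: a dataset too short for idx contributes no fields.
    if (idx < ds.size()) {
      out.insert(out.end(), ds[idx].begin(), ds[idx].end());
    }
  }
  return out;
}
```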

Public Functions

MergeDataset(const std::vector<std::shared_ptr<const Dataset>> &datasets)¶

Creates a MergeDataset.

Parameters

[in] datasets: The underlying datasets.

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

ResampleDataset¶

class ResampleDataset : public fl::Dataset ¶

A view into a dataset, with indices remapped.

Note: the mapping doesn’t have to be bijective.

Example:

// Make a dataset with 10 samples
auto tensor = fl::rand({5, 4, 10});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Resample it by reversing it
auto permfn = [ds](int64_t x) { return ds->size() - 1 - x; };
ResampleDataset resampleds(ds, permfn);
std::cout << resampleds.size() << "\n"; // 10
std::cout << allClose(resampleds.get(9)[0], ds->get(0)[0]) << "\n"; // 1

Subclassed by fl::ShuffleDataset

Public Functions

ResampleDataset(std::shared_ptr<const Dataset> dataset)¶

Constructs a ResampleDataset with the identity mapping: ResampleDataset(ds)->get(i) == ds->get(i)

Parameters

[in] dataset: The underlying dataset.

ResampleDataset(std::shared_ptr<const Dataset> dataset, std::vector<int64_t> resamplevec)¶

Constructs a ResampleDataset with mapping specified by a vector: ResampleDataset(ds, v)->get(i) == ds->get(v[i])

Parameters

[in] dataset: The underlying dataset.
[in] resamplevec: The vector specifying the mapping.

ResampleDataset(std::shared_ptr<const Dataset> dataset, const PermutationFunction &resamplefn, int n = -1)¶

Constructs a ResampleDataset with mapping specified by a function: ResampleDataset(ds, fn)->get(i) == ds->get(fn(i)). The function should be deterministic.

Parameters

[in] dataset: The underlying dataset.
[in] resamplefn: The function specifying the mapping.
[in] n: The size of the new dataset (if -1, uses previous size)
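A function-based mapping with an explicit size can be pictured as materializing an index vector. The following is a library-free sketch of that idea; `buildResampleVec` is a hypothetical helper, and the bounds check on `fn(i)` is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Materializes the mapping i -> fn(i) for i in [0, n) into an index vector.
std::vector<int64_t> buildResampleVec(
    int64_t baseSize,
    const std::function<int64_t(int64_t)>& fn,
    int64_t n = -1) {
  if (n < 0) {
    n = baseSize; // -1 keeps the size of the underlying dataset
  }
  std::vector<int64_t> out;
  out.reserve(static_cast<std::size_t>(n));
  for (int64_t i = 0; i < n; ++i) {
    const int64_t j = fn(i);
    assert(j >= 0 && j < baseSize); // must point at a valid underlying sample
    out.push_back(j);
  }
  return out;
}
```

The mapping need not be bijective: `fn(i) = 2 * i` with `n = 5` on a 10-sample dataset selects every other sample, and a constant function repeats a single sample.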

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

void resample(std::vector<int64_t> resamplevec)¶

Changes the mapping used to resample the dataset.

Parameters

[in] resamplevec: The vector specifying the new mapping.

ShuffleDataset¶

class ShuffleDataset : public fl::ResampleDataset ¶

A view into a dataset, with indices permuted randomly.

Example:

// Make a dataset with 100 samples
auto tensor = fl::rand({5, 4, 100});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Shuffle it
ShuffleDataset shuffleds(ds);
std::cout << shuffleds.size() << "\n"; // 100
std::cout << "first try: " << shuffleds.get(0)[0] << std::endl;

// Reshuffle it
shuffleds.resample();
std::cout << "second try: " << shuffleds.get(0)[0] << std::endl;

Public Functions

ShuffleDataset(std::shared_ptr<const Dataset> dataset, int seed = 0)¶

Creates a ShuffleDataset.

Parameters

[in] dataset: The underlying dataset.
[in] seed: The PRNG seed (defaults to 0).

void resample()¶: Generates a new random permutation for the dataset.

void setSeed(int seed)¶

Sets the PRNG seed.

Parameters

[in] seed: The desired seed.
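The role of the seed can be illustrated with a standard-library sketch: the same seed reproduces the same permutation. The choice of `std::mt19937_64` and `std::shuffle` is an assumption made for illustration; it is not a statement about which generator ShuffleDataset uses. `seededPermutation` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Builds a random permutation of [0, n) that is reproducible for a given seed.
std::vector<int64_t> seededPermutation(int64_t n, unsigned seed) {
  assert(n >= 0);
  std::vector<int64_t> perm(static_cast<std::size_t>(n));
  std::iota(perm.begin(), perm.end(), int64_t{0}); // 0, 1, ..., n - 1
  std::mt19937_64 rng(seed);
  std::shuffle(perm.begin(), perm.end(), rng);
  return perm;
}
```

Calling the function twice with the same seed yields identical index vectors, which is what makes a shuffled run repeatable.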

TensorDataset¶

class TensorDataset : public fl::Dataset ¶

Dataset created by unpacking tensors along the last non-singleton dimension.

The size of the dataset is determined by the size along that dimension. Hence, it must be the same across all tensors in the input.

Example:

Tensor tensor1 = fl::rand({5, 4, 10});
Tensor tensor2 = fl::rand({7, 10});
TensorDataset ds({tensor1, tensor2});

std::cout << ds.size() << "\n"; // 10
std::cout << ds.get(0)[0].shape() << "\n"; // 5 4
std::cout << ds.get(0)[1].shape() << "\n"; // 7 1
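Unpacking along the last dimension can be pictured on a flat buffer. The sketch below assumes a two-dimensional [rows, n] array stored in column-major order, so that each sample is a contiguous run of `rows` elements; both the layout and the helper name `sliceLastDim` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Extracts sample `idx` from a [rows, n] array stored column-major in `data`.
// Under that layout, sample idx occupies elements [idx * rows, (idx + 1) * rows).
std::vector<float> sliceLastDim(
    const std::vector<float>& data,
    std::size_t rows,
    std::size_t n,
    std::size_t idx) {
  assert(data.size() == rows * n && idx < n);
  const auto first = data.begin() + static_cast<std::ptrdiff_t>(idx * rows);
  return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(rows));
}
```

The number of samples is `n`, the extent of the last dimension, which is why every input must agree on that extent.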

Public Functions

TensorDataset(const std::vector<Tensor> &datatensors)¶

Creates a TensorDataset by unpacking the input tensors.

Parameters

[in] datatensors: A vector of tensors, which will be unpacked along their last non-singleton dimensions.

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

TransformDataset¶

class TransformDataset : public fl::Dataset ¶

A view into a dataset with values transformed via the specified function(s).

A different transformation may be specified for each array in the input. A null TransformFunction specifies the identity transformation. The dataset size remains unchanged.

Example:

// Make a dataset with 10 samples
auto tensor = fl::rand({5, 4, 10});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Transform it
auto negate = [](const Tensor& arr) { return -arr; };
TransformDataset transformds(ds, {negate});
std::cout << transformds.size() << "\n"; // 10
std::cout << allClose(transformds.get(5)[0], -ds->get(5)[0]) << "\n"; // 1
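The per-field behaviour, including a null function acting as the identity, can be sketched with `std::function`, which converts to `false` when empty. `Field`, `FieldFn`, and `applyTransforms` are hypothetical names; treating fields that have no corresponding function at all as unchanged is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Field = std::vector<double>; // hypothetical stand-in for a Tensor
using FieldFn = std::function<Field(const Field&)>;

// Applies fns[i] to field i of the sample; empty functions act as identity.
std::vector<Field> applyTransforms(
    const std::vector<Field>& sample,
    const std::vector<FieldFn>& fns) {
  std::vector<Field> out;
  out.reserve(sample.size());
  for (std::size_t i = 0; i < sample.size(); ++i) {
    if (i < fns.size() && fns[i]) {
      out.push_back(fns[i](sample[i]));
    } else {
      // Null function (or, by assumption, no function given): keep as-is.
      out.push_back(sample[i]);
    }
  }
  return out;
}
```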

Public Functions

TransformDataset(std::shared_ptr<const Dataset> dataset, const std::vector<TransformFunction> &transformfns)¶

Creates a TransformDataset.

Parameters

[in] dataset: The underlying dataset.
[in] transformfns: The mappings used to transform the values. If a TransformFunction is null then the corresponding value is not transformed.

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

PrefetchDataset¶

class PrefetchDataset : public fl::Dataset ¶

A view into a dataset, where a given number of samples are prefetched in advance in a ThreadPool.

PrefetchDataset should be used when the underlying dataset is accessed sequentially. Otherwise, there will be many cache misses, leading to degraded performance.

Example:

// Make a dataset with 100 samples
auto tensor = fl::rand({5, 4, 100});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Iterate over the dataset using 4 background threads prefetching 2 samples
// in advance
for (auto& sample : PrefetchDataset(ds, 4, 2)) {
    // do something
}

Public Functions

PrefetchDataset(std::shared_ptr<const Dataset> dataset, int64_t numThreads, int64_t prefetchSize)¶

Creates a PrefetchDataset.

Parameters

[in] dataset: The underlying dataset.
[in] numThreads: Number of threads used by the threadpool
[in] prefetchSize: Number of samples to be prefetched

int64_t size() const¶

Return: The size of the dataset.

std::vector<Tensor> get(const int64_t idx) const¶

Return

The sample fields (a std::vector<Tensor>).

Parameters

[in] idx: Index of the sample in the dataset. Must be in [0, size()).

Utils¶

std::vector<int64_t> fl::partitionByRoundRobin(int64_t numSamples, int64_t partitionId, int64_t numPartitions, int64_t batchSz = 1, bool allowEmpty = false)¶

Partitions the samples in a round-robin manner and returns the ids of the samples assigned to the given partition.

To handle end effects, the final samples are included if and only if every partition can fit at least one sample in its last batch.

Parameters

numSamples: total number of samples
partitionId: rank of the current partition [0, numPartitions)
numPartitions: total partitions
batchSz: batchsize to be used
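The following is a simplified, library-free sketch of round-robin partitioning with the end-effect rule described above. It is an illustration under stated assumptions, not the library's implementation: `roundRobinIds` is a hypothetical name, `allowEmpty` is not modelled, and the way leftover samples are spread across partitions in the tail is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> roundRobinIds(
    int64_t numSamples,
    int64_t partitionId,
    int64_t numPartitions,
    int64_t batchSz = 1) {
  assert(numSamples >= 0 && batchSz > 0);
  assert(partitionId >= 0 && partitionId < numPartitions);
  // Samples consumed per step by all partitions together.
  const int64_t globalBatch = numPartitions * batchSz;
  const int64_t fullSteps = numSamples / globalBatch;
  const int64_t leftover = numSamples % globalBatch;

  std::vector<int64_t> ids;
  // Full steps: partition p takes the p-th contiguous slice of batchSz samples.
  for (int64_t step = 0; step < fullSteps; ++step) {
    const int64_t start = step * globalBatch + partitionId * batchSz;
    for (int64_t b = 0; b < batchSz; ++b) {
      ids.push_back(start + b);
    }
  }
  // Tail: kept only if every partition can receive at least one sample.
  if (leftover >= numPartitions) {
    const int64_t base = leftover / numPartitions;
    const int64_t extra = leftover % numPartitions; // first `extra` get one more
    const int64_t start = fullSteps * globalBatch + base * partitionId +
        std::min(partitionId, extra);
    const int64_t count = base + (partitionId < extra ? 1 : 0);
    for (int64_t b = 0; b < count; ++b) {
      ids.push_back(start + b);
    }
  }
  return ids;
}
```

With 10 samples, 2 partitions, and a batch size of 2, partition 0 receives ids 0, 1, 4, 5, 8 and partition 1 receives 2, 3, 6, 7, 9. With 9 samples, the single leftover sample cannot be shared by both partitions, so it is dropped.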

std::pair<std::vector<int64_t>, std::vector<int64_t>> fl::dynamicPartitionByRoundRobin(const std::vector<float> &samplesSize, int64_t partitionId, int64_t numPartitions, int64_t maxSizePerBatch, bool allowEmpty = false)¶

Partitions the samples in a round-robin manner and returns the ids of the samples, using dynamic batching: the maximum number of tokens in a batch (including padded tokens) is maxSizePerBatch.

Parameters

samplesSize: length of each sample in tokens
partitionId: rank of the current partition [0, numPartitions)
numPartitions: total partitions
maxSizePerBatch: maximum total number of tokens in a batch (including padded tokens)
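The token budget with padding can be read as: (number of samples in the batch) times (length of the longest sample in the batch) must not exceed the limit. The sketch below shows one greedy way to form such batches from consecutive samples. It covers only the batching step, not the partitioning, and is not the library's algorithm; `greedyBatchSizes` is a hypothetical name, and placing an over-budget sample alone in its own batch is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Greedily groups consecutive samples so that
// (samples in batch) * (longest sample in batch) <= maxSizePerBatch.
// Returns the resulting batch sizes.
std::vector<int64_t> greedyBatchSizes(
    const std::vector<float>& samplesSize,
    int64_t maxSizePerBatch) {
  assert(maxSizePerBatch > 0);
  const float budget = static_cast<float>(maxSizePerBatch);
  std::vector<int64_t> batchSizes;
  int64_t count = 0;    // samples in the batch being built
  float longest = 0.0f; // longest sample in the batch being built
  for (float len : samplesSize) {
    const float newLongest = std::max(longest, len);
    // Padded cost if this sample joined the current batch.
    if (count > 0 && newLongest * static_cast<float>(count + 1) > budget) {
      batchSizes.push_back(count); // close the current batch
      count = 0;
      longest = 0.0f;
    }
    // Assumption: a sample that exceeds the budget alone still gets a batch.
    longest = std::max(longest, len);
    ++count;
  }
  if (count > 0) {
    batchSizes.push_back(count);
  }
  return batchSizes;
}
```

For sample lengths `{3, 4, 2, 8, 1}` and a budget of 8, the first two samples share a batch (padded cost 2 x 4 = 8) and each remaining sample ends up in its own batch.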

Tensor fl::makeBatch(const std::vector<Tensor> &data, const Dataset::BatchFunction &batchFn = {})¶

Make batch by applying batchFn to the data.

Parameters

data: data to be batched
batchFn: function which is applied to make a batch

std::vector<Tensor> fl::makeBatchFromRange(std::shared_ptr<const Dataset> dataset, std::vector<Dataset::BatchFunction> batchFns, int64_t start, int64_t end)¶

Make a batch from a range of indices [start, end) by applying a set of batch functions.

Parameters

dataset: dataset from which the samples are taken
batchFns: set of functions which are applied to make a batch
start: start index
end: end index