Data Loading¶

Datasets¶

Dataset¶

class Dataset¶

Abstract class representing a dataset: a mapping index -> sample, where a sample is a vector of Tensors.

Can be extended to concat, split, batch, resample, etc. datasets.

A Dataset can either own its data directly, or through shared_ptr ownership of underlying Datasets.

Subclassed by fl::BatchDataset, fl::BlobDataset, fl::ConcatDataset, fl::MergeDataset, fl::PrefetchDataset, fl::ResampleDataset, fl::SpanDataset, fl::TensorDataset, fl::TransformDataset, fl::pkg::speech::ListFileDataset, fl::pkg::text::TextDataset, fl::pkg::vision::DistributedDataset, fl::pkg::vision::LoaderDataset< T >, fl::pkg::vision::TransformAllDataset

Public Types

using PermutationFunction = std::function<int64_t(int64_t)>¶: A bijective mapping of dataset indices \([0, n) \to [0, n)\).

using TransformFunction = std::function<Tensor(const Tensor&)>¶: A function to transform an array.

using LoadFunction = std::function<Tensor(const std::string&)>¶: A function to load data from a file into an array.

using BatchFunction = std::function<Tensor(const std::vector<Tensor>&)>¶: A function to pack arrays into a batched array.

using DataTransformFunction = std::function<Tensor(void*, fl::Shape, fl::dtype)>¶: A function to transform data from host to array.

using iterator = detail::DatasetIterator<Dataset, std::vector<Tensor>>¶

Public Functions

virtual int64_t size() const = 0¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const = 0¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

virtual ~Dataset() = default¶

inline iterator begin()¶

inline iterator end()¶
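
The index -> sample contract can be illustrated without the library. The following is a minimal standalone sketch, not Flashlight code: `Field`, `Sample`, `ToyDataset`, and `RangeDataset` are hypothetical stand-ins, with `std::vector<double>` playing the role of a Tensor.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: a field is a vector of doubles and a sample is a
// vector of fields (mirroring std::vector<Tensor> in the real interface).
using Field = std::vector<double>;
using Sample = std::vector<Field>;

// Toy abstract interface with the same two pure virtual methods.
class ToyDataset {
 public:
  virtual ~ToyDataset() = default;
  virtual int64_t size() const = 0;
  virtual Sample get(int64_t idx) const = 0;
};

// A dataset of n samples; sample i has a single field holding the value i.
class RangeDataset : public ToyDataset {
 public:
  explicit RangeDataset(int64_t n) : n_(n) {}

  int64_t size() const override { return n_; }

  Sample get(int64_t idx) const override {
    assert(idx >= 0 && idx < n_);  // idx must be in [0, size())
    return Sample{Field{static_cast<double>(idx)}};
  }

 private:
  int64_t n_;
};
```

Any class that implements `size()` and `get()` this way can then be wrapped by views that only rely on those two methods.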

BatchDatasetPolicy¶

enum class fl::BatchDatasetPolicy¶

Policy for handling corner cases when the dataset size is not exactly divisible by batchsize while performing batching.

Values:

enumerator INCLUDE_LAST¶: The last samples not evenly divisible by batchsize are packed into a smaller-than-usual batch.

enumerator SKIP_LAST¶: The last samples not evenly divisible by batchsize are skipped.

enumerator DIVISIBLE_ONLY¶: Constructor raises an error if sizes are not divisible.
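
The effect of each policy on the number of batches can be sketched as plain arithmetic. This is an illustration only: `Policy` is a local mirror of the enum and `numBatches` is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Local mirror of the three policies so the sketch compiles on its own.
enum class Policy { INCLUDE_LAST, SKIP_LAST, DIVISIBLE_ONLY };

// Hypothetical helper: number of batches for `numSamples` samples with a
// fixed `batchSize` under the given policy.
int64_t numBatches(int64_t numSamples, int64_t batchSize, Policy policy) {
  assert(numSamples >= 0 && batchSize > 0);
  const int64_t full = numSamples / batchSize;
  const bool hasRemainder = (numSamples % batchSize) != 0;
  switch (policy) {
    case Policy::INCLUDE_LAST:
      // Leftover samples form one extra, smaller batch.
      return full + (hasRemainder ? 1 : 0);
    case Policy::SKIP_LAST:
      // Leftover samples are dropped.
      return full;
    case Policy::DIVISIBLE_ONLY:
      if (hasRemainder) {
        throw std::invalid_argument("dataset size not divisible by batch size");
      }
      return full;
  }
  return full;  // not reached; keeps compilers quiet
}
```

With 42 samples and a batch size of 10, this gives 5 batches for `INCLUDE_LAST`, 4 for `SKIP_LAST`, and an exception for `DIVISIBLE_ONLY`.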

BatchDataset¶

class BatchDataset : public fl::Dataset ¶

A view into a dataset where samples are packed into batches.

By default, for each field, all inputs must have the same dimensions, and batching is done along the first singleton dimension.

Example:

// Make a dataset containing 42 tensors of shape [5, 4]
auto tensor = fl::rand({5, 4, 42});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Batch them with batchsize=10
BatchDataset batchds(ds, 10, BatchDatasetPolicy::INCLUDE_LAST);
std::cout << batchds.get(0)[0].shape() << "\n"; // 5 4 10 1
std::cout << batchds.get(4)[0].shape() << "\n"; // 5 4 2 1

// Create a vector specifying the size of each batch (dynamic batching)
std::vector<int64_t> batchSizes = {5, 10, 5, 10, 2, 10};

// Batch them with batchSizes
BatchDataset batchdsDynamic(ds, batchSizes);
std::cout << batchdsDynamic.get(0)[0].shape() << "\n"; // 5 4 5 1
std::cout << batchdsDynamic.get(5)[0].shape() << "\n"; // 5 4 10 1

Public Functions

BatchDataset(std::shared_ptr<const Dataset> dataset, int64_t batchsize, BatchDatasetPolicy policy = BatchDatasetPolicy::INCLUDE_LAST, const std::vector<BatchFunction> &batchfns = {})¶

Creates a BatchDataset.

Parameters:

dataset – [in] The underlying dataset.
batchsize – [in] The desired batch size.
policy – [in] How to handle the last batch if sizes are indivisible.
batchfns – [in] Custom batch functions to use for different indices.

BatchDataset(std::shared_ptr<const Dataset> dataset, const std::vector<int64_t> &batchSizes, const std::vector<BatchFunction> &batchfns = {})¶

Creates a BatchDataset.

Parameters:

dataset – [in] The underlying dataset.
batchSizes – [in] The desired batch sizes (dynamic).
batchfns – [in] Custom batch functions to use for different indices.
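
One way to picture dynamic batch sizes is as a running sum that assigns each batch a half-open range of consecutive samples. The sketch below assumes exactly that layout; `batchRanges` is a hypothetical helper and does not describe how the library stores its batches internally.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: turn per-batch sizes into half-open sample ranges
// [start, end), one per batch, using a running sum.
std::vector<std::pair<int64_t, int64_t>> batchRanges(
    const std::vector<int64_t>& batchSizes) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  ranges.reserve(batchSizes.size());
  int64_t start = 0;
  for (int64_t sz : batchSizes) {
    assert(sz > 0);  // assumption: every batch holds at least one sample
    ranges.emplace_back(start, start + sz);
    start += sz;
  }
  return ranges;
}
```

For the sizes `{5, 10, 5, 10, 2, 10}` the ranges cover 42 samples in total, with batch 4 spanning samples [30, 32) and batch 5 spanning [32, 42).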

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

ConcatDataset¶

class ConcatDataset : public fl::Dataset ¶

A view into two or more underlying datasets with the indexes concatenated in sequential order.

Example:

// Make two datasets with sizes 10 and 20
auto makeDataset = [](int size) {
  auto tensor = fl::rand({5, 4, size});
  std::vector<Tensor> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds1 = makeDataset(10);
auto ds2 = makeDataset(20);

// Concatenate them
ConcatDataset concatds({ds1, ds2});
std::cout << concatds.size() << "\n"; // 30
std::cout << allClose(concatds.get(15)[0], ds2->get(5)[0]) << "\n"; // 1
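
The index arithmetic behind the example (index 15 of the concatenation is index 5 of the second dataset) can be sketched as follows. `locate` is a hypothetical helper written for illustration; it works on dataset sizes only and is not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: map a global index to (dataset number, local index)
// given the sizes of the concatenated datasets, in order.
std::pair<int64_t, int64_t> locate(
    const std::vector<int64_t>& sizes, int64_t idx) {
  assert(idx >= 0);
  for (std::size_t d = 0; d < sizes.size(); ++d) {
    if (idx < sizes[d]) {
      return {static_cast<int64_t>(d), idx};
    }
    idx -= sizes[d];  // skip past dataset d
  }
  assert(false && "index out of range");
  return {-1, -1};
}
```

A linear scan is enough for a handful of datasets; with many datasets, a binary search over cumulative sizes would be the usual alternative.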

Public Functions

explicit ConcatDataset(const std::vector<std::shared_ptr<const Dataset>> &datasets)¶

Creates a ConcatDataset.

Parameters:: datasets – [in] The underlying datasets.

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

MergeDataset¶

class MergeDataset : public fl::Dataset ¶

A view into two or more underlying datasets with the same indexes, but with fields combined from all the datasets.

The size of the MergeDataset is the max of the sizes of the input datasets.

We have MergeDataset({ds1, ds2}).get(i) == merge(ds1.get(i), ds2.get(i)) where merge concatenates the std::vector<Tensor> from each dataset.

Example:

// Make two datasets
auto makeDataset = []() {
  auto tensor = fl::rand({5, 4, 10});
  std::vector<Tensor> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds1 = makeDataset();
auto ds2 = makeDataset();

// Merge them
MergeDataset mergeds({ds1, ds2});
std::cout << mergeds.size() << "\n"; // 10
std::cout << allClose(mergeds.get(5)[0], ds1->get(5)[0]) << "\n"; // 1
std::cout << allClose(mergeds.get(5)[1], ds2->get(5)[0]) << "\n"; // 1
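
The merge step itself, concatenating the field vectors of the samples found at the same index, can be sketched in isolation. `Field`, `Sample`, and `mergeSamples` are hypothetical stand-ins (a field is a `std::vector<double>` here, not a Tensor).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a Tensor field and a sample of fields.
using Field = std::vector<double>;
using Sample = std::vector<Field>;

// Concatenate the fields of several samples, keeping dataset order:
// all fields of the first sample, then all fields of the second, etc.
Sample mergeSamples(const std::vector<Sample>& samples) {
  std::size_t total = 0;
  for (const auto& s : samples) {
    total += s.size();
  }
  Sample merged;
  merged.reserve(total);
  for (const auto& s : samples) {
    merged.insert(merged.end(), s.begin(), s.end());
  }
  assert(merged.size() == total);
  return merged;
}
```

Merging a one-field sample with a two-field sample yields three fields, which matches the example above where field 1 of the merged sample comes from the second dataset.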

Public Functions

explicit MergeDataset(const std::vector<std::shared_ptr<const Dataset>> &datasets)¶

Creates a MergeDataset.

Parameters:: datasets – [in] The underlying datasets.

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

ResampleDataset¶

class ResampleDataset : public fl::Dataset ¶

A view into a dataset, with indices remapped.

Note: the mapping doesn’t have to be bijective.

Example:

// Make a dataset with 10 samples
auto tensor = fl::rand({5, 4, 10});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Resample it by reversing it
auto permfn = [ds](int64_t x) { return ds->size() - 1 - x; };
ResampleDataset resampleds(ds, permfn);
std::cout << resampleds.size() << "\n"; // 10
std::cout << allClose(resampleds.get(9)[0], ds->get(0)[0]) << "\n"; // 1

Subclassed by fl::ShuffleDataset

Public Functions

explicit ResampleDataset(std::shared_ptr<const Dataset> dataset)¶

Constructs a ResampleDataset with the identity mapping: ResampleDataset(ds)->get(i) == ds->get(i)

Parameters:: dataset – [in] The underlying dataset.

ResampleDataset(std::shared_ptr<const Dataset> dataset, std::vector<int64_t> resamplevec)¶

Constructs a ResampleDataset with mapping specified by a vector: ResampleDataset(ds, v)->get(i) == ds->get(v[i])

Parameters:

dataset – [in] The underlying dataset.
resamplevec – [in] The vector specifying the mapping.

ResampleDataset(std::shared_ptr<const Dataset> dataset, const PermutationFunction &resamplefn, int n = -1)¶

Constructs a ResampleDataset with mapping specified by a function: ResampleDataset(ds, fn)->get(i) == ds->get(fn(i)). The function should be deterministic.

Parameters:

dataset – [in] The underlying dataset.
resamplefn – [in] The function specifying the mapping.
n – [in] The size of the new dataset (if -1, uses previous size)
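
Because the mapping function is expected to be deterministic, it can be evaluated once per output index and stored as a vector. The sketch below shows that idea; `buildResampleVec` and its `sourceSize` parameter are hypothetical, and whether the library caches the mapping this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: evaluate a resampling function for indices 0..n-1
// and store the results, so later lookups are plain vector reads.
std::vector<int64_t> buildResampleVec(
    const std::function<int64_t(int64_t)>& fn,
    int64_t n,
    int64_t sourceSize) {
  assert(n >= 0);
  std::vector<int64_t> vec(static_cast<std::size_t>(n));
  for (int64_t i = 0; i < n; ++i) {
    const int64_t src = fn(i);
    assert(src >= 0 && src < sourceSize);  // must point into the source
    vec[static_cast<std::size_t>(i)] = src;
  }
  return vec;
}
```

The mapping need not be a permutation: `fn(x) = x / 2` with `n = 4` produces `{0, 0, 1, 1}`, which repeats samples.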

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

void resample(std::vector<int64_t> resamplevec)¶

Changes the mapping used to resample the dataset.

Parameters:: resamplevec – [in] The vector specifying the new mapping.

ShuffleDataset¶

class ShuffleDataset : public fl::ResampleDataset ¶

A view into a dataset, with indices permuted randomly.

Example:

// Make a dataset with 100 samples
auto tensor = fl::rand({5, 4, 100});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Shuffle it
ShuffleDataset shuffleds(ds);
std::cout << shuffleds.size() << "\n"; // 100
std::cout << "first try" << shuffleds.get(0)[0] << std::endl;

// Reshuffle it
shuffleds.resample();
std::cout << "second try" << shuffleds.get(0)[0] << std::endl;

Public Functions

explicit ShuffleDataset(std::shared_ptr<const Dataset> dataset, int seed = 0)¶

Creates a ShuffleDataset.

Parameters:

dataset – [in] The underlying dataset.
seed – [in] The initial seed to be used.

void resample()¶: Generates a new random permutation for the dataset.

void setSeed(int seed)¶

Sets the PRNG seed.

Parameters:: seed – [in] The desired seed.
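
A seeded random permutation of indices can be sketched with the standard library alone. `randomPermutation` is a hypothetical helper, and the choice of `std::mt19937_64` is an assumption made for illustration; it says nothing about which PRNG the library uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: a random permutation of 0..n-1 from a seeded PRNG.
std::vector<int64_t> randomPermutation(int64_t n, unsigned seed) {
  assert(n >= 0);
  std::vector<int64_t> perm(static_cast<std::size_t>(n));
  std::iota(perm.begin(), perm.end(), int64_t{0});  // 0, 1, ..., n-1
  std::mt19937_64 rng(seed);
  std::shuffle(perm.begin(), perm.end(), rng);
  return perm;
}
```

Calling the helper twice with the same seed in the same build returns the same permutation, which is what makes seeded shuffling reproducible.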

SpanDataset¶

class SpanDataset : public fl::Dataset ¶

A view into an underlying dataset with an offset and optional bounded length.

The size of the SpanDataset is either specified explicitly or, by default, equal to the size of the input dataset minus the offset.

For example, SpanDataset(ds, 13).get(i) == ds.get(13 + i).

Example:

// Make a dataset
auto makeDataset = []() {
  auto tensor = fl::rand({5, 4, 10});
  std::vector<Tensor> fields{tensor};
  return std::make_shared<TensorDataset>(fields);
};
auto ds = makeDataset();

// Create two spanned datasets
SpanDataset spands1(ds, 2);
SpanDataset spands2(ds, 0, 2);
std::cout << spands1.size() << "\n"; // 8
std::cout << spands2.size() << "\n"; // 2
std::cout << allClose(spands1.get(3)[0], ds->get(5)[0]) << "\n"; // 1
std::cout << allClose(spands2.get(1)[0], ds->get(1)[0]) << "\n"; // 1

Public Functions

explicit SpanDataset(std::shared_ptr<const Dataset> dataset, const int64_t offset, const int64_t length = -1)¶

Creates a SpanDataset.

Parameters:

dataset – [in] The underlying dataset.
offset – [in] The starting index of the new dataset relative to the underlying dataset.
length – [in] The size of the new dataset (if -1, uses previous size minus the offset)
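
The size and index rules for a span reduce to two small formulas, sketched here. `spanSize` and `spanSourceIndex` are hypothetical helpers; the check that an explicit length must fit within the remaining samples is an assumption of this sketch, not documented library behavior.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: size of a span over a dataset of `datasetSize`
// samples, starting at `offset`, with an optional explicit `length`.
int64_t spanSize(int64_t datasetSize, int64_t offset, int64_t length = -1) {
  assert(offset >= 0 && offset <= datasetSize);
  const int64_t remaining = datasetSize - offset;
  if (length < 0) {
    return remaining;  // default: everything after the offset
  }
  assert(length <= remaining);  // assumption: the span must fit
  return length;
}

// Hypothetical helper: index in the underlying dataset for span index i.
int64_t spanSourceIndex(int64_t offset, int64_t i) {
  return offset + i;
}
```

For a 10-sample dataset, an offset of 2 gives a span of size 8, and an offset of 0 with length 2 gives a span of size 2, matching the example above.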

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

TensorDataset¶

class TensorDataset : public fl::Dataset ¶

Dataset created by unpacking tensors along the last non-singleton dimension.

The size of the dataset is determined by the size along that dimension. Hence, it must be the same across all tensors in the input.

Example:

Tensor tensor1 = fl::rand({5, 4, 10});
Tensor tensor2 = fl::rand({7, 10});
TensorDataset ds({tensor1, tensor2});

std::cout << ds.size() << "\n"; // 10
std::cout << ds.get(0)[0].shape() << "\n"; // 5 4
std::cout << ds.get(0)[1].shape() << "\n"; // 7 1
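
Unpacking along the last dimension can be pictured on a flat buffer. The sketch below assumes a column-major layout, so that the last dimension varies slowest and each sample is one contiguous chunk; `unpackLastDim` is a hypothetical helper operating on `std::vector<double>`, not on Tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split a flat, column-major buffer whose last
// dimension has n entries into n contiguous per-sample slices.
std::vector<std::vector<double>> unpackLastDim(
    const std::vector<double>& flat, int64_t n) {
  assert(n > 0);
  const std::size_t count = static_cast<std::size_t>(n);
  assert(flat.size() % count == 0);
  const std::size_t stride = flat.size() / count;  // elements per sample
  std::vector<std::vector<double>> samples;
  samples.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    const auto first = flat.begin() + static_cast<std::ptrdiff_t>(i * stride);
    samples.emplace_back(first, first + static_cast<std::ptrdiff_t>(stride));
  }
  return samples;
}
```

A buffer of 6 values viewed as shape [2, 3] unpacks into 3 samples of 2 values each.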

Public Functions

explicit TensorDataset(const std::vector<Tensor> &datatensors)¶

Creates a TensorDataset by unpacking the input tensors.

Parameters:: datatensors – [in] A vector of tensors, which will be unpacked along their last non-singleton dimensions.

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

TransformDataset¶

class TransformDataset : public fl::Dataset ¶

A view into a dataset with values transformed via the specified function(s).

A different transformation may be specified for each array in the input. A null TransformFunction specifies the identity transformation. The dataset size remains unchanged.

Example:

// Make a dataset with 10 samples
auto tensor = fl::rand({5, 4, 10});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Transform it
auto negate = [](const Tensor& arr) { return -arr; };
TransformDataset transformds(ds, {negate});
std::cout << transformds.size() << "\n"; // 10
std::cout << allClose(transformds.get(5)[0], -ds->get(5)[0]) << "\n"; // 1

Public Functions

TransformDataset(std::shared_ptr<const Dataset> dataset, const std::vector<TransformFunction> &transformfns)¶

Creates a TransformDataset.

Parameters:

dataset – [in] The underlying dataset.
transformfns – [in] The mappings used to transform the values. If a TransformFunction is null then the corresponding value is not transformed.
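
The "null function means identity" rule maps naturally onto an empty `std::function`. The sketch below uses hypothetical stand-ins (`Field`, `Sample`, `FieldFn`, `applyTransforms`); passing through fields that have no corresponding function at all is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for a Tensor field, a sample, and a transform.
using Field = std::vector<double>;
using Sample = std::vector<Field>;
using FieldFn = std::function<Field(const Field&)>;

// Apply fns[i] to field i; an empty (null) function leaves it unchanged.
Sample applyTransforms(const Sample& sample, const std::vector<FieldFn>& fns) {
  Sample out;
  out.reserve(sample.size());
  for (std::size_t i = 0; i < sample.size(); ++i) {
    if (i < fns.size() && fns[i]) {
      out.push_back(fns[i](sample[i]));
    } else {
      out.push_back(sample[i]);  // identity
    }
  }
  assert(out.size() == sample.size());  // field count is unchanged
  return out;
}
```

Passing `{negate, nullptr}` negates the first field and returns the second one untouched.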

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

PrefetchDataset¶

class PrefetchDataset : public fl::Dataset ¶

A view into a dataset, where a given number of samples are prefetched in advance in a ThreadPool.

PrefetchDataset should be used when the underlying dataset is accessed sequentially. Otherwise, there will be a lot of cache misses, leading to degraded performance.

Example:

// Make a dataset with 100 samples
auto tensor = fl::rand({5, 4, 100});
std::vector<Tensor> fields{tensor};
auto ds = std::make_shared<TensorDataset>(fields);

// Iterate over the dataset using 4 background threads prefetching 2 samples
// in advance
for (auto& sample : PrefetchDataset(ds, 4, 2)) {
    // do something
}

Public Functions

explicit PrefetchDataset(std::shared_ptr<const Dataset> dataset, int64_t numThreads, int64_t prefetchSize)¶

Creates a PrefetchDataset.

Parameters:

dataset – [in] The underlying dataset.
numThreads – [in] Number of threads used by the threadpool
prefetchSize – [in] Number of samples to be prefetched

virtual int64_t size() const override¶

Returns:: The size of the dataset.

virtual std::vector<Tensor> get(const int64_t idx) const override¶

Parameters:: idx – [in] Index of the sample in the dataset. Must be in [0, size()).
Returns:: The sample fields (a std::vector<Tensor>).

Utils¶

FL_API std::vector< int64_t > partitionByRoundRobin (int64_t numSamples, int64_t partitionId, int64_t numPartitions, int64_t batchSz=1, bool allowEmpty=false)

Partitions the samples in a round-robin manner and returns the ids of the samples.

To deal with end effects, the final samples are included only if at least one sample can be placed in the last batch of every partition.

Parameters:

numSamples – total number of samples
partitionId – rank of the current partition [0, numPartitions)
numPartitions – total partitions
batchSz – batchsize to be used
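
The round-robin scheme described above can be sketched as follows. This is a simplified illustration and not the library's implementation: `roundRobinIds` is a hypothetical helper, it ignores `allowEmpty`, and it assumes leftover samples are split evenly across partitions with any remainder dropped.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: ids of the samples given to `partitionId` when
// samples are dealt out in rounds of numPartitions * batchSz.
std::vector<int64_t> roundRobinIds(
    int64_t numSamples,
    int64_t partitionId,
    int64_t numPartitions,
    int64_t batchSz = 1) {
  assert(numSamples >= 0 && numPartitions > 0 && batchSz > 0);
  assert(partitionId >= 0 && partitionId < numPartitions);
  const int64_t globalBatch = numPartitions * batchSz;
  const int64_t fullRounds = numSamples / globalBatch;
  const int64_t leftover = numSamples % globalBatch;

  std::vector<int64_t> ids;
  // Full rounds: each partition takes batchSz consecutive samples.
  for (int64_t r = 0; r < fullRounds; ++r) {
    const int64_t start = r * globalBatch + partitionId * batchSz;
    for (int64_t b = 0; b < batchSz; ++b) {
      ids.push_back(start + b);
    }
  }
  // Tail: kept only if every partition can get at least one sample.
  if (leftover >= numPartitions) {
    const int64_t perPartition = leftover / numPartitions;
    const int64_t start = fullRounds * globalBatch + partitionId * perPartition;
    for (int64_t b = 0; b < perPartition; ++b) {
      ids.push_back(start + b);
    }
  }
  return ids;
}
```

With 10 samples, 2 partitions, and a batch size of 2, partition 0 receives `{0, 1, 4, 5, 8}` and partition 1 receives `{2, 3, 6, 7, 9}`. With 9 samples the single leftover sample cannot be shared, so it is dropped.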

FL_API std::pair< std::vector< int64_t >, std::vector< int64_t > > dynamicPartitionByRoundRobin (const std::vector< float > &samplesSize, int64_t partitionId, int64_t numPartitions, int64_t maxSizePerBatch, bool allowEmpty=false)

Partitions the samples in a round-robin manner and returns the ids of the samples with dynamic batching: the maximum number of tokens in a batch (including padded tokens) is maxSizePerBatch.

Parameters:

samplesSize – samples length in tokens
partitionId – rank of the current partition [0, numPartitions)
numPartitions – total partitions
maxSizePerBatch – maximum total number of tokens in a batch

FL_API Tensor makeBatch (const std::vector< Tensor > &data, const Dataset::BatchFunction &batchFn={})

Make batch by applying batchFn to the data.

Parameters:

data – data to be batchified
batchFn – function which is applied to make a batch

FL_API std::vector< Tensor > makeBatchFromRange (std::shared_ptr< const Dataset > dataset, std::vector< Dataset::BatchFunction > batchFns, int64_t start, int64_t end)

Makes a batch from a range of indices [start, end) by applying a set of batch functions.

Parameters:

dataset – dataset from which the samples are taken
batchFns – set of functions which are applied to make a batch
start – start index
end – end index