Data Loading¶
Datasets¶
Dataset¶
-
class Dataset¶
Abstract class representing a dataset: a mapping index -> sample, where a sample is a vector of Tensors.
Can be extended to concat, split, batch, resample, etc. datasets.
A Dataset can either own its data directly, or through shared_ptr ownership of underlying Datasets.
Subclassed by fl::BatchDataset, fl::BlobDataset, fl::ConcatDataset, fl::MergeDataset, fl::PrefetchDataset, fl::ResampleDataset, fl::SpanDataset, fl::TensorDataset, fl::TransformDataset, fl::pkg::speech::ListFileDataset, fl::pkg::text::TextDataset, fl::pkg::vision::DistributedDataset, fl::pkg::vision::LoaderDataset< T >, fl::pkg::vision::TransformAllDataset
Public Types
-
using PermutationFunction = std::function<int64_t(int64_t)>¶
A bijective mapping of dataset indices \([0, n) \to [0, n)\).
-
using LoadFunction = std::function<Tensor(const std::string&)>¶
A function to load data from a file into an array.
-
using BatchFunction = std::function<Tensor(const std::vector<Tensor>&)>¶
A function to pack arrays into a batched array.
BatchDatasetPolicy¶
-
enum class fl::BatchDatasetPolicy¶
Policy for handling corner cases when the dataset size is not exactly divisible by batchsize while performing batching.
Values:
-
enumerator INCLUDE_LAST¶
The last samples not evenly divisible by batchsize are packed into a smaller-than-usual batch.
-
enumerator SKIP_LAST¶
The last samples not evenly divisible by batchsize are skipped.
-
enumerator DIVISIBLE_ONLY¶
Constructor raises an error if sizes are not divisible.
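The effect of the three policies on the number of batches can be sketched with plain integer arithmetic. This is a standalone illustration, not Flashlight code: the `Policy` enum and `numBatches` function are hypothetical stand-ins that only mirror the behaviour described above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for fl::BatchDatasetPolicy so the sketch compiles alone.
enum class Policy { INCLUDE_LAST, SKIP_LAST, DIVISIBLE_ONLY };

// Number of batches produced for `n` samples with a fixed `batchSize`.
int64_t numBatches(int64_t n, int64_t batchSize, Policy policy) {
  assert(n >= 0 && batchSize > 0);
  switch (policy) {
    case Policy::INCLUDE_LAST:
      // Round up: the leftover samples form one smaller batch.
      return (n + batchSize - 1) / batchSize;
    case Policy::SKIP_LAST:
      // Round down: the leftover samples are dropped.
      return n / batchSize;
    case Policy::DIVISIBLE_ONLY:
      if (n % batchSize != 0) {
        throw std::invalid_argument("dataset size not divisible by batchsize");
      }
      return n / batchSize;
  }
  throw std::logic_error("unknown policy");
}
```

With 42 samples and a batch size of 10, this gives 5 batches under INCLUDE_LAST (the last one holding 2 samples) and 4 under SKIP_LAST, while DIVISIBLE_ONLY rejects the combination.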
BatchDataset¶
-
class BatchDataset : public fl::Dataset¶
A view into a dataset where samples are packed into batches.
By default, for each field, the inputs must all have the same dimensions, and it batches along the first singleton dimension.
Example:
    // Make a dataset containing 42 tensors of shape [5, 4]
    auto tensor = fl::rand({5, 4, 42});
    std::vector<Tensor> fields{{tensor}};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Batch them with batchsize=10
    BatchDataset batchds(ds, 10, BatchDatasetPolicy::INCLUDE_LAST);
    std::cout << batchds.get(0)[0].shape() << "\n"; // 5 4 10 1
    std::cout << batchds.get(4)[0].shape() << "\n"; // 5 4 2 1

    // Create a vector specifying the size of each batch (dynamic batching)
    std::vector<int64_t> batchSizes = {5, 10, 5, 10, 2, 10};
    // Batch them with batchSizes
    BatchDataset batchdsDynamic(ds, batchSizes);
    std::cout << batchdsDynamic.get(0)[0].shape() << "\n"; // 5 4 5 1
    std::cout << batchdsDynamic.get(5)[0].shape() << "\n"; // 5 4 10 1
Public Functions
Creates a BatchDataset.
- Parameters:
dataset – [in] The underlying dataset.
batchsize – [in] The desired batch size.
policy – [in] How to handle the last batch if sizes are indivisible.
batchfns – [in] Custom batch functions to use for different indices.
Creates a BatchDataset.
- Parameters:
dataset – [in] The underlying dataset.
batchSizes – [in] The desired batch sizes (dynamic).
batchfns – [in] Custom batch functions to use for different indices.
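To see which samples land in which batch when the sizes are given explicitly, the sketch below computes the half-open index range of a batch from a running sum. It assumes batches are taken consecutively, in order, from the underlying dataset; `batchRange` is a hypothetical helper and not part of Flashlight.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open range [start, end) of sample indices covered by batch `batchIdx`
// when consecutive batches have the sizes listed in `batchSizes`.
std::pair<int64_t, int64_t> batchRange(
    const std::vector<int64_t>& batchSizes, std::size_t batchIdx) {
  assert(batchIdx < batchSizes.size());
  int64_t start = 0;
  for (std::size_t i = 0; i < batchIdx; ++i) {
    start += batchSizes[i]; // skip over all earlier batches
  }
  return {start, start + batchSizes[batchIdx]};
}
```

For the sizes {5, 10, 5, 10, 2, 10} used in the example, batch 4 covers samples [30, 32) and batch 5 covers samples [32, 42), so the sizes must sum to the dataset size of 42 for every sample to be used exactly once.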
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
ConcatDataset¶
-
class ConcatDataset : public fl::Dataset¶
A view into two or more underlying datasets with the indexes concatenated in sequential order.
Example:
    // Make two datasets with sizes 10 and 20
    auto makeDataset = [](int size) {
      auto tensor = fl::rand({5, 4, size});
      std::vector<Tensor> fields{tensor};
      return std::make_shared<TensorDataset>(fields);
    };
    auto ds1 = makeDataset(10);
    auto ds2 = makeDataset(20);

    // Concatenate them
    ConcatDataset concatds({ds1, ds2});
    std::cout << concatds.size() << "\n"; // 30
    std::cout << allClose(concatds.get(15)[0], ds2->get(5)[0]) << "\n"; // 1
Public Functions
Creates a ConcatDataset.
- Parameters:
datasets – [in] The underlying datasets.
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
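The index arithmetic behind sequential concatenation can be sketched as follows. `locate` is a hypothetical helper written for illustration only; it maps a global index to the underlying dataset and the local index within it, given just the dataset sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Maps a global index to (dataset number, local index) for datasets whose
// sizes are listed in `sizes`, concatenated in order.
std::pair<std::size_t, int64_t> locate(
    const std::vector<int64_t>& sizes, int64_t idx) {
  assert(idx >= 0);
  for (std::size_t d = 0; d < sizes.size(); ++d) {
    if (idx < sizes[d]) {
      return {d, idx};
    }
    idx -= sizes[d]; // move past dataset d
  }
  throw std::out_of_range("index beyond the concatenated size");
}
```

For sizes {10, 20}, global index 15 resolves to dataset 1 at local index 5, which matches the `concatds.get(15)` versus `ds2->get(5)` comparison in the example.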
MergeDataset¶
-
class MergeDataset : public fl::Dataset¶
A view into two or more underlying datasets with the same indexes, but with fields combined from all the datasets.
The size of the MergeDataset is the max of the sizes of the input datasets.
We have
MergeDataset({ds1, ds2}).get(i) == merge(ds1.get(i), ds2.get(i))
where merge concatenates the std::vector<Tensor> from each dataset.
Example:
    // Make two datasets
    auto makeDataset = []() {
      auto tensor = fl::rand({5, 4, 10});
      std::vector<Tensor> fields{tensor};
      return std::make_shared<TensorDataset>(fields);
    };
    auto ds1 = makeDataset();
    auto ds2 = makeDataset();

    // Merge them
    MergeDataset mergeds({ds1, ds2});
    std::cout << mergeds.size() << "\n"; // 10
    std::cout << allClose(mergeds.get(5)[0], ds1->get(5)[0]) << "\n"; // 1
    std::cout << allClose(mergeds.get(5)[1], ds2->get(5)[0]) << "\n"; // 1
Public Functions
Creates a MergeDataset.
- Parameters:
datasets – [in] The underlying datasets.
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
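The merge rule can be illustrated with a toy model in which a sample is a `std::vector<int>` instead of a `std::vector<Tensor>`. The names `Sample`, `ToyDataset`, `mergedSize` and `mergedGet` are hypothetical. How indices beyond the end of a shorter dataset are handled is an assumption of this sketch (the shorter dataset simply contributes no fields), not something the text above specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Sample = std::vector<int>;        // stand-in for std::vector<Tensor>
using ToyDataset = std::vector<Sample>; // dataset[i] is sample i

// Size of the merged view: the max of the input sizes.
std::size_t mergedSize(const std::vector<ToyDataset>& datasets) {
  std::size_t n = 0;
  for (const auto& ds : datasets) {
    n = std::max(n, ds.size());
  }
  return n;
}

// Sample i of the merged view: the fields of each input, concatenated in order.
Sample mergedGet(const std::vector<ToyDataset>& datasets, std::size_t i) {
  assert(i < mergedSize(datasets));
  Sample out;
  for (const auto& ds : datasets) {
    if (i < ds.size()) { // assumption: shorter datasets contribute nothing
      out.insert(out.end(), ds[i].begin(), ds[i].end());
    }
  }
  return out;
}
```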
ResampleDataset¶
-
class ResampleDataset : public fl::Dataset¶
A view into a dataset, with indices remapped.
Note: the mapping doesn’t have to be bijective.
Example:
    // Make a dataset with 10 samples
    auto tensor = fl::rand({5, 4, 10});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Resample it by reversing it
    auto permfn = [ds](int64_t x) { return ds->size() - 1 - x; };
    ResampleDataset resampleds(ds, permfn);
    std::cout << resampleds.size() << "\n"; // 10
    std::cout << allClose(resampleds.get(9)[0], ds->get(0)[0]) << "\n"; // 1
Subclassed by fl::ShuffleDataset
Public Functions
Constructs a ResampleDataset with the identity mapping:
ResampleDataset(ds)->get(i) == ds->get(i)
- Parameters:
dataset – [in] The underlying dataset.
Constructs a ResampleDataset with mapping specified by a vector:
ResampleDataset(ds, v)->get(i) == ds->get(v[i])
- Parameters:
dataset – [in] The underlying dataset.
resamplevec – [in] The vector specifying the mapping.
Constructs a ResampleDataset with mapping specified by a function:
ResampleDataset(ds, fn)->get(i) == ds->get(fn(i))
The function should be deterministic.
- Parameters:
dataset – [in] The underlying dataset.
resamplefn – [in] The function specifying the mapping.
n – [in] The size of the new dataset (if -1, uses previous size)
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
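Because the mapping need not be bijective, a resampled view can repeat or omit samples and can have a different size from the underlying data. The toy class below illustrates this over a `std::vector<int>`; `ResampledView` is a hypothetical name, and the sketch only mirrors the function-based constructor described above, including the convention that `n == -1` keeps the original size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Toy view over a vector<int> with indices remapped through `fn`.
class ResampledView {
 public:
  ResampledView(
      std::vector<int> data,
      std::function<int64_t(int64_t)> fn,
      int64_t n = -1)
      : data_(std::move(data)),
        fn_(std::move(fn)),
        size_(n == -1 ? static_cast<int64_t>(data_.size()) : n) {}

  int64_t size() const { return size_; }

  // get(i) forwards to the underlying element at fn(i).
  int get(int64_t i) const {
    assert(i >= 0 && i < size_);
    return data_.at(static_cast<std::size_t>(fn_(i)));
  }

 private:
  std::vector<int> data_; // declared first so size_ can read data_.size()
  std::function<int64_t(int64_t)> fn_;
  int64_t size_;
};
```

A mapping such as `x % 2` with `n = 6` yields a view of six samples that alternates between the first two underlying elements.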
ShuffleDataset¶
-
class ShuffleDataset : public fl::ResampleDataset¶
A view into a dataset, with indices permuted randomly.
Example:
    // Make a dataset with 100 samples
    auto tensor = fl::rand({5, 4, 100});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Shuffle it
    ShuffleDataset shuffleds(ds);
    std::cout << shuffleds.size() << "\n"; // 100
    std::cout << "first try" << shuffleds.get(0)[0] << std::endl;

    // Reshuffle it
    shuffleds.resample();
    std::cout << "second try" << shuffleds.get(0)[0] << std::endl;
Public Functions
Creates a ShuffleDataset.
- Parameters:
dataset – [in] The underlying dataset.
seed – [in] The initial seed to be used.
-
void resample()¶
Generates a new random permutation for the dataset.
-
void setSeed(int seed)¶
Sets the PRNG seed.
- Parameters:
seed – [in] The desired seed.
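The role of the seed can be illustrated with the standard library: the same seed produces the same permutation, which is what makes a shuffle reproducible across runs. This sketch uses `std::mt19937` and `std::shuffle` purely for illustration. It is not a description of the generator or algorithm Flashlight uses, and `shuffledIndices` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Returns a permutation of [0, n) determined by `seed`.
std::vector<int64_t> shuffledIndices(int64_t n, unsigned seed) {
  assert(n >= 0);
  std::vector<int64_t> perm(static_cast<std::size_t>(n));
  std::iota(perm.begin(), perm.end(), int64_t{0}); // 0, 1, ..., n - 1
  std::mt19937 rng(seed);
  std::shuffle(perm.begin(), perm.end(), rng);
  return perm;
}
```

The resulting vector could then serve as the index mapping of a resampled view, so that element `i` of the shuffled view reads element `perm[i]` of the underlying data.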
SpanDataset¶
-
class SpanDataset : public fl::Dataset¶
A view into an underlying dataset with an offset and optional bounded length.
The size of the SpanDataset is either specified explicitly or is the size of the input dataset minus the offset.
We have, for example
SpanDataset(ds, 13).get(i) == ds.get(13 + i)
Example:
    // Make a dataset
    auto makeDataset = []() {
      auto tensor = fl::rand({5, 4, 10});
      std::vector<Tensor> fields{tensor};
      return std::make_shared<TensorDataset>(fields);
    };
    auto ds = makeDataset();

    // Create two spanned datasets
    SpanDataset spands1(ds, 2);
    SpanDataset spands2(ds, 0, 2);
    std::cout << spands1.size() << "\n"; // 8
    std::cout << spands2.size() << "\n"; // 2
    std::cout << allClose(spands1.get(3)[0], ds->get(5)[0]) << "\n"; // 1
    std::cout << allClose(spands2.get(1)[0], ds->get(1)[0]) << "\n"; // 1
Public Functions
Creates a SpanDataset.
- Parameters:
dataset – [in] The underlying dataset.
offset – [in] The starting index of the new dataset relative to the underlying dataset.
length – [in] The size of the new dataset (if -1, uses previous size minus the offset)
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
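The offset and length rules reduce to a few lines of arithmetic, sketched below. `Span`, `makeSpan` and `underlyingIndex` are hypothetical names, and the bounds checks (the span must fit inside the underlying dataset) are an assumption of this sketch rather than documented behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Offset and size of a span over an underlying dataset.
struct Span {
  int64_t offset;
  int64_t size;
};

// length == -1 means "everything from offset to the end".
Span makeSpan(int64_t datasetSize, int64_t offset, int64_t length = -1) {
  assert(offset >= 0 && offset <= datasetSize);
  const int64_t size = (length == -1) ? datasetSize - offset : length;
  assert(size >= 0 && offset + size <= datasetSize); // assumed constraint
  return {offset, size};
}

// Index in the underlying dataset that span index i refers to.
int64_t underlyingIndex(const Span& s, int64_t i) {
  assert(i >= 0 && i < s.size);
  return s.offset + i;
}
```

For a dataset of 10 samples, `makeSpan(10, 2)` has size 8 and `makeSpan(10, 0, 2)` has size 2, matching the sizes printed in the example.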
TensorDataset¶
-
class TensorDataset : public fl::Dataset¶
Dataset created by unpacking tensors along the last non-singleton dimension.
The size of the dataset is determined by the size along that dimension. Hence, it must be the same across all Tensors in the input.
Example:
    Tensor tensor1 = fl::rand({5, 4, 10});
    Tensor tensor2 = fl::rand({7, 10});
    TensorDataset ds({tensor1, tensor2});
    std::cout << ds.size() << "\n"; // 10
    std::cout << ds.get(0)[0].shape() << "\n"; // 5 4
    std::cout << ds.get(0)[1].shape() << "\n"; // 7 1
Public Functions
-
explicit TensorDataset(const std::vector<Tensor> &datatensors)¶
Creates a TensorDataset by unpacking the input tensors.
- Parameters:
datatensors – [in] A vector of tensors, which will be unpacked along their last non-singleton dimensions.
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
TransformDataset¶
-
class TransformDataset : public fl::Dataset¶
A view into a dataset with values transformed via the specified function(s).
A different transformation may be specified for each array in the input. A null TransformFunction specifies the identity transformation. The dataset size remains unchanged.
Example:
    // Make a dataset with 10 samples
    auto tensor = fl::rand({5, 4, 10});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Transform it
    auto negate = [](const Tensor& arr) { return -arr; };
    TransformDataset transformds(ds, {negate});
    std::cout << transformds.size() << "\n"; // 10
    std::cout << allClose(transformds.get(5)[0], -ds->get(5)[0]) << "\n"; // 1
Public Functions
Creates a TransformDataset.
- Parameters:
dataset – [in] The underlying dataset.
transformfns – [in] The mappings used to transform the values. If a TransformFunction is null then the corresponding value is not transformed.
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
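The "null means identity" rule maps naturally onto an empty `std::function`, which converts to `false` in a boolean context. The sketch below applies one optional function per field, using `int` as a stand-in for `Tensor`. `Field`, `TransformFn` and `applyTransforms` are hypothetical names, and passing fields through unchanged when there are fewer functions than fields is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Field = int; // stand-in for Tensor
using TransformFn = std::function<Field(const Field&)>;

// Applies fns[i] to field i of the sample; an empty function leaves the
// field untouched.
std::vector<Field> applyTransforms(
    const std::vector<Field>& sample,
    const std::vector<TransformFn>& fns) {
  std::vector<Field> out = sample;
  for (std::size_t i = 0; i < out.size() && i < fns.size(); ++i) {
    if (fns[i]) { // empty std::function: identity
      out[i] = fns[i](out[i]);
    }
  }
  return out;
}
```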
PrefetchDataset¶
-
class PrefetchDataset : public fl::Dataset¶
A view into a dataset, where a given number of samples are prefetched in advance in a ThreadPool.
PrefetchDataset should be used when the underlying dataset is accessed sequentially. Otherwise, there will be many cache misses, which degrades performance.
Example:
    // Make a dataset with 100 samples
    auto tensor = fl::rand({5, 4, 100});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Iterate over the dataset using 4 background threads,
    // prefetching 2 samples in advance
    for (auto& sample : PrefetchDataset(ds, 4, 2)) {
      // do something
    }
Public Functions
Creates a PrefetchDataset.
- Parameters:
dataset – [in] The underlying dataset.
numThreads – [in] Number of threads used by the threadpool
prefetchSize – [in] Number of samples to be prefetched
-
virtual int64_t size() const override¶
- Returns:
The size of the dataset.
Utils¶
-
std::vector<int64_t> partitionByRoundRobin(int64_t numSamples, int64_t partitionId, int64_t numPartitions, int64_t batchSz = 1, bool allowEmpty = false)¶
Partitions the samples in a round-robin manner and returns the ids of the samples.
To deal with end effects, the final samples are included if and only if the last batch can hold at least one sample for every partition.
- Parameters:
numSamples – total number of samples
partitionId – rank of the current partition [0, numPartitions)
numPartitions – total partitions
batchSz – batchsize to be used
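A simplified round-robin partition is sketched below to make the end-effect rule concrete. It is not the library's implementation: `roundRobinIds` is a hypothetical function, it ignores `allowEmpty`, and the way the trailing samples are split between partitions (as evenly as possible, earlier partitions first) is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Ids owned by `partitionId` when samples are dealt out in chunks of batchSz.
std::vector<int64_t> roundRobinIds(
    int64_t numSamples,
    int64_t partitionId,
    int64_t numPartitions,
    int64_t batchSz = 1) {
  assert(numSamples >= 0 && numPartitions > 0 && batchSz > 0);
  assert(partitionId >= 0 && partitionId < numPartitions);
  const int64_t perGlobalBatch = numPartitions * batchSz;
  const int64_t fullBatches = numSamples / perGlobalBatch;
  const int64_t remainder = numSamples % perGlobalBatch;

  std::vector<int64_t> ids;
  // Full global batches: partition p owns the p-th chunk of batchSz samples.
  for (int64_t g = 0; g < fullBatches; ++g) {
    const int64_t start = g * perGlobalBatch + partitionId * batchSz;
    for (int64_t b = 0; b < batchSz; ++b) {
      ids.push_back(start + b);
    }
  }
  // Tail: kept only if every partition can receive at least one sample.
  if (remainder >= numPartitions) {
    const int64_t base = remainder / numPartitions;
    const int64_t extra = remainder % numPartitions;
    const int64_t count = base + (partitionId < extra ? 1 : 0);
    const int64_t start = fullBatches * perGlobalBatch + partitionId * base +
        std::min(partitionId, extra);
    for (int64_t b = 0; b < count; ++b) {
      ids.push_back(start + b);
    }
  }
  return ids;
}
```

With 10 samples, 2 partitions and a batch size of 2, partition 0 receives ids 0, 1, 4, 5, 8 and partition 1 receives 2, 3, 6, 7, 9. With 9 samples the single leftover sample cannot be shared by both partitions, so it is dropped.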
-
std::pair<std::vector<int64_t>, std::vector<int64_t>> dynamicPartitionByRoundRobin(const std::vector<float> &samplesSize, int64_t partitionId, int64_t numPartitions, int64_t maxSizePerBatch, bool allowEmpty = false)¶
Partitions the samples in a round-robin manner with dynamic batching and returns the ids of the samples: the maximum number of tokens in a batch (including padded tokens) is maxSizePerBatch.
- Parameters:
samplesSize – samples length in tokens
partitionId – rank of the current partition [0, numPartitions)
numPartitions – total partitions
maxSizePerBatch – maximum total number of tokens in a batch
-
Tensor makeBatch(const std::vector<Tensor> &data, const Dataset::BatchFunction &batchFn = {})¶
Make batch by applying batchFn to the data.
- Parameters:
data – data to be batchified
batchFn – function which is applied to make a batch
Makes a batch from a range of indices [start, end) by applying a set of batch functions.
- Parameters:
data – dataset from which we take particular samples
batchFns – set of functions which are applied to make a batch
start – start index
end – end index