Data Loading
Datasets
Dataset
- class Dataset
  Abstract class representing a dataset: a mapping index -> sample, where a sample is a vector of Tensors. Can be extended to concat, split, batch, resample, etc. datasets.
  A Dataset can either own its data directly, or through shared_ptr ownership of underlying Datasets.
  Subclassed by fl::BatchDataset, fl::BlobDataset, fl::ConcatDataset, fl::MergeDataset, fl::pkg::speech::ListFileDataset, fl::pkg::text::TextDataset, fl::pkg::vision::DistributedDataset, fl::pkg::vision::LoaderDataset< T >, fl::pkg::vision::TransformAllDataset, fl::PrefetchDataset, fl::ResampleDataset, fl::TensorDataset, fl::TransformDataset
Public Types
- using PermutationFunction = std::function<int64_t(int64_t)>
  A bijective mapping of dataset indices [0, n) -> [0, n).
- using LoadFunction = std::function<Tensor(const std::string&)>
  A function to load data from a file into an array.
- using BatchFunction = std::function<Tensor(const std::vector<Tensor>&)>
  A function to pack arrays into a batched array.
BatchDatasetPolicy
- enum fl::BatchDatasetPolicy
  Policy for handling corner cases when the dataset size is not exactly divisible by batchsize while performing batching.
  Values:
  - INCLUDE_LAST = 0
    The last samples not evenly divisible by batchsize are packed into a smaller-than-usual batch.
  - SKIP_LAST = 1
    The last samples not evenly divisible by batchsize are skipped.
  - DIVISIBLE_ONLY = 2
    The constructor raises an error if the dataset size is not divisible by batchsize.
BatchDataset
- class BatchDataset : public fl::Dataset
  A view into a dataset where samples are packed into batches.
  By default, for each field, the inputs must all have the same dimensions, and batching is performed along the first singleton dimension.
  Example:

    // Make a dataset containing 42 tensors of shape [5, 4]
    auto tensor = fl::rand({5, 4, 42});
    std::vector<Tensor> fields{{tensor}};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Batch them with batchsize=10
    BatchDataset batchds(ds, 10, BatchDatasetPolicy::INCLUDE_LAST);
    std::cout << batchds.get(0)[0].shape() << "\n"; // 5 4 10 1
    std::cout << batchds.get(4)[0].shape() << "\n"; // 5 4 2 1

    // Create a vector specifying each (dynamic) batch size
    std::vector<int64_t> batchSizes = {5, 10, 5, 10, 2, 10};

    // Batch them with batchSizes
    DynamicBatchDataset batchdsDynamic(ds, batchSizes);
    std::cout << batchdsDynamic.get(0)[0].shape() << "\n"; // 5 4 5 1
    std::cout << batchdsDynamic.get(5)[0].shape() << "\n"; // 5 4 10 1
Public Functions
Creates a BatchDataset.
- Parameters
  [in] dataset: The underlying dataset.
  [in] batchsize: The desired batch size.
  [in] policy: How to handle the last batch if sizes are indivisible.
  [in] batchfns: Custom batch functions to use for different indices.

Creates a BatchDataset.
- Parameters
  [in] dataset: The underlying dataset.
  [in] batchSizes: The desired (dynamic) batch sizes.
  [in] batchfns: Custom batch functions to use for different indices.
- int64_t size() const
  Return: The size of the dataset.
ConcatDataset
- class ConcatDataset : public fl::Dataset
  A view into two or more underlying datasets with the indexes concatenated in sequential order.
  Example:

    // Make two datasets with sizes 10 and 20
    auto makeDataset = [](int size) {
      auto tensor = fl::rand({5, 4, size});
      std::vector<Tensor> fields{tensor};
      return std::make_shared<TensorDataset>(fields);
    };
    auto ds1 = makeDataset(10);
    auto ds2 = makeDataset(20);

    // Concatenate them
    ConcatDataset concatds({ds1, ds2});
    std::cout << concatds.size() << "\n"; // 30
    std::cout << allClose(concatds.get(15)[0], ds2->get(5)[0]) << "\n"; // 1
Public Functions
Creates a ConcatDataset.
- Parameters
  [in] datasets: The underlying datasets.
- int64_t size() const
  Return: The size of the dataset.
MergeDataset
- class MergeDataset : public fl::Dataset
  A view into two or more underlying datasets with the same indexes, but with fields combined from all the datasets.
  The size of the MergeDataset is the max of the sizes of the input datasets.
  We have MergeDataset({ds1, ds2}).get(i) == merge(ds1.get(i), ds2.get(i)), where merge concatenates the std::vector<Tensor> from each dataset.
  Example:

    // Make two datasets
    auto makeDataset = []() {
      auto tensor = fl::rand({5, 4, 10});
      std::vector<Tensor> fields{tensor};
      return std::make_shared<TensorDataset>(fields);
    };
    auto ds1 = makeDataset();
    auto ds2 = makeDataset();

    // Merge them
    MergeDataset mergeds({ds1, ds2});
    std::cout << mergeds.size() << "\n"; // 10
    std::cout << allClose(mergeds.get(5)[0], ds1->get(5)[0]) << "\n"; // 1
    std::cout << allClose(mergeds.get(5)[1], ds2->get(5)[0]) << "\n"; // 1
Public Functions
Creates a MergeDataset.
- Parameters
  [in] datasets: The underlying datasets.
- int64_t size() const
  Return: The size of the dataset.
ResampleDataset
- class ResampleDataset : public fl::Dataset
  A view into a dataset, with indices remapped.
  Note: the mapping doesn't have to be bijective.
  Example:

    // Make a dataset with 10 samples
    auto tensor = fl::rand({5, 4, 10});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Resample it by reversing it
    auto permfn = [ds](int64_t x) { return ds->size() - 1 - x; };
    ResampleDataset resampleds(ds, permfn);
    std::cout << resampleds.size() << "\n"; // 10
    std::cout << allClose(resampleds.get(9)[0], ds->get(0)[0]) << "\n"; // 1
Subclassed by fl::ShuffleDataset
Public Functions
Constructs a ResampleDataset with the identity mapping:
ResampleDataset(ds)->get(i) == ds->get(i)
- Parameters
  [in] dataset: The underlying dataset.
Constructs a ResampleDataset with the mapping specified by a vector:
ResampleDataset(ds, v)->get(i) == ds->get(v[i])
- Parameters
  [in] dataset: The underlying dataset.
  [in] resamplevec: The vector specifying the mapping.
Constructs a ResampleDataset with the mapping specified by a function:
ResampleDataset(ds, fn)->get(i) == ds->get(fn(i))
The function should be deterministic.
- Parameters
  [in] dataset: The underlying dataset.
  [in] resamplefn: The function specifying the mapping.
  [in] n: The size of the new dataset (if -1, uses the previous size).
- int64_t size() const
  Return: The size of the dataset.
ShuffleDataset
- class ShuffleDataset : public fl::ResampleDataset
  A view into a dataset, with indices permuted randomly.
  Example:

    // Make a dataset with 100 samples
    auto tensor = fl::rand({5, 4, 100});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Shuffle it
    ShuffleDataset shuffleds(ds);
    std::cout << shuffleds.size() << "\n"; // 100
    std::cout << "first try" << shuffleds.get(0)[0] << std::endl;

    // Reshuffle it
    shuffleds.resample();
    std::cout << "second try" << shuffleds.get(0)[0] << std::endl;
Public Functions
Creates a ShuffleDataset.
- Parameters
  [in] dataset: The underlying dataset.
- void resample()
  Generates a new random permutation for the dataset.
- void setSeed(int seed)
  Sets the PRNG seed.
  - Parameters
    [in] seed: The desired seed.
TensorDataset
- class TensorDataset : public fl::Dataset
  Dataset created by unpacking tensors along the last non-singleton dimension.
  The size of the dataset is determined by the size along that dimension. Hence, it must be the same across all Tensors in the input.
  Example:

    Tensor tensor1 = fl::rand({5, 4, 10});
    Tensor tensor2 = fl::rand({7, 10});
    TensorDataset ds({tensor1, tensor2});
    std::cout << ds.size() << "\n"; // 10
    std::cout << ds.get(0)[0].shape() << "\n"; // 5 4
    std::cout << ds.get(0)[1].shape() << "\n"; // 7 1
Public Functions
- TensorDataset(const std::vector<Tensor> &datatensors)
  Creates a TensorDataset by unpacking the input tensors.
  - Parameters
    [in] datatensors: A vector of tensors, which will be unpacked along their last non-singleton dimensions.
- int64_t size() const
  Return: The size of the dataset.
TransformDataset
- class TransformDataset : public fl::Dataset
  A view into a dataset with values transformed via the specified function(s).
  A different transformation may be specified for each array in the input. A null TransformFunction specifies the identity transformation. The dataset size remains unchanged.
  Example:

    // Make a dataset with 10 samples
    auto tensor = fl::rand({5, 4, 10});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Transform it
    auto negate = [](const Tensor& arr) { return -arr; };
    TransformDataset transformds(ds, {negate});
    std::cout << transformds.size() << "\n"; // 10
    std::cout << allClose(transformds.get(5)[0], -ds->get(5)[0]) << "\n"; // 1
Public Functions
Creates a TransformDataset.
- Parameters
  [in] dataset: The underlying dataset.
  [in] transformfns: The mappings used to transform the values. If a TransformFunction is null then the corresponding value is not transformed.
- int64_t size() const
  Return: The size of the dataset.
PrefetchDataset
- class PrefetchDataset : public fl::Dataset
  A view into a dataset, where a given number of samples are prefetched in advance in a ThreadPool.
  PrefetchDataset should be used when the underlying dataset is accessed sequentially. Otherwise, there will be many cache misses, leading to degraded performance.
  Example:

    // Make a dataset with 100 samples
    auto tensor = fl::rand({5, 4, 100});
    std::vector<Tensor> fields{tensor};
    auto ds = std::make_shared<TensorDataset>(fields);

    // Iterate over the dataset using 4 background threads,
    // prefetching 2 samples in advance
    for (auto& sample : PrefetchDataset(ds, 4, 2)) {
      // do something
    }
Public Functions
Creates a PrefetchDataset.
- Parameters
  [in] dataset: The underlying dataset.
  [in] numThreads: Number of threads used by the thread pool.
  [in] prefetchSize: Number of samples to be prefetched.
- int64_t size() const
  Return: The size of the dataset.
Utils
- std::vector<int64_t> fl::partitionByRoundRobin(int64_t numSamples, int64_t partitionId, int64_t numPartitions, int64_t batchSz = 1, bool allowEmpty = false)
  Partitions the samples in a round-robin manner and returns the ids of the samples.
  To deal with end effects, the final samples are included iff at least one sample fits in the last batch for all partitions.
  - Parameters
    numSamples: total number of samples
    partitionId: rank of the current partition [0, numPartitions)
    numPartitions: total number of partitions
    batchSz: batch size to be used
- std::pair<std::vector<int64_t>, std::vector<int64_t>> fl::dynamicPartitionByRoundRobin(const std::vector<float> &samplesSize, int64_t partitionId, int64_t numPartitions, int64_t maxSizePerBatch, bool allowEmpty = false)
  Partitions the samples in a round-robin manner and returns the ids of the samples, using dynamic batching: the total number of tokens in each batch (including padded tokens) is bounded by maxSizePerBatch.
  - Parameters
    samplesSize: sample lengths in tokens
    partitionId: rank of the current partition [0, numPartitions)
    numPartitions: total number of partitions
    maxSizePerBatch: maximum total number of tokens in a batch
- Tensor fl::makeBatch(const std::vector<Tensor> &data, const Dataset::BatchFunction &batchFn = {})
  Makes a batch by applying batchFn to the data.
  - Parameters
    data: data to be batched
    batchFn: function applied to make a batch
Makes a batch from a subset of indices (range [start, end)) by applying a set of batch functions.
- Parameters
  data: dataset from which the samples are taken
  batchFns: set of functions applied to make a batch
  start: start index
  end: end index