Distributed Library¶
-
void distributedInit(DistributedInit initMethod, int worldRank, int worldSize, const std::unordered_map<std::string, std::string> &params = {})¶
Initialize the distributed environment.
Note that worldSize and worldRank are ignored if DistributedInit::MPI is used.
- Parameters:
initMethod – Initialization method used for setting up the rendezvous
worldSize – Total number of processes in the communication group
worldRank – 0-indexed rank of the current process
params – Additional parameters (if any) needed for initialization
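A minimal initialization sketch (hedged: the umbrella header flashlight/fl/flashlight.h and launching under an MPI launcher are assumptions, not part of this section). With DistributedInit::MPI the rank and size arguments are ignored, so placeholders are passed:
#include "flashlight/fl/flashlight.h" // assumed umbrella header

int main() {
  // With DistributedInit::MPI, rank and size are discovered via MPI, so the
  // worldRank/worldSize arguments below are ignored.
  fl::distributedInit(
      fl::DistributedInit::MPI,
      /* worldRank = */ -1,
      /* worldSize = */ -1,
      /* params = */ {});
  if (fl::isDistributedInit()) {
    auto backend = fl::distributedBackend(); // backend chosen for the rendezvous
    (void)backend;
  }
  return 0;
}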
-
bool isDistributedInit()¶
Returns whether the distributed environment has been initialized.
-
DistributedBackend distributedBackend()¶
Returns the backend used for distributed setup.
-
int getWorldRank()¶
Returns the rank of the current process in the communication group (zero-based).
Returns 0 if the distributed environment is not initialized.
-
int getWorldSize()¶
Returns the total number of processes in the communication group. Returns 1 if the distributed environment is not initialized.
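For example, rank and size are often used to restrict logging to a single process. A brief fragment, assumed to run after distributedInit (requires <iostream>):
// Restrict console output to the first process.
if (fl::getWorldRank() == 0) { // 0 if not initialized
  std::cout << "World size: " << fl::getWorldSize() << std::endl; // 1 if not initialized
}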
-
void allReduce(Variable &var, double scale = 1.0, bool async = false)¶
Synchronizes the array wrapped by a Variable with allreduce.
- Parameters:
var – [in] a variable whose array will be synchronized
scale – [in] scale the Variable after allreduce by this factor
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if true, syncDistributed must be called to ensure the Flashlight CUDA stream waits until allReduce is complete and uses the updated values.
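A typical use is averaging a gradient across processes by scaling with 1/worldSize. A hedged fragment; Variable::grad() is assumed from the autograd API and param is a trainable Variable computed elsewhere:
// Average a parameter's gradient across all processes (after backward()).
fl::Variable& grad = param.grad();
fl::allReduce(grad, /* scale = */ 1.0 / fl::getWorldSize());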
-
void allReduce(Tensor &arr, bool async = false)¶
Synchronizes a single Flashlight array with allreduce.
- Parameters:
arr – an array which will be synchronized
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if used, syncDistributed must be called to ensure the asynchronous reduction and worker streams wait until allReduce is complete and use the updated values.
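The Tensor overload is convenient for values outside the autograd graph, such as a locally accumulated metric. A brief fragment; fl::full is assumed from the Tensor factory API:
// Sum a locally computed scalar metric over all processes (in place).
float localLoss = 0.42f; // placeholder value
fl::Tensor lossSum = fl::full({1}, localLoss);
fl::allReduce(lossSum); // synchronous by default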
-
void allReduceMultiple(std::vector<Variable> vars, double scale = 1.0, bool async = false, bool contiguous = false)¶
Synchronizes the arrays wrapped by a vector of Variables with allreduce.
- Parameters:
vars – [in] Variables whose arrays will be synchronized
scale – [in] scale each Variable after allreduce by this factor
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if used, syncDistributed must be called to ensure the asynchronous reduction and worker streams wait until allReduce is complete and use the updated values.
contiguous – [in] copy data for each Variable into a contiguous buffer before performing the allReduce operation
-
void allReduceMultiple(std::vector<Tensor*> arrs, bool async = false, bool contiguous = false)¶
Synchronizes a vector of pointers to arrays with allreduce.
- Parameters:
arrs – [in] a vector of pointers to arrays which will be synchronized
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if used, syncDistributed must be called to ensure the asynchronous reduction and worker streams wait until allReduce is complete and use the updated values.
contiguous – [in] copy data for each Tensor into a contiguous buffer before performing the allReduce operation
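The pointer overload is useful when the tensors already live in another structure; a brief fragment (fl::full assumed from the Tensor factory API):
// Sum several standalone tensors across processes in a single call.
fl::Tensor counts = fl::full({1}, 10.0f);
fl::Tensor sums = fl::full({1}, 3.5f);
std::vector<fl::Tensor*> arrs = {&counts, &sums};
fl::allReduceMultiple(arrs); // synchronous, results written in place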
-
void syncDistributed()¶
Synchronizes operations in the Flashlight compute stream with operations in the distributed compute stream, if applicable.
That is, subsequent operations on the Flashlight compute stream will not execute until operations currently enqueued on the distributed compute stream have finished executing.
Note that if asynchronous allReduce is not used, this operation will be a no-op, since no operations will be enqueued on the distributed compute stream.
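A sketch of the asynchronous pattern this enables: launch reductions with async = true, overlap other work, then call syncDistributed before reading the results:
// `grad` is a fl::Tensor produced earlier on the Flashlight compute stream.
fl::allReduce(grad, /* async = */ true); // enqueued on the distributed stream
// ... unrelated work on the Flashlight compute stream can proceed here ...
fl::syncDistributed(); // Flashlight stream now waits for the reduction
// `grad` holds the reduced values from this point on.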
-
void barrier()¶
Blocks until all CPU processes have reached this routine.
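barrier is commonly paired with rank-specific work such as checkpointing, so that other processes do not race ahead; fl::save below is an assumed serialization helper, not part of this section:
if (fl::getWorldRank() == 0) {
  fl::save("model.bin", model); // assumed helper; only rank 0 writes
}
fl::barrier(); // everyone waits until rank 0 has finished writing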
Reducer Framework¶
-
class Reducer¶
An interface for creating tensor reduction algorithms/rules.
In flashlight, a Reducer instance is typically used for gradient synchronization across devices/processes during training, although the API is general.
Subclassed by fl::CoalescingReducer, fl::InlineReducer
Public Functions
-
virtual ~Reducer() = default¶
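In training code a Reducer is typically attached to a module so that gradients are synchronized as they are computed during backward(). The helper fl::distributeModuleGrads used below is assumed from the wider Flashlight distributed API and is not documented in this section:
// Register a reducer to synchronize gradients of `model` during backward().
auto model = std::make_shared<fl::Linear>(128, 128);
auto reducer = std::make_shared<fl::InlineReducer>(1.0 / fl::getWorldSize());
fl::distributeModuleGrads(model, reducer); // assumed helper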
Reducers¶
-
class InlineReducer : public fl::Reducer¶
A Reducer which calls allReduce directly on gradients to process.
All synchronized gradients are scaled by a pre-specified factor.
Public Functions
-
explicit InlineReducer(double scale)¶
Creates a new InlineReducer with a given scaling factor.
- Parameters:
scale – [in] the factor by which to scale gradients after synchronization
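For example, averaging gradients across processes corresponds to a scale of 1/worldSize:
// Each gradient handed to this reducer is allReduced immediately and
// scaled down to an average across processes.
auto reducer = std::make_shared<fl::InlineReducer>(1.0 / fl::getWorldSize());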
-
class CoalescingReducer : public fl::Reducer¶
A Reducer which coalesces added Variables in a cache until some maximum cache size is reached, after which all Variables in the cache are reduced and the cache is emptied.
Since the Reducer executes allReduceMultiple operations asynchronously, to guarantee that synchronized values are available after reduction, finalize must be called before using a given value.
Public Functions
-
CoalescingReducer(double scale, bool async, bool contiguous)¶
Creates a new coalescing reducer.
- Parameters:
scale – [in] the factor by which to scale each Variable after synchronization
async – [in] determines whether or not the distributed compute stream runs asynchronously to the Flashlight compute stream.
contiguous – [in] forces synchronization of the set of Variables to occur in a contiguous buffer, which may improve performance.
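A fragment constructing a coalescing reducer that averages gradients, reduces asynchronously, and packs coalesced values into contiguous buffers; per the class description, finalize must be called (e.g. before the optimizer step) before the synchronized values are used:
auto reducer = std::make_shared<fl::CoalescingReducer>(
    /* scale = */ 1.0 / fl::getWorldSize(),
    /* async = */ true,
    /* contiguous = */ true);
// ... after backward(), before reading or applying gradients:
reducer->finalize(); // flush the cache and wait for pending reductions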
-
~CoalescingReducer() override¶
Destroy the Reducer.
Calls finalize() before returning.