Distributed Library
- void fl::distributedInit(DistributedInit initMethod, int worldRank, int worldSize, const std::unordered_map<std::string, std::string>& params = {})

  Initialize the distributed environment. Note that worldRank and worldSize are ignored if DistributedInit::MPI is used.

  - Parameters
    - initMethod: initialization method used for setting up the rendezvous
    - worldRank: 0-indexed rank of the current process
    - worldSize: total number of processes in the communication group
    - params: additional parameters (if any) needed for initialization
- bool fl::isDistributedInit()

  Returns whether the distributed environment has been initialized.
- DistributedBackend fl::distributedBackend()

  Returns the backend used for distributed setup.
- int fl::getWorldRank()

  Returns the rank of the current process in the communication group (zero-based). Returns 0 if the distributed environment is not initialized.
- int fl::getWorldSize()

  Returns the total number of processes in the communication group. Returns 1 if the distributed environment is not initialized.
- void fl::allReduce(Variable &var, double scale = 1.0, bool async = false)

  Synchronizes the array wrapped by a Variable with allreduce.

  - Parameters
    - [in] var: a Variable whose array will be synchronized
    - [in] scale: scale the Variable by this factor after the allreduce
    - [in] async: perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if true, syncDistributed must be called in order to ensure the Flashlight CUDA stream waits until allReduce is complete and uses updated values.
- void fl::allReduce(Tensor &arr, bool async = false)

  Synchronizes a single Flashlight array with allreduce.

  - Parameters
    - [in] arr: an array which will be synchronized
    - [in] async: perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if true, syncDistributed must be called in order to ensure asynchronous reduction and worker streams wait until allReduce is complete and use updated values.
- void fl::allReduceMultiple(std::vector<Variable> vars, double scale = 1.0, bool async = false, bool contiguous = false)

  Synchronizes the arrays wrapped by a vector of Variables with allreduce.

  - Parameters
    - [in] vars: Variables whose arrays will be synchronized
    - [in] scale: scale each Variable by this factor after the allreduce
    - [in] async: perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if true, syncDistributed must be called in order to ensure asynchronous reduction and worker streams wait until allReduce is complete and use updated values.
    - [in] contiguous: copy data for each Variable into a contiguous buffer before performing the allReduce operation
- void fl::allReduceMultiple(std::vector<Tensor *> arrs, bool async = false, bool contiguous = false)

  Synchronizes a vector of pointers to arrays with allreduce.

  - Parameters
    - [in] arrs: a vector of pointers to arrays which will be synchronized
    - [in] async: perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if true, syncDistributed must be called in order to ensure asynchronous reduction and worker streams wait until allReduce is complete and use updated values.
    - [in] contiguous: copy data for each array into a contiguous buffer before performing the allReduce operation
- void fl::syncDistributed()

  Synchronizes operations in the Flashlight compute stream with operations in the distributed compute stream, if applicable. That is, operations in the Flashlight compute stream will not execute until operations currently enqueued on the distributed compute stream have finished executing.

  Note that if asynchronous allReduce is not used, this operation is a no-op, since no operations will be enqueued on the distributed compute stream.
- void fl::barrier()

  Blocks until all CPU processes have reached this routine.
Reducer Framework
- class Reducer

  An interface for creating tensor reduction algorithms/rules. In Flashlight, a Reducer instance is typically used for gradient synchronization across devices/processes during training, although the API is general.

  Subclassed by fl::CoalescingReducer, fl::InlineReducer
Reducers

- class InlineReducer : public fl::Reducer

  A Reducer which calls allReduce directly on gradients to process. All synchronized gradients are scaled by a pre-specified factor.

  Public Functions

  - InlineReducer(double scale)

    Creates a new InlineReducer with a given scaling factor.

    - Parameters
      - [in] scale: the factor by which to scale gradients after synchronization
- class CoalescingReducer : public fl::Reducer

  A Reducer which coalesces added Variables in a cache until some maximum cache size is reached, after which all Variables in the cache are reduced and the cache is emptied.

  Since the Reducer executes allReduceMultiple operations asynchronously, finalize must be called before using a given value, to guarantee that synchronized values are available after reduction.

  Public Functions

  - CoalescingReducer(double scale, bool async, bool contiguous)

    Creates a new coalescing reducer.

    - Parameters
      - [in] scale: the factor by which to scale gradients after synchronization
      - [in] async: determines whether or not the distributed compute stream runs asynchronously to the Flashlight compute stream
      - [in] contiguous: forces synchronization of the set of Variables to occur in a contiguous buffer, which may improve performance
  - ~CoalescingReducer()

    Destroys the Reducer. Calls finalize() before returning.