Distributed Library¶
-
void distributedInit(DistributedInit initMethod, int worldRank, int worldSize, const std::unordered_map<std::string, std::string> &params = {})¶
Initialize the distributed environment.
Note that worldSize and worldRank are ignored if DistributedInit::MPI is used.
- Parameters:
initMethod – Initialization method used for setting up the rendezvous
worldSize – Total number of processes in the communication group
worldRank – 0-indexed rank of the current process
params – Additional parameters (if any) needed for initialization
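A minimal initialization sketch (hedged: the umbrella header flashlight/fl/flashlight.h and launching under an MPI launcher are assumptions, not part of this section). With DistributedInit::MPI the rank and size arguments are ignored, so placeholders are passed:
#include "flashlight/fl/flashlight.h" // assumed umbrella header

int main() {
  // With DistributedInit::MPI, rank and size are discovered via MPI, so the
  // worldRank/worldSize arguments below are ignored.
  fl::distributedInit(
      fl::DistributedInit::MPI,
      /* worldRank = */ -1,
      /* worldSize = */ -1,
      /* params = */ {});
  if (fl::isDistributedInit()) {
    auto backend = fl::distributedBackend(); // backend chosen for the rendezvous
    (void)backend;
  }
  return 0;
}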
-
bool isDistributedInit()¶
Returns whether the distributed environment has been initialized.
-
DistributedBackend distributedBackend()¶
Returns the backend used for distributed setup.
-
int getWorldRank()¶
Returns the rank of the current process in the communication group (zero-based).
Returns 0 if the distributed environment is not initialized.
-
int getWorldSize()¶
Returns the total number of processes in the communication group. Returns 1 if the distributed environment is not initialized.
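For example, rank and size are often used to restrict logging to a single process. A brief fragment, assumed to run after distributedInit (requires <iostream>):
// Restrict console output to the first process.
if (fl::getWorldRank() == 0) { // 0 if not initialized
  std::cout << "World size: " << fl::getWorldSize() << std::endl; // 1 if not initialized
}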
-
void allReduce(Variable &var, double scale = 1.0, bool async = false)¶
Synchronizes the array wrapped by a Variable with allreduce.
- Parameters:
var – [in] a variable whose array will be synchronized
scale – [in] scale the Variable after allreduce by this factor
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if true, syncDistributed must be called to ensure the Flashlight CUDA stream waits until allReduce is complete and uses the updated values.
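A typical use is averaging a gradient across processes by scaling with 1/worldSize. A hedged fragment; Variable::grad() is assumed from the autograd API and param is a trainable Variable computed elsewhere:
// Average a parameter's gradient across all processes (after backward()).
fl::Variable& grad = param.grad();
fl::allReduce(grad, /* scale = */ 1.0 / fl::getWorldSize());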
-
void allReduce(Tensor &arr, bool async = false)¶
Synchronizes a single Flashlight array with allreduce.
- Parameters:
arr – an array which will be synchronized
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if used, syncDistributed must be called to ensure the asynchronous reduction and worker streams wait until allReduce is complete and use the updated values.
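The Tensor overload is convenient for values outside the autograd graph, such as a locally accumulated metric. A brief fragment; fl::full is assumed from the Tensor factory API:
// Sum a locally computed scalar metric over all processes (in place).
float localLoss = 0.42f; // placeholder value
fl::Tensor lossSum = fl::full({1}, localLoss);
fl::allReduce(lossSum); // synchronous by default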
-
void allReduceMultiple(std::vector<Variable> vars, double scale = 1.0, bool async = false, bool contiguous = false)¶
Synchronizes the arrays wrapped by a vector of Variables with allreduce.
- Parameters:
vars – [in] Variables whose arrays will be synchronized
scale – [in] scale each Variable after allreduce by this factor
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if used, syncDistributed must be called to ensure the asynchronous reduction and worker streams wait until allReduce is complete and use the updated values.
contiguous – [in] copy data for each Variable into a contiguous buffer before performing the allReduce operation
-
void allReduceMultiple(std::vector<Tensor*> arrs, bool async = false, bool contiguous = false)¶
Synchronizes a vector of pointers to arrays with allreduce.
- Parameters:
arrs – [in] a vector of pointers to arrays which will be synchronized
async – [in] perform the allReduce operation asynchronously, in a compute stream separate from the Flashlight compute stream. NB: if used, syncDistributed must be called to ensure the asynchronous reduction and worker streams wait until allReduce is complete and use the updated values.
contiguous – [in] copy data for each Tensor into a contiguous buffer before performing the allReduce operation
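The pointer overload is useful when the tensors already live in another structure; a brief fragment (fl::full assumed from the Tensor factory API):
// Sum several standalone tensors across processes in a single call.
fl::Tensor counts = fl::full({1}, 10.0f);
fl::Tensor sums = fl::full({1}, 3.5f);
std::vector<fl::Tensor*> arrs = {&counts, &sums};
fl::allReduceMultiple(arrs); // synchronous, results written in place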
-
void syncDistributed()¶
Synchronizes operations in the Flashlight compute stream with operations in the distributed compute stream, if applicable.
That is, subsequent operations on the Flashlight compute stream will not execute until operations currently enqueued on the distributed compute stream have finished executing.
Note that if asynchronous allReduce is not used, this operation will be a no-op, since no operations will be enqueued on the distributed compute stream.
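A sketch of the asynchronous pattern this enables: launch reductions with async = true, overlap other work, then call syncDistributed before reading the results:
// `grad` is a fl::Tensor produced earlier on the Flashlight compute stream.
fl::allReduce(grad, /* async = */ true); // enqueued on the distributed stream
// ... unrelated work on the Flashlight compute stream can proceed here ...
fl::syncDistributed(); // Flashlight stream now waits for the reduction
// `grad` holds the reduced values from this point on.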
-
void barrier()¶
Blocks until all CPU processes have reached this routine.
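barrier is commonly paired with rank-specific work such as checkpointing, so that other processes do not race ahead; fl::save below is an assumed serialization helper, not part of this section:
if (fl::getWorldRank() == 0) {
  fl::save("model.bin", model); // assumed helper; only rank 0 writes
}
fl::barrier(); // everyone waits until rank 0 has finished writing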
Reducer Framework¶
-
class Reducer¶
An interface for creating tensor reduction algorithms/rules.
In flashlight, a Reducer instance is typically used for gradient synchronization across devices/processes during training, although the API is general.
Subclassed by fl::CoalescingReducer, fl::InlineReducer
Public Functions
-
virtual ~Reducer() = default¶
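In training code a Reducer is typically attached to a module so that gradients are synchronized as they are computed during backward(). The helper fl::distributeModuleGrads used below is assumed from the wider Flashlight distributed API and is not documented in this section:
// Register a reducer to synchronize gradients of `model` during backward().
auto model = std::make_shared<fl::Linear>(128, 128);
auto reducer = std::make_shared<fl::InlineReducer>(1.0 / fl::getWorldSize());
fl::distributeModuleGrads(model, reducer); // assumed helper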
Reducers¶
-
class InlineReducer : public fl::Reducer¶
A Reducer which calls allReduce directly on gradients to process.
All synchronized gradients are scaled by a pre-specified factor.
Public Functions
-
explicit InlineReducer(double scale)¶
Creates a new InlineReducer with a given scaling factor.
- Parameters:
scale – [in] the factor by which to scale gradients after synchronization
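For example, averaging gradients across processes corresponds to a scale of 1/worldSize:
// Each gradient handed to this reducer is allReduced immediately and
// scaled down to an average across processes.
auto reducer = std::make_shared<fl::InlineReducer>(1.0 / fl::getWorldSize());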
-
class CoalescingReducer : public fl::Reducer¶
A Reducer which coalesces added Variables in a cache until some maximum cache size is reached, after which all Variables in the cache are reduced and the cache is emptied.
Since the Reducer executes allReduceMultiple operations asynchronously, to guarantee that synchronized values are available after reduction, finalize must be called before using a given value.
Public Functions
-
CoalescingReducer(double scale, bool async, bool contiguous)¶
Creates a new coalescing reducer.
- Parameters:
scale – [in] the factor by which to scale each Variable after synchronization
async – [in] determines whether or not the distributed compute stream runs asynchronously to the Flashlight compute stream.
contiguous – [in] forces synchronization of the set of Variables to occur in a contiguous buffer, which may improve performance.
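A fragment constructing a coalescing reducer that averages gradients, reduces asynchronously, and packs coalesced values into contiguous buffers; per the class description, finalize must be called (e.g. before the optimizer step) before the synchronized values are used:
auto reducer = std::make_shared<fl::CoalescingReducer>(
    /* scale = */ 1.0 / fl::getWorldSize(),
    /* async = */ true,
    /* contiguous = */ true);
// ... after backward(), before reading or applying gradients:
reducer->finalize(); // flush the cache and wait for pending reductions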
-
~CoalescingReducer() override¶
Destroy the Reducer.
Calls finalize() before returning.