tensorflow/third_party/mpi/.gitignore
Joel Hestness c8535f36d1 Introduce MPI allreduce and allgather in a new contrib project (#12299)
* Allreduce: Rebase to TF 1.3-rc1 (#3)

* Introduce MPI allreduce in a new contrib project.

This commit adds the tensorflow.contrib.mpi namespace and contrib
project, which has a variety of ops that work with MPI.

The MPI system works by starting a background thread which communicates
between the different processes at a regular interval and schedules
asynchronous reductions. At every tick, every rank will notify rank zero
of the tensors it is ready to reduce, signifying completion with an
empty DONE message. Rank zero will count how many ranks are ready to
reduce every tensor, and, whenever a tensor is ready to reduce (that is,
every rank is ready to reduce it), rank zero will issue a message to all
other ranks directing them to reduce that tensor.  This repeats for all
the tensors that are ready to reduce, after which rank zero sends all
other ranks a DONE message indicating that the tick is complete.
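
The per-tick bookkeeping on rank zero can be sketched as follows. This is a minimal, hypothetical sketch (the names are illustrative, not the actual mpi_ops.cc symbols): rank zero accumulates which ranks reported each tensor ready, and orders a reduction once every rank has reported it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// For each tensor name, the set of ranks that reported it ready this tick.
// Returns the tensors every rank is ready to reduce, which rank zero would
// then broadcast as reduce orders before ending the tick with DONE.
std::vector<std::string> TensorsToReduce(
    const std::map<std::string, std::set<int>>& ready_ranks, int world_size) {
  std::vector<std::string> to_reduce;
  for (const auto& entry : ready_ranks) {
    if (static_cast<int>(entry.second.size()) == world_size) {
      to_reduce.push_back(entry.first);  // every rank is ready: reduce it
    }
  }
  return to_reduce;
}
```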

Reviewed-by: Joel Hestness <jthestness@gmail.com>

* Allreduce/Allgather: Major changes and fixes (#2)

This commit constitutes many major updates to the TF MPI allreduce and
allgather ops. Specifically, the following changes are included in this
commit:
1) The allreduce and allgather ops had race conditions, which this commit
fixes. Specifically, the BackgroundThreadLoop previously allocated temporary
and output tensors after the main graph traversal thread had completed its
call to MPIAll*::ComputeAsync(). Unfortunately, the op kernel context's
memory allocator is only guaranteed to be valid during the ComputeAsync call.
This constraint requires ComputeAsync to allocate all tensors before
returning; otherwise, the allocator state may reflect allocations and
deallocations from later ops, which can cause races on those memory locations.
To fix this, hoist the memory allocations to ComputeAsync. In this process,
introduce a collective op record, which tracks the parameters of the op (e.g.
input, output, and configurations).
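
The record mentioned in (1) might look roughly like this. All names and fields here are hypothetical; the real record in mpi_ops.cc tracks TF Tensor objects and op configuration rather than raw pointers. The point is that everything the background thread needs is captured while ComputeAsync's allocator is still valid, so no allocation happens later.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical "collective op record": filled in entirely inside
// ComputeAsync, then handed to the background thread, which only reads it.
struct CollectiveOpRecord {
  int rank = 0;                // MPI rank that owns this record
  std::string name;            // tensor name used to match across ranks
  const void* input = nullptr; // input buffer, already resident
  void* output = nullptr;      // output buffer, allocated in ComputeAsync
  std::size_t num_elements = 0;
  std::function<void()> done;  // completion callback fired after the MPI op
};
```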

2) Many models require the capability to allreduce or allgather int64 tensors.
We add functionality to handle the long long data type (64-bit ints).

3) Eliminate the thread sleep. A major to-do item is to eliminate the need for
polling between coordinator threads and other ranks. This change will require
the coordinator rank to be able to wake up all other ranks when a collective
is ready to be performed, but also for all ranks (i.e. background threads) to
be woken up by graph traversal threads. In the meantime, remove the thread
sleep, because it introduces significant runtime overhead (e.g. >20%) for
models with quick-running layers (e.g. few recurrent time steps or few hidden
nodes per layer).

* mpi_ops.cc: Move toward more TF nature

This commit changes a few bits and pieces to align more closely with
TensorFlow structures and organization:

1) Use TF mutexes. TF mutexes provide nice scoping and management around
std::mutex, and using them is consistent with other TF code.

2) Remove thread sleep at MPI initialization time. Thread sleep should not
be used for polling activity. Instead, this commit replaces sleep-polling
with a condition variable: The compute graph traversal thread waits on the
condition variable until the background thread has completed initialization
and signals the graph traversal thread that initialization is complete.
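
A minimal sketch of this handshake, assuming plain std:: primitives (the real code uses TF's mutex and condition-variable wrappers, and the names below are illustrative):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

struct InitState {
  std::mutex mu;
  std::condition_variable cv;
  bool initialized = false;
};

// Runs on the background thread: after setup (e.g. MPI_Init) completes,
// flip the flag under the lock and wake any waiting traversal threads.
void BackgroundThreadInit(InitState* state) {
  // ... perform MPI initialization and other setup here ...
  std::lock_guard<std::mutex> lock(state->mu);
  state->initialized = true;
  state->cv.notify_all();
}

// Runs on a graph traversal thread: block (no sleep-polling) until the
// background thread signals that initialization is complete. The predicate
// form of wait() also handles spurious wakeups.
void WaitForInit(InitState* state) {
  std::unique_lock<std::mutex> lock(state->mu);
  state->cv.wait(lock, [state] { return state->initialized; });
}
```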

3) Slim MPI initialization check: Since TF permits many threads to be
traversing the compute graph concurrently (e.g. with
inter_op_parallelism_threads > 1), some graph traversal threads may not
have set their GPU device ID. If such a thread executes an MPI op, it would
fail the check in InitializedMPIOnSameDevice, because the background thread
would be controlling a GPU with an ID other than the default (0). Since graph
traversal threads do not perform GPU activity, this GPU ID check was
unnecessary. Remove it and refactor to just check whether MPI is
initialized (IsMPIInitialized).

* Rebase to TF 1.3.0-rc1 complete and tested

* Minor fixes

* Point MPI message proto at contrib/mpi package

* MPI Session: Fix graph handling

* Pylint fixes

* More pylint fixes

* Python 2 pylint fix

* MPI Collectives Ops: Fix coordinator shut down

* Update copyrights to 2017

* Remove MPIDataType and switch to TF DataType

* Add Allgather test, fix Allreduce test config

* Fix BUILD file for TF sanity checks

* Try guarding MPI collectives C++ files with TENSORFLOW_USE_MPI

The TF build system on GitHub tries to build C++ source files in
tensorflow/contrib/mpi_collectives even when configured with TF_NEED_MPI=0.
This leads to a build failure when the mpi_collectives C++ files try to link
against MPI third-party headers, which are not set up. Since we were unable
to reproduce this in the contributor's build environment, we try guarding the
MPI collectives C++ code with TENSORFLOW_USE_MPI defines, similar to
tensorflow/contrib/mpi.
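
The guard pattern looks roughly like this (pattern only; the real guards wrap the kernel definitions in tensorflow/contrib/mpi_collectives, and the helper function below is hypothetical). With TENSORFLOW_USE_MPI undefined, the MPI-dependent code compiles to nothing, so no MPI headers or libraries are needed:

```cpp
#include <cassert>

#ifdef TENSORFLOW_USE_MPI
#include <mpi.h>
// ... op kernel definitions that link against MPI go here ...
#endif  // TENSORFLOW_USE_MPI

// Hypothetical helper showing the compile-time effect of the guard.
bool MpiSupportCompiledIn() {
#ifdef TENSORFLOW_USE_MPI
  return true;
#else
  return false;
#endif
}
```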

* Comment formatting

Hopefully, this will trigger googlebot.
2017-09-18 11:14:20 -07:00
