Launchers
Stoke supports the following launchers:
PyTorch DDP
Prefer the torch.distributed.launch utility described here (note: the local_rank requirement propagates through to stoke).
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --use_env train.py
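As a sketch of the local_rank note above: with `--use_env`, `torch.distributed.launch` sets the `LOCAL_RANK` environment variable instead of passing a `--local_rank` argument. The helper below is hypothetical (not part of stoke or PyTorch) and shows how a `train.py` entrypoint might handle both conventions, assuming only the standard library.

```python
import argparse
import os

def resolve_local_rank() -> int:
    """Return this process's local rank under torch.distributed.launch.

    Without --use_env the launcher passes --local_rank on the command
    line; with --use_env it exports LOCAL_RANK instead. Falls back to 0
    for single-process runs.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    args, _ = parser.parse_known_args()
    if args.local_rank is not None:
        return args.local_rank
    return int(os.environ.get("LOCAL_RANK", 0))
```

Either way, the resolved value is what the local_rank requirement refers to.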
Horovod
Refer to the docs here
horovodrun -np 4 -H localhost:4 python train.py
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
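The relationship between `-np` and `-H` in the commands above is that `-np` (total process count) should equal the sum of the per-host slot counts listed in `-H`. The helper below is a hypothetical illustration (not part of Horovod) that builds both arguments from a list of `(hostname, slots)` pairs.

```python
def horovodrun_args(slots_per_host):
    """Build the -np/-H arguments for horovodrun.

    slots_per_host: list of (hostname, slots) pairs, e.g.
    [("server1", 4), ("server2", 4)]. The total process count is the
    sum of all slots, and -H is the comma-joined host:slots list.
    """
    np_total = sum(slots for _, slots in slots_per_host)
    hosts = ",".join(f"{host}:{slots}" for host, slots in slots_per_host)
    return ["-np", str(np_total), "-H", hosts]
```

For example, `horovodrun_args([("server1", 4), ("server2", 4)])` yields `["-np", "8", "-H", "server1:4,server2:4"]`, matching the two-host analogue of the 16-process command above.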
Horovod w/ OpenMPI
Refer to the docs here. This can also be used with k8s via the MPI Operator.
mpirun -np 4 \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
Deepspeed w/ OpenMPI
Prefer the OpenMPI version here over the native launcher. Deepspeed will automatically discover devices, ranks, etc. via mpi4py. This can also be used with k8s via the MPI Operator.
mpirun -np 4 \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
PyTorch DDP w/ OpenMPI
Leverages Deepspeed functionality to automatically discover devices, ranks, etc. via mpi4py. This can also be used with k8s via the MPI Operator.
mpirun -np 4 \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
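Under `mpirun`, each launched process receives its rank information through environment variables that OpenMPI exports (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_LOCAL_RANK`); these are the same facts that mpi4py-based discovery ultimately resolves. The snippet below is a stdlib-only sketch, assuming an OpenMPI launch, of the information available to each training process.

```python
import os

def ompi_process_info():
    """Read per-process rank info exported by OpenMPI's mpirun.

    Defaults cover a plain (non-mpirun) single-process launch:
    rank 0 of a world of size 1.
    """
    return {
        "rank": int(os.environ.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
    }
```

For the 16-process, four-host command above, each process would see a `rank` in 0..15, a `world_size` of 16, and a `local_rank` in 0..3 on its host.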