Multi-machine Instructions

Environment Preparation

Prepare two development PCs and make sure that each one can log in to the other over SSH without entering a password.

Make sure that the file directories and contents of the code to run on both machines are identical.
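One way to check this is to compare a single digest of the code directory on each PC. The following is only a sketch: it assumes GNU find, sort, xargs and sha256sum are installed, and the function name tree_digest is a hypothetical helper, not part of any tool mentioned here.

```shell
# Print one digest that summarizes every file path and its contents under directory $1.
# Paths are taken relative to $1, so the directory may live in different places on each PC.
tree_digest() {
    (
        cd "$1" || exit 1
        find . -type f -print0 \
            | LC_ALL=C sort -z \
            | xargs -0 -r sha256sum \
            | sha256sum \
            | cut -d' ' -f1
    )
}
```

Running tree_digest /path/to/code on both PCs (the path is a placeholder) should print the same value when the two trees match.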

Launching Docker

Start the Docker container on the first PC, then open a shell inside it.

nvidia-docker run -d -p 10022:22 -it --shm-size="15g" -v ~/.ssh:/root/.ssh ${DOCKER_PATH}
docker exec -it 096e26686644 bash

Here 096e26686644 is the ID of the container started by the first command; you can look it up with docker ps.
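The container ID can also be captured by a command instead of being copied by hand. A minimal sketch, assuming the container you just started is the most recently created one on this PC; latest_container_id is a hypothetical helper name.

```shell
# Print the ID of the most recently created container on this PC.
# --latest selects the newest container, --quiet prints only its ID.
latest_container_id() {
    docker ps --latest --quiet
}
```

With this helper, the second command could be written as docker exec -it "$(latest_container_id)" bash.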

Start the Docker container on the second PC, then open a shell inside it.

nvidia-docker run -d -p 10023:22 -it --shm-size="15g" -v ~/.ssh:/root/.ssh ${DOCKER_PATH}
docker exec -it f0f456bd2824 bash

Note that the host port mapped to the container's SSH port is different here: 10022 on the first PC and 10023 on the second. Make sure the two PCs use different port numbers.

Make sure the code directories inside the docker on both PCs are identical. You can mount the same development directory using -v.
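The launch commands for the two PCs differ only in the mapped SSH port, so they can be generated from one template that also adds the shared code mount. This is a sketch under assumptions: launch_cmd is a hypothetical helper, and /data/project and my_image are placeholders for your code directory and for the image referred to as ${DOCKER_PATH}.

```shell
# Print the launch command for one PC.
#   $1  host port mapped to the container's SSH port 22 (10022 on PC1, 10023 on PC2)
#   $2  code directory, mounted at the same path inside the container
#   $3  Docker image (placeholder for ${DOCKER_PATH})
launch_cmd() {
    printf 'nvidia-docker run -d -p %s:22 -it --shm-size="15g" -v ~/.ssh:/root/.ssh -v %s:%s %s\n' \
        "$1" "$2" "$2" "$3"
}

launch_cmd 10022 /data/project my_image   # command to run on PC1
launch_cmd 10023 /data/project my_image   # command to run on PC2
```

Mounting the directory at the same path on both PCs keeps the paths inside the two containers identical.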

In addition, you can test whether the two containers can log in to each other without passwords. Use the ifconfig command to get the IP address of the container on one machine (PC1). Supposing PC1's IP is 172.17.0.12, go to the container on the other machine (PC2) and run ssh -p 22 172.17.0.12 to check whether you can log in to PC1. If the login succeeds, you can proceed to the next step.
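The same check can be made non-interactive, so that it fails instead of stopping at a password prompt. A minimal sketch, assuming OpenSSH 7.6 or newer (for the accept-new setting); check_passwordless_ssh is a hypothetical helper name.

```shell
# Return 0 only if key-based SSH login to host $1 on port $2 (default 22) succeeds.
# BatchMode=yes makes ssh fail rather than prompt for a password.
# StrictHostKeyChecking=accept-new records the key of a host seen for the first time
# (assumption: OpenSSH 7.6 or newer).
check_passwordless_ssh() {
    ssh -o BatchMode=yes \
        -o StrictHostKeyChecking=accept-new \
        -o ConnectTimeout=5 \
        -p "${2:-22}" "$1" true
}
```

For example, inside the container on PC2, check_passwordless_ssh 172.17.0.12 22 && echo "login ok" would print the message only when PC1 accepts the login without a password.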

Starting Multi-machine Multi-card Training Script with Torchrun

# node1:
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=8888 --rdzv_backend=c10d --rdzv_endpoint=hostip1 tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch
# node2 (rdzv_id must be exactly the same as on node1):
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=8888 --rdzv_backend=c10d --rdzv_endpoint=hostip1 tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch

hostip1: IP address of the container on PC1; use ifconfig to check it.

--nnodes=2: 2 is the total number of PCs.

--nproc_per_node=4: 4 is the number of GPUs on each PC (you may also need to manually change this number to 4 in your config file, e.g. configs/classification/mobilenetv1_imagenet.py).

Run the torchrun command on each node; the multi-machine multi-card training job should then start and run properly.
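Because both nodes run the same command, it can be generated from the two values that vary between setups: the IP of the container on PC1 and the number of GPUs per PC. The sketch below is illustrative only; count_gpus and torchrun_cmd are hypothetical helper names, and the GPU count relies on nvidia-smi -L printing one line starting with "GPU" per device.

```shell
# Count GPUs from the output of `nvidia-smi -L` read on stdin,
# which lists one "GPU <n>: ..." line per device.
count_gpus() {
    grep -c '^GPU '
}

# Print the torchrun command shared by both nodes.
#   $1  IP address of the container on PC1 (the rendezvous endpoint)
#   $2  number of GPUs on each PC
torchrun_cmd() {
    printf 'torchrun --nnodes=2 --nproc_per_node=%s --rdzv_id=8888 --rdzv_backend=c10d --rdzv_endpoint=%s tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch\n' \
        "$2" "$1"
}
```

For example, torchrun_cmd 172.17.0.12 "$(nvidia-smi -L | count_gpus)" prints the command to run on both nodes, assuming each PC has the same number of GPUs.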