Start-up Method

The tools+configs workflow is the most basic way to train with HAT at the moment. In many cases, however, we need to deal with multi-card or distributed environments, and these rely on third-party launchers to organize the basic training workflow efficiently across processes.

In a multi-card or distributed environment, torchrun is one of the common launch methods. Below we briefly walk through the differences between these launch methods using example commands.

Here we take float training of resnet18 as an example; all the startup methods currently supported by HAT are listed for reference.

The Simplest Mode

This mode supports single-machine single-card and single-machine multi-card training, but not multi-machine multi-card. Its configuration only involves modifying the device_ids list in the config.
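As a sketch (the field name device_ids comes from the description above; check your actual HAT config for the exact layout), switching between the two modes is just a matter of how many card indices the list contains:

```python
# Hypothetical HAT config fragment; only the device_ids field is shown.
device_ids = [0]             # single-machine single-card: one GPU index
# device_ids = [0, 1, 2, 3]  # single-machine multi-card: list all GPU indices
```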

Note that single-machine multi-card mode is actually implemented with torch.multiprocessing, which manages all the processes inside a single machine; this is also why it cannot span multiple machines.
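A minimal sketch of this mechanism (illustrative only, not HAT's actual code): torch.multiprocessing spawns one process per card, and each process joins the same process group before communicating. Here we use the CPU gloo backend and an arbitrary port so the sketch runs without GPUs:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Every spawned process joins the same process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29512"  # arbitrary free port, an assumption
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # sums across all processes
    dist.destroy_process_group()


if __name__ == "__main__":
    # mp.spawn manages all worker processes inside one machine,
    # which is exactly why this mode cannot span multiple machines.
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```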

# Single-machine Single-card
python3 tools/train.py --config configs/classification/resnet18.py --stage float

# Single-machine Multi-card (by setting device_ids in config)
python3 tools/train.py --config configs/classification/resnet18.py --stage float

Torchrun

Note that torchrun requires a torch version greater than or equal to 1.10.0. Lower versions use torch.distributed.launch instead, which is not supported by HAT and is not recommended.

Torchrun is the launch tool provided by the PyTorch framework that lets users easily and quickly handle the various environment variables of a distributed environment.

For details on torchrun, see the PyTorch community documentation.
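Concretely, torchrun exports the distributed context to every worker through environment variables. A sketch of how an entry script could read them (the helper name is ours, not HAT's; the defaults make the script also runnable without a launcher):

```python
import os


def get_dist_info():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every worker
    # it spawns; without a launcher we fall back to a single process.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank
```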

# Single-machine Single-card (not recommended for torchrun)
torchrun --nproc_per_node=1 tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch

# Single-machine Multi-card (nproc_per_node must match the number of cards in device_ids in the config)
# To train on specific cards, set CUDA_VISIBLE_DEVICES separately
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch

# Multi-machine Multi-card
# Node1:
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=8888 --rdzv_backend=c10d --rdzv_endpoint=hostip1 tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch

# Node2 (rdzv_id must be exactly the same as on node1):
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=8888 --rdzv_backend=c10d --rdzv_endpoint=hostip1 tools/train.py --config configs/classification/resnet18.py --stage float --launcher torch

Finally, note that both torch.multiprocessing and torchrun are only process managers; communication between processes relies on the process-group initialization inside torch, so the choice of manager does not affect training efficiency.
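Whichever manager spawns the workers, they all rendezvous through the same torch primitive. A minimal single-process sketch (backend and port chosen arbitrarily; a launcher would normally export these variables for each worker):

```python
import os

import torch.distributed as dist

# Set the rendezvous variables by hand so a world of size 1 can
# rendezvous with itself; a launcher would set these per worker.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29513")  # arbitrary free port
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# init_method="env://" reads the variables above; this call is the same
# whether the process was started by python, mp.spawn, or torchrun.
dist.init_process_group(backend="gloo", init_method="env://")
print(dist.get_rank(), dist.get_world_size())  # prints "0 1"
dist.destroy_process_group()
```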

The most important difference among process managers is how they handle the processes they manage: for example, whether the main process can collect the error messages of all nodes when a process exits abnormally, or whether all processes are guaranteed to exit cleanly when a single process throws an exception. In other aspects, such as the internal development mode, the differences are not that big.
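For instance, mp.spawn surfaces a worker failure in the parent process and stops the remaining workers. A sketch of this behavior (the simulated error is ours):

```python
import torch.multiprocessing as mp


def failing_worker(rank):
    # Simulate one worker crashing while the other runs normally.
    if rank == 0:
        raise RuntimeError("worker 0 failed")


if __name__ == "__main__":
    try:
        mp.spawn(failing_worker, nprocs=2, join=True)
    except Exception:
        # mp.spawn re-raises the first worker failure in the parent and
        # terminates the remaining processes, so none are left hanging.
        print("all workers stopped")
```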