tools+configs is currently the most basic way to train with HAT. In many cases, however, training has to run in a multi-card or distributed environment. Such environments rely on third-party libraries to organize the basic training approach efficiently across multiple devices or machines.
In a multi-card or distributed environment, a common launch method is torchrun, among others. This section briefly describes, in terms of the commands used, how these launch methods differ.
The example used here is float training with resnet18, and all launch methods currently supported by HAT are listed for reference.
This mode supports two setups, single-machine single-card and single-machine multi-card, and does not support multi-machine multi-card. Switching between the two setups only requires changing the indices listed in device_ids in the configs.
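As a minimal sketch of the point above, assuming the config is a Python file that exposes a `device_ids` list of GPU indices: the helper `launch_mode` is hypothetical and not part of HAT. It only illustrates how the length of the list distinguishes the two setups.

```python
def launch_mode(device_ids):
    """Classify a device_ids list as single-card or multi-card (one machine).

    Hypothetical helper for illustration only; it is not a HAT API.
    """
    if not device_ids:
        raise ValueError("device_ids must list at least one GPU index")
    if len(set(device_ids)) != len(device_ids):
        raise ValueError("device_ids must not contain duplicate indices")
    return "single-card" if len(device_ids) == 1 else "multi-card"


# Single-machine, single-card: train on GPU 0 only.
device_ids = [0]

# Single-machine, multi-card: for example, the first four GPUs of the machine.
# device_ids = [0, 1, 2, 3]

print(launch_mode(device_ids))
```

All indices refer to GPUs on the same machine, so nothing in this setting can describe a second host.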
Note that the single-machine multi-card setup is implemented with torch.multiprocessing, which manages all the processes within a single machine. This is why multiple machines are not supported.
Note that torchrun requires torch 1.10.0 or later. Earlier versions use torch.distributed.launch instead, which HAT does not support and which is not recommended.
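The version requirement can be checked before choosing a launch method. The following is a rough sketch; `supports_torchrun` is a hypothetical helper, not something provided by HAT or torch. It compares version components numerically, because a plain string comparison would rank "1.9.1" above "1.10.0".

```python
import re


def supports_torchrun(torch_version: str) -> bool:
    """Return True if the version string denotes torch >= 1.10.0.

    Local suffixes such as "+cu118" are ignored. Hypothetical helper.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", torch_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {torch_version!r}")
    # A missing patch component is treated as 0.
    major, minor, patch = (int(g) if g is not None else 0 for g in match.groups())
    return (major, minor, patch) >= (1, 10, 0)


# In a real environment the string would come from torch itself:
#   import torch
#   supports_torchrun(torch.__version__)
print(supports_torchrun("1.9.1"), supports_torchrun("1.10.0"))
```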
torchrun is the launcher provided by the torch framework. It lets users set up the various environment variables of a distributed environment quickly and easily.
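To make the environment variables concrete: torchrun exports, among others, `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` to every worker process. The sketch below reads them into a small structure. The `DistEnv` class, the `read_dist_env` function and the fallback values used when a variable is absent are assumptions made for illustration. They do not show how HAT reads these variables.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    rank: int         # global index of this process
    local_rank: int   # index of this process on its own machine
    world_size: int   # total number of processes
    master_addr: str  # host used for the rendezvous
    master_port: int  # port used for the rendezvous


def read_dist_env(environ=None) -> DistEnv:
    """Read the variables set by torchrun; fall back to single-process values.

    The fallbacks are an assumption of this sketch.
    """
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


print(read_dist_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}))
```

Passing a dictionary instead of reading `os.environ` directly keeps the function easy to exercise without actually launching torchrun.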
For details on torchrun, see the PyTorch community documentation.
Finally, note that Python multiprocess and torchrun are both process managers. Communication between processes depends on how the process group is initialized inside torch, so the choice of process manager does not affect training efficiency.
The most important difference between process managers lies in how they manage processes. For example, when a process exits abnormally, can the main process collect the error messages from all nodes? When a single process throws an exception, can the manager make sure that all the other processes exit completely? In other respects, such as the internal development model, the differences are small.
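The second question, whether every process exits once one of them fails, can be illustrated with a small fail-fast supervisor. This is only a sketch built on the standard library. It does not show how torchrun or torch.multiprocessing are implemented, and `run_workers` is a hypothetical name.

```python
import subprocess
import sys
import time


def run_workers(commands, poll_interval=0.05):
    """Start one process per command; if any exits non-zero, stop the rest.

    Returns the exit codes in launch order.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            any_failed = any(c not in (None, 0) for c in codes)
            all_done = all(c is not None for c in codes)
            if any_failed or all_done:
                break
            time.sleep(poll_interval)
    finally:
        # Terminate whatever is still running so no worker is left behind.
        for p in procs:
            if p.poll() is None:
                p.terminate()
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    long_running = [sys.executable, "-c", "import time; time.sleep(60)"]
    # The long-running worker is terminated soon after the other one fails.
    print(run_workers([failing, long_running]))
```

A manager without this behavior would leave the long-running worker alive after its peer failed, and in distributed training that worker would typically block waiting for a process that no longer exists.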