As described in the previous section introducing the framework, the Data, Model, Callback, and other sub-modules are, once built, handed to the Engine for execution. As the execution engine of the whole of HAT, the Engine is therefore central to the framework.
In HAT, the Engine defines the entire Pipeline for training and prediction. Any deep-learning project must complete training and prediction for a given model, so this section focuses on the implementation of the Engine module in HAT.
PipeBase, the most basic class of the HAT Engine, defines all the running phases in which Callbacks can operate, while LoopBase defines the basic execution flow of all Engines. As shown in the figure above, the Engine execution flow is composed of two kinds of parts: a variety of Callbacks, and the processing operations of the Processor associated with the model.
Callbacks can operate in eight running phases: on_loop_begin, on_epoch_begin, on_step_begin, on_batch_begin, on_batch_end, on_step_end, on_epoch_end, and on_loop_end. The execution order is shown in the figure above. Users can register different Callbacks in different phases, or the same Callback in several phases, as their needs dictate. For example, the common LrUpdater can run in both the on_epoch_begin and on_step_begin phases; the phases covered by the other Callbacks are likewise shown in the figure.
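The phase ordering above can be sketched as a minimal loop that dispatches each phase to every registered callback. This is an illustrative sketch only: the `Callback` hook names match the eight phases listed, but `SketchLoop`, `_emit`, and the constructor arguments are hypothetical, not HAT's actual API.

```python
class Callback:
    """Base callback: every phase hook is an optional no-op override."""
    def on_loop_begin(self, **kw): pass
    def on_epoch_begin(self, **kw): pass
    def on_step_begin(self, **kw): pass
    def on_batch_begin(self, **kw): pass
    def on_batch_end(self, **kw): pass
    def on_step_end(self, **kw): pass
    def on_epoch_end(self, **kw): pass
    def on_loop_end(self, **kw): pass

class SketchLoop:
    """Hypothetical loop skeleton that fires the eight phases in order."""
    def __init__(self, callbacks, batch_processor, num_epochs):
        self.callbacks = callbacks
        self.batch_processor = batch_processor
        self.num_epochs = num_epochs

    def _emit(self, phase, **kw):
        # Dispatch one phase to every registered callback, in order.
        for cb in self.callbacks:
            getattr(cb, phase)(**kw)

    def run(self, data_loader, model):
        self._emit("on_loop_begin")
        for epoch in range(self.num_epochs):
            self._emit("on_epoch_begin", epoch=epoch)
            for step, batch in enumerate(data_loader):
                self._emit("on_step_begin", step=step)
                self._emit("on_batch_begin", batch=batch)
                result = self.batch_processor(batch, model)
                self._emit("on_batch_end", batch=batch, result=result)
                self._emit("on_step_end", step=step)
            self._emit("on_epoch_end", epoch=epoch)
        self._emit("on_loop_end")
```

A callback such as an LR updater would override, say, both `on_epoch_begin` and `on_step_begin` and be invoked automatically at each of those points.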
BatchProcessor is responsible for how the data and model of the current batch are run, covering the basic operations common to model processing, such as forward and backward. Some gradient-update operations are also defined here. Note that complex training tasks may require the BatchProcessor to perform more iterations and richer gradient operations.
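The forward/backward/update responsibilities can be illustrated with a toy batch processor. Everything here is hypothetical: HAT's real BatchProcessor operates on torch modules and optimizers, whereas this sketch trains a single scalar weight by hand so that the flow (forward, backward, accumulated gradient update) stays self-contained.

```python
class BasicBatchProcessor:
    """Hypothetical sketch: forward, backward, and gradient update
    for one batch, with optional gradient accumulation."""

    def __init__(self, lr=0.1, accumulate_steps=1):
        self.lr = lr
        self.accumulate_steps = accumulate_steps  # "richer grad operations"
        self._grad_buffer = 0.0
        self._seen = 0

    def __call__(self, batch, model):
        x, target = batch
        # forward: compute prediction and loss for the current batch
        pred = model["w"] * x
        loss = (pred - target) ** 2
        # backward: gradient of the loss w.r.t. the weight
        grad = 2 * (pred - target) * x
        # grad update: accumulate, then step every `accumulate_steps` batches
        self._grad_buffer += grad
        self._seen += 1
        if self._seen % self.accumulate_steps == 0:
            model["w"] -= self.lr * self._grad_buffer / self.accumulate_steps
            self._grad_buffer = 0.0
        return loss
```

Setting `accumulate_steps > 1` hints at the kind of richer gradient handling the text mentions for complex training tasks.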
Based on LoopBase, a rich set of execution engines can be derived, as shown in the Engine relationship diagram above.
From a functional point of view, a Trainer that focuses on training and a Predictor that focuses on prediction can both be derived from LoopBase.
Trainer: Responsible for all the training-related processes, which are generally needed by deep-learning related training tasks.
Predictor: Responsible for prediction-related processes; commonly used in scenarios such as Validation.
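The Trainer/Predictor split can be sketched as two thin subclasses over a shared loop base. The class names mirror those in the text, but the methods and constructor shown here are illustrative assumptions, not HAT's real interfaces.

```python
class LoopBase:
    """Shared loop skeleton: iterate batches through a batch processor."""
    def __init__(self, batch_processor):
        self.batch_processor = batch_processor

    def run(self, data_loader, model):
        return [self.batch_processor(batch, model) for batch in data_loader]

class Trainer(LoopBase):
    """Training-focused loop: repeats the flow over several epochs."""
    def fit(self, data_loader, model, num_epochs):
        for _ in range(num_epochs):
            self.run(data_loader, model)
        return model

class Predictor(LoopBase):
    """Prediction-focused loop: a single forward-only pass,
    e.g. for Validation."""
    def predict(self, data_loader, model):
        return self.run(data_loader, model)
```

Both subclasses reuse the same `run` flow; they differ only in how many passes they make and what they do with the results, which is exactly the functional split described above.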
Depending on the execution method, different Trainers can be derived, such as DistributedDataParallelTrainer based on torch.nn.parallel.DistributedDataParallel and DataParallelTrainer based on torch.nn.DataParallel. Meanwhile, different execution methods require different launching methods; for details, refer to the launcher in the corresponding Trainer.
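The pairing of a Trainer with its launcher might look like the registry sketch below. The registry, decorator, and function names are hypothetical; in practice a DistributedDataParallelTrainer would wrap the model with torch.nn.parallel.DistributedDataParallel and launch one worker process per device (e.g. via torch.multiprocessing.spawn), while a DataParallelTrainer runs in a single process. The per-device workers are simulated sequentially here to keep the sketch self-contained.

```python
# Maps a trainer name to its launching strategy (illustrative only).
TRAINER_LAUNCHERS = {}

def register_launcher(trainer_name):
    """Decorator registering a launch strategy for a trainer type."""
    def deco(fn):
        TRAINER_LAUNCHERS[trainer_name] = fn
        return fn
    return deco

@register_launcher("DataParallelTrainer")
def launch_single_process(train_fn, num_devices):
    # DataParallel: one process drives all devices, so num_devices
    # does not change the number of workers.
    return [train_fn(rank=0, world_size=1)]

@register_launcher("DistributedDataParallelTrainer")
def launch_one_process_per_device(train_fn, num_devices):
    # DDP: in practice torch.multiprocessing.spawn would fork one
    # worker per device; here the workers run sequentially.
    return [train_fn(rank=r, world_size=num_devices)
            for r in range(num_devices)]

def launch(trainer_name, train_fn, num_devices):
    """Dispatch to the launcher registered for this trainer type."""
    return TRAINER_LAUNCHERS[trainer_name](train_fn, num_devices)
```

This illustrates why each Trainer carries its own launcher: the execution method dictates the process topology before any training code runs.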