This document describes only the operations needed to perform quantization training in HAT. For the basic principles of quantization and its implementation in the training framework, refer to the horizon_plugin_pytorch documentation.
In quantized training, the conversion process from a floating-point model to a fixed-point model is as follows:
Most of these steps are already integrated into the HAT training pipeline. When adding a custom model, the user only needs to implement the fuse_model method to complete model fusion and the set_qconfig method to configure the quantization scheme. The following points need to be noted when writing models.
HAT calls the fuse_model method of the outermost module only, so the implementation of fuse_model is responsible for fusing all submodules.
Prefer the base modules provided in hat.models.base_modules, which already implement the fuse_model method; this reduces effort and development difficulty.
Model registration: all modules in HAT use the registration mechanism. Only after the defined model is registered in the corresponding registry can it be used in the config file as dict(type={$class_name}, ...).
The set_qconfig method needs to be implemented in the outermost module. If a submodule contains a special layer that needs a separate QConfig setting, implement set_qconfig in that submodule as well; details can be found in the Writing Specifications of set_qconfig and Customization of qconfig sections.
In addition, to make the model convertible to a quantized model, some conditions need to be met, as described in the horizon_plugin_pytorch documentation.
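The registration mechanism mentioned above can be sketched in plain Python. This is an illustrative sketch only: the class names (Registry, OBJECT_REGISTRY, MyBackbone) are made up for the example and are not HAT's actual API.

```python
# Minimal sketch of a registry mechanism like the one described above.
# Names here (Registry, OBJECT_REGISTRY, MyBackbone) are illustrative,
# not HAT's actual implementation.
class Registry:
    def __init__(self):
        self._modules = {}

    def register(self, cls):
        # Store the class under its name so configs can refer to it by "type".
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # Pop "type" to find the class, pass the remaining keys as kwargs.
        cfg = dict(cfg)
        cls = self._modules[cfg.pop("type")]
        return cls(**cfg)


OBJECT_REGISTRY = Registry()


@OBJECT_REGISTRY.register
class MyBackbone:
    def __init__(self, depth):
        self.depth = depth


# In a config file the model is then referenced as dict(type=..., ...):
model = OBJECT_REGISTRY.build(dict(type="MyBackbone", depth=50))
```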
When using the tools/train.py script, you only need to specify the training phases in order; the corresponding solver is called automatically according to the training phase to execute the training process.
Training that was unexpectedly interrupted can be resumed by configuring the resume_optimizer and resume_epoch_or_step fields in the {stage}_trainer section of the config, or only the optimizer state can be restored for fine-tuning. For example:
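A hedged sketch of a full-recovery trainer fragment. Only resume_optimizer and resume_epoch_or_step come from this document; the trainer type is a placeholder.

```python
# Hypothetical {stage}_trainer config fragment for resuming interrupted
# training. The trainer "type" is a placeholder; the resume_* fields are
# the ones described above.
float_trainer = dict(
    type="Trainer",             # placeholder trainer type
    resume_optimizer=True,      # restore optimizer and LR state
    resume_epoch_or_step=True,  # restore epoch and step counters
)
```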
Training recovery has three scenarios:
Full Recovery: This scenario resumes training that was unexpectedly interrupted and restores all states from the previous checkpoint, including the optimizer, LR, epoch, step, and so on. In this scenario, you only need to configure the resume_optimizer field.
Resume Optimizer for Fine-tuning: This scenario restores only the optimizer and LR state, with epoch and step reset to 0 for fine-tuning on certain tasks. In this scenario, you need to configure both resume_optimizer and resume_epoch_or_step=False.
Load Model Parameters Only: This scenario loads only the model parameters and does not restore any other state (optimizer, epoch, step, or LR). In this scenario, you need to configure LoadCheckpoint in model_convert_pipeline, resume_optimizer=False, and resume_epoch_or_step=False.
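The fine-tuning and parameters-only scenarios can be sketched as config fragments. The resume_optimizer, resume_epoch_or_step, and LoadCheckpoint names come from the text above; the surrounding field layout and the checkpoint path are assumptions for illustration.

```python
# Scenario 2 - resume optimizer for fine-tuning: keep optimizer/LR state,
# epoch and step are reset to 0.
finetune_trainer = dict(
    resume_optimizer=True,
    resume_epoch_or_step=False,
)

# Scenario 3 - load model parameters only: nothing else is restored; the
# checkpoint is loaded through model_convert_pipeline instead. The converter
# layout and checkpoint path below are placeholders.
load_only_trainer = dict(
    resume_optimizer=False,
    resume_epoch_or_step=False,
    model_convert_pipeline=dict(
        converters=[
            dict(type="LoadCheckpoint", checkpoint_path="float-checkpoint.pth"),
        ],
    ),
)
```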
qat_mode is used to set whether quantization training is performed with BN in the QAT phase. With the help of the FuseBN interface provided by HAT, it can also control whether training runs with BN throughout, or with BN gradually absorbed partway through.
The following three settings are available for qat_mode:
QAT Phase without BN (HAT's default quantization training method)
By setting qat_mode to fuse_bn, the weight and bias of BN are absorbed into those of Conv during op fusion of the floating-point model, so the original Conv + BN combination is left with only Conv. This absorption process is theoretically error-free.
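The absorption can be illustrated with the standard BN folding formulas for a single channel. This is a plain-Python sketch of the arithmetic, not HAT code:

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into a preceding Conv weight and bias.
    Single-channel illustration of the standard folding formulas:
        w' = w * gamma / sqrt(var + eps)
        b' = (b - mean) * gamma / sqrt(var + eps) + beta
    """
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Original Conv + BN for one channel (example numbers)
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.2, -0.3, 0.05, 0.8

x = 2.0
conv_out = w * x + b
bn_out = gamma * (conv_out - mean) / math.sqrt(var + 1e-5) + beta

# Fused Conv reproduces Conv + BN up to floating-point rounding
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
fused_out = w_f * x + b_f
assert abs(bn_out - fused_out) < 1e-9
```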
QAT Phase with BN
By setting qat_mode to with_bn, BN is not absorbed into Conv when the floating-point model is converted to the QAT model; instead, it exists during the QAT phase as a fused quantized op of the form Conv + BN + output quantization node. At the end of quantization training, in the step where the model is converted to quantized (also called int infer), the weight and bias of BN are automatically absorbed into the quantization parameters of Conv, and the quantized op obtained after this absorption remains consistent with the original QAT op's calculation result.
In this mode, the user can also choose to absorb BN into Conv in the middle of QAT. However, the forward results of the QAT model before and after manually absorbing BN are inconsistent: once the BN weight is absorbed into the Conv weight, the quantization parameter conv_weight_scale calculated during earlier quantization training no longer fits the new conv_weight, leading to large quantization errors for conv_weight. This requires more quantization training to readjust the quantization parameters.
QAT Phase with BN (Reverse Fold)
The difference between this mode and with_bn is that the BN weight is taken into account when calculating conv_weight_scale during the quantization training phase before BN is absorbed (the calculation is not detailed here), so that after the BN weight is absorbed, conv_weight_scale still fits the new conv_weight.
This mode is intended to provide a lossless way of absorbing BN step by step: when BN is absorbed in the middle of quantization training, the forward result of the model is theoretically identical before and after the absorption, so the user can gradually absorb all the BNs in the model before the end of quantization training while ensuring that the loss does not fluctuate too much after each absorption.
In this mode, any BN not yet absorbed at the end of quantization training is absorbed automatically when the model is converted from QAT to quantized. In theory, this absorption is lossless.
The user only needs to set qat_mode in model_convert_pipeline.
For example:
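A hedged sketch of what such a config fragment might look like. Only the three qat_mode values come from this document; the surrounding field names are assumptions.

```python
# Placeholder sketch of a model_convert_pipeline entry. "qat_mode" accepts
# one of the three modes described above; other layout details are assumed.
model_convert_pipeline = dict(
    qat_mode="with_bn",  # or "fuse_bn" (default) / "with_bn_reverse_fold"
)
```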
In both with_bn and with_bn_reverse_fold modes, you can set FuseBN as a callback function to absorb the BN in the specified module at the specified epoch or step.
FuseBN definition:
Use the FuseBN example in the config file:
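A hypothetical usage sketch of FuseBN as a callback in a config file. The parameter names (modules, step_or_epoch, update_by) are assumptions inferred from the description above, not HAT's verified signature.

```python
# Hypothetical callback config: absorb the BN in the specified module at
# the specified epoch. All parameter names below are assumptions.
callbacks = [
    dict(
        type="FuseBN",
        modules=["backbone"],  # submodules whose BN should be absorbed
        step_or_epoch=[10],    # when to trigger the absorption
        update_by="epoch",     # assumed trigger unit ("epoch" or "step")
    ),
]
```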
| QAT Mode | When BN Is Absorbed | How BN Is Absorbed | Forward Result Changes After Absorption (Theoretically)? |
|---|---|---|---|
| fuse_bn | Must be during op fusion of the floating-point model | Absorbed after executing fuse_module | No |
| with_bn | Can be in the middle of quantization training | Via a callback function that absorbs at the specified epoch or step | Yes |
| with_bn | Can be during conversion from QAT to quantized | Completed automatically with the model conversion | No |
| with_bn_reverse_fold | Can be in the middle of quantization training | Via a callback function that absorbs at the specified epoch or step | No |
| with_bn_reverse_fold | Can be during conversion from QAT to quantized | Completed automatically with the model conversion | No |
In general, a training process starts with floating-point training and moves on to quantization training once the desired accuracy is reached; in that case only fuse_bn is used. Only when floating-point training is skipped, i.e., training starts directly with quantization training, is a quantization training mode with BN needed to ensure the model converges.
The reason this document says "theoretically lossless" or "no change" before and after absorption is that, in actual computation, there is a small probability that the two floating-point results before and after absorption differ in the later decimal places. This small variation, combined with the quantization operation, may cause some values of the fused Conv output to differ from the Conv + BN output before absorption by one output scale in absolute error.