Quantization-aware training (QAT) works by inserting pseudo-quantization (fake-quantization) nodes into the model, so that accuracy loss is minimized when the trained model is converted into a fixed-point model. QAT is otherwise no different from ordinary model training: one can start from scratch, build a pseudo-quantized model, and train it directly. In practice, however, the constraints of the target hardware platform are difficult to understand and to account for by hand when building such a model. The quantization-aware training tool reduces this burden by automatically inserting pseudo-quantization operators into a provided floating-point model according to the constraints of the deployment platform.
Because of the various constraints it imposes, quantization-aware training is generally harder than training a pure floating-point model. The goal of the quantization-aware training tool is to reduce both the difficulty of quantization-aware training itself and the engineering effort of deploying quantized models.
Although the quantization-aware training tool does not require starting from a pre-trained floating-point model, experience shows that starting from a pre-trained, high-precision floating-point model usually makes quantization-aware training much easier.
Due to the underlying limitations of the deployment platform, the accuracy of the QAT model does not fully reflect the final on-device accuracy. Please make sure to monitor the quantized model's accuracy and confirm that it is normal; otherwise the model may suffer an accuracy drop after deployment.
As the sample code above shows, quantization-aware training adds two steps on top of traditional pure floating-point model training:

1. Transform the floating-point network and insert pseudo-quantization nodes.
2. Load the pseudo-quantization parameters obtained from Calibration to get a better initialization.
At this point the pseudo-quantized model has been built and its parameters initialized; regular training iterations and parameter updates can then proceed, with the quantized model's accuracy monitored along the way.
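The two steps can be illustrated with a deliberately simplified sketch. This is pure Python with a single scalar "weight"; `fake_quant`, `FakeQuantLinear`, and the calibration scale value are illustrative stand-ins, not the tool's real interfaces:

```python
def fake_quant(x, scale, qmin=-128, qmax=127):
    """int8 symmetric fake quantization: quantize, clip, dequantize."""
    return max(qmin, min(qmax, round(x / scale))) * scale

class FakeQuantLinear:
    """A one-weight 'layer' whose weight passes through a fake-quant node."""
    def __init__(self, weight, weight_scale):
        self.weight = weight
        self.weight_scale = weight_scale   # step 2: initialized from calibration

    def forward(self, x):
        # step 1: the fake-quant node inserted between the weight and the multiply
        return fake_quant(self.weight, self.weight_scale) * x

layer = FakeQuantLinear(weight=0.73, weight_scale=0.01)  # scale from calibration
layer.forward(2.0)   # ~1.46: the weight is snapped to the grid (73 * 0.01) before use
```

During training, `weight` (and, in learning-based schemes, `weight_scale`) would then be updated by the usual optimizer steps while accuracy is monitored.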
The main difference between quantization-aware training and ordinary floating-point model training is the insertion of pseudo-quantization operators. Since the various quantization-aware training algorithms are likewise expressed through pseudo-quantization operators, we introduce these operators here.
Since the BPU supports only symmetric quantization, we use symmetric quantization as the example here.
Taking int8 quantization-aware training as an example, the pseudo-quantization operator is generally computed as follows:
fake_quant_x = clip(round(x / scale), -128, 127) * scale
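The formula can be sketched in plain Python (a toy scalar version; real implementations operate element-wise on tensors):

```python
def fake_quant(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """int8 symmetric fake quantization: quantize, then immediately dequantize."""
    q = round(x / scale)         # map onto the integer grid
    q = max(qmin, min(qmax, q))  # clip to the int8 range
    return q * scale             # dequantize back to float

# With scale = 0.1, 0.234 snaps to the nearest representable point:
fake_quant(0.234, 0.1)   # -> 0.2
fake_quant(100.0, 0.1)   # clipped: returns 127 * scale
```

The output stays in floating point, but only values of the form `q * scale` with `q` in [-128, 127] can appear, which is exactly what the fixed-point model will see.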
Just as Conv2d optimizes its weight and bias parameters through training, the pseudo-quantization operator needs training to optimize its scale parameter. However, round is a step function whose gradient is 0 almost everywhere, which makes it impossible to train the pseudo-quantization operator by directly backpropagating gradients through it. Two kinds of solutions are commonly used: statistics-based approaches and learning-based approaches.
The goal of quantization is to uniformly map the floating-point numbers in a Tensor onto the range [-128, 127] represented by int8 via the scale parameter. Since the mapping is uniform, scale follows directly from the extreme values of the Tensor (one common convention maps the largest magnitude to 127; the exact divisor varies slightly between implementations):

scale = max(abs(x_min), abs(x_max)) / 127
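In code, the statistics-based scale computation is a one-liner. This sketch assumes the map-largest-magnitude-to-127 convention:

```python
def compute_scale(xs, qmax: int = 127) -> float:
    """Symmetric scale from observed extremes: the largest magnitude maps to qmax."""
    x_min, x_max = min(xs), max(xs)
    return max(abs(x_min), abs(x_max)) / qmax

compute_scale([-1.5, 0.2, 3.0])   # -> 3.0 / 127
```

Here no gradient is needed at all: scale is recomputed (or smoothed) from the observed data range, which is why this family of methods sidesteps the zero-gradient problem of round.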
Because the data in a Tensor is unevenly distributed and may contain outliers, different methods for computing x_min and x_max have been developed; see MovingAverageMinMaxObserver and related observers.
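The idea behind a MovingAverageMinMaxObserver-style method can be sketched as an exponential moving average of the per-batch extremes, which damps the influence of outlier batches (`averaging_constant` is an assumed parameter name for this sketch):

```python
class MovingAverageMinMax:
    """Toy observer: tracks running min/max with an exponential moving average."""
    def __init__(self, averaging_constant: float = 0.01):
        self.c = averaging_constant
        self.x_min = None
        self.x_max = None

    def observe(self, batch):
        b_min, b_max = min(batch), max(batch)
        if self.x_min is None:   # the first batch initializes the statistics
            self.x_min, self.x_max = b_min, b_max
        else:                    # later batches shift them only slightly
            self.x_min += self.c * (b_min - self.x_min)
            self.x_max += self.c * (b_max - self.x_max)

    def scale(self, qmax: int = 127) -> float:
        return max(abs(self.x_min), abs(self.x_max)) / qmax
```

A single extreme batch therefore moves the tracked range by only a fraction `averaging_constant` of the jump, instead of resetting it outright.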
Please refer to default_qat_8bit_fake_quant_qconfig and its related interfaces for the usage in the tool.
Although the gradient of round is 0, researchers have found experimentally that, in this scenario, simply setting the gradient to 1 (the straight-through estimator) still allows the model to converge to the expected accuracy.
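A scalar sketch of this straight-through gradient with respect to the input: round is treated as the identity in the backward pass, so the gradient is 1 inside the quantization range and 0 where clip saturates (these function names are illustrative, not the tool's API):

```python
def fake_quant_forward(x, scale, qmin=-128, qmax=127):
    """Forward pass: the usual quantize-clip-dequantize."""
    return max(qmin, min(qmax, round(x / scale))) * scale

def fake_quant_grad_x(x, scale, qmin=-128, qmax=127):
    """Straight-through gradient of fake_quant w.r.t. x:
    round contributes 1 (identity), clip zeroes the gradient when saturated."""
    q = round(x / scale)
    return 1.0 if qmin <= q <= qmax else 0.0

fake_quant_grad_x(0.5, 0.1)    # -> 1.0  (inside the range: gradient flows)
fake_quant_grad_x(100.0, 0.1)  # -> 0.0  (saturated: gradient blocked)
```

In a real framework this pair would be registered as a custom autograd function, so that the fake-quant node is differentiable despite the round.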
Please refer to default_qat_8bit_lsq_quant_qconfig and its related interfaces for the usage in the tool.
If you are interested in learning more, you can refer to the paper Learned Step Size Quantization.
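The scale gradient that LSQ derives can be sketched as follows. This follows the paper's per-element gradient of the fake-quantized output with respect to scale; the paper's additional gradient-scaling factor g = 1/sqrt(N * Q_P) is omitted here for brevity:

```python
def lsq_scale_grad(x, scale, qn=128, qp=127):
    """LSQ gradient of fake_quant(x) w.r.t. scale, per element.
    qn/qp are the magnitudes of the negative/positive clip bounds."""
    v = x / scale
    if v <= -qn:   # saturated low:  d/ds(-qn * s) = -qn
        return -float(qn)
    if v >= qp:    # saturated high: d/ds( qp * s) =  qp
        return float(qp)
    # inside the range: d/ds(round(v) * s) with the straight-through
    # estimator applied to round gives round(v) - v
    return round(v) - v

lsq_scale_grad(100.0, 0.1)   # -> 127.0 (saturated element pushes scale up)
```

Because this gradient is well-defined everywhere, scale can be registered as an ordinary learnable parameter and updated by the same optimizer as the weights.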