An important step in quantization is determining the quantization parameters; a reasonable initial set of parameters can significantly improve the model's accuracy and speed up its convergence. Calibration is the process of inserting Observers into the floating-point model, feeding it a small amount of training data, and collecting the distribution of the data at various points during the model's forward pass in order to determine reasonable quantization parameters. Although quantization-aware training (QAT) can be done without calibration, calibration is generally beneficial and never harmful to QAT, so it is recommended that you treat this step as required.
The overall flow of Calibration and QAT is shown below:
Each step is described below:
Build and train the floating-point model. Refer to the section Obtain the Floating-point Model in the quick start section of horizon_plugin_pytorch.
Insert the Observer node into the floating-point model. Refer to the section Calibration in the quick start section of horizon_plugin_pytorch.
Before converting the floating-point model using the prepare method, you need to set qconfig for the model.
get_default_qconfig can set different observers for weight and activation. The observers currently available for calibration are min_max, percentile, mse, kl, and mix.
If there is no special need, it is recommended to use the default "min_max" for weight_observer and "mse" for activation_observer. Special usage and debugging tips are covered in the introduction to the common algorithms below.
The fake_quant parameter has no effect on calibration results; just leave it at its default.
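Conceptually, "inserting an Observer" means attaching a statistics collector to each layer's output. The sketch below illustrates the idea with plain-PyTorch forward hooks and a hypothetical `MinMaxHook` class; it is not the horizon_plugin_pytorch API, which does this automatically via prepare:

```python
import torch
import torch.nn as nn

# Hypothetical illustration: a hook that records the running min/max of a
# layer's output, playing the role of an inserted Observer.
class MinMaxHook:
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def __call__(self, module, inputs, output):
        self.min_val = min(self.min_val, output.min().item())
        self.max_val = max(self.max_val, output.max().item())

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
hooks = {}
for name, mod in model.named_modules():
    if isinstance(mod, (nn.Linear, nn.ReLU)):
        hooks[name] = MinMaxHook()
        mod.register_forward_hook(hooks[name])

# Forward a few "calibration batches"; the hooks observe each layer.
with torch.no_grad():
    for _ in range(4):
        model(torch.randn(16, 8))

for name, h in hooks.items():
    print(name, h.min_val, h.max_val)
```

After the forward passes, each hook holds the observed range of its layer's output, which is exactly the statistic a real observer turns into quantization parameters.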
Set fake quantize state to CALIBRATION.
Fake quantize has three states; the model's fake quantize must be set to the corresponding state before QAT, calibration, and validation, respectively.
In the CALIBRATION state, only the statistics of each operator's inputs and outputs are observed. In the QAT state, pseudo-quantization is performed in addition to collecting statistics. In the VALIDATION state, no statistics are collected and only pseudo-quantization is performed.
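The three states can be sketched as follows. This is a toy illustration of the behavior described above, not the plugin's actual implementation:

```python
import enum
import torch

class FakeQuantState(enum.Enum):
    QAT = "qat"            # observe statistics + pseudo-quantize
    CALIBRATION = "calib"  # observe statistics only
    VALIDATION = "valid"   # pseudo-quantize only

# Toy symmetric int8 fake-quantize module (hypothetical sketch).
class ToyFakeQuantize:
    def __init__(self, qmin=-128, qmax=127):
        self.qmin, self.qmax = qmin, qmax
        self.max_abs = 1e-9
        self.state = FakeQuantState.CALIBRATION

    def observe(self, x):
        self.max_abs = max(self.max_abs, x.abs().max().item())

    def fake_quant(self, x):
        scale = self.max_abs / self.qmax
        q = torch.clamp(torch.round(x / scale), self.qmin, self.qmax)
        return q * scale  # quantize then dequantize

    def __call__(self, x):
        if self.state in (FakeQuantState.QAT, FakeQuantState.CALIBRATION):
            self.observe(x)
        if self.state in (FakeQuantState.QAT, FakeQuantState.VALIDATION):
            return self.fake_quant(x)
        return x  # CALIBRATION: pass through unchanged

fq = ToyFakeQuantize()
x = torch.tensor([0.5, -1.0, 2.0])
out = fq(x)                        # CALIBRATION: x unchanged, stats recorded
fq.state = FakeQuantState.VALIDATION
y = fq(x)                          # VALIDATION: pseudo-quantized output
```

Note that in the CALIBRATION state the tensor passes through untouched, which is why calibration alone never perturbs the floating-point forward pass.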
Perform calibration. Feed the prepared calibration data to the model; during the forward pass, the observers collect the relevant statistics.
Set the model state to eval and set the fake quantize state to VALIDATION.
Verify the effect of calibration. If you are satisfied with the result, you can convert the model directly to fixed-point or perform quantization-aware training on top of it; if not, adjust the parameters in the calibration qconfig and continue calibrating.
Refer to the API documentation at the end of this section for a description of each operator's parameters.
| Algorithm | Speed Rank | Accuracy Rank | Ease-of-Use Rank |
|---|---|---|---|
| min_max | 1 | 5 | 1 |
| percentile | 2 | 4 | 4 |
| mse | 4 | 1 | 2 |
| kl | 5 | 2 | 3 |
| mix | 3 | 2 | 1 |
The performance of several popular calibration methods is shown in the table above, where smaller ranks are better. Speed indicates the time taken to calibrate the same data, accuracy indicates how well the method calibrates on most models, and ease of use indicates the complexity of the method's tuning parameters.
For the same model, the accuracy and speed of different methods with different parameters can differ considerably, and recent research has shown that no single method achieves the best accuracy on all models; parameters must be tuned per model. It is therefore recommended that you try all of these calibration methods.
min_max. This method only tracks the moving average of the maximum and minimum values, and is used for quickly determining general parameters such as batch size and average_constant.
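The min_max behavior can be illustrated with standard PyTorch's analogous observer, where averaging_constant plays the role of average_constant (an illustration only; the plugin provides its own observer classes):

```python
import torch
from torch.ao.quantization.observer import MovingAverageMinMaxObserver

# Records a moving average of the per-batch min/max of everything it sees,
# then derives quantization parameters from the averaged range.
obs = MovingAverageMinMaxObserver(averaging_constant=0.01)
for _ in range(100):
    obs(torch.randn(256))  # 100 "calibration steps"

scale, zero_point = obs.calculate_qparams()
print(float(obs.min_val), float(obs.max_val), float(scale), int(zero_point))
```

Because it involves no search at all, this is the fastest method, which is why it is the recommended way to run through the pipeline once and settle batch size and average_constant.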
percentile. This method has the highest accuracy upper limit of all the methods, but it is also the most troublesome to tune. If the accuracy requirement can be met by other methods, or by this method's default parameters, it is not recommended to spend much time tuning. Percentile has two adjustable parameters, bins and percentile. More bins mean a smaller interval between the candidate max values and finer-grained tuning, but also longer computation. It is recommended to determine percentile first and then adjust bins, alternating between the two to narrow the tuning range until a satisfactory result is achieved. In most cases, 2048 bins provide enough granularity, so this parameter rarely needs separate adjustment. The following is the tuning path of one model:
| Order | percentile | bins | Accuracy |
|---|---|---|---|
| 1 | 99.99 | 2048 | 53.75 |
| 2 | 99.99 | 4096 | 54.38 |
| 3 | 99.995 | 4096 | 16.25 |
| 4 | 99.985 | 4096 | 32.67 |
| 5 | 99.9875 | 4096 | 57.06 |
| 6 | 99.9875 | 8192 | 62.84 |
| 7 | 99.98875 | 8192 | 57.62 |
| 8 | 99.988125 | 8192 | 63.15 |
In this example, careful tuning improved accuracy by about 10 points. The inputs and outputs of different ops in a model can differ considerably, so one set of global percentile parameters may not satisfy all ops. If you need higher accuracy, first find a good global parameter by the method above, then use the debug tool to find the few ops with the largest error and set percentile parameters for those ops individually (refer to the qconfig setting). Below are some common data distributions that are prone to large errors:

For ultra-long-tailed distributions, percentile should be set to a smaller value; in the picture, 99.9 is a better choice.

When the value range is too large and the distribution is not concentrated in one place, both retaining the tail and discarding it cause a large accuracy loss. This situation should be avoided during floating-point training by adjusting parameters such as weight decay.

The layernorm output distribution shows several regions of very high concentration. In this case, adjusting percentile in the normal way has no effect on the quantization result; you need to increase the adjustment step of percentile.
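As a sketch of how percentile clipping works (an illustration, not the plugin's implementation): the calibration max is taken as the histogram bucket edge below which the requested fraction of the absolute values falls, so a long outlier tail is simply ignored.

```python
import torch

# Hypothetical helper: percentile-based clipping threshold from a histogram.
# More `bins` means finer-grained candidate values, at higher cost.
def percentile_max(x, percentile=99.9, bins=2048):
    absx = x.abs()
    top = absx.max().item()
    hist = torch.histc(absx, bins=bins, min=0.0, max=top)
    cdf = torch.cumsum(hist, 0) / hist.sum()
    idx = int(torch.searchsorted(cdf, percentile / 100.0).clamp(max=bins - 1))
    return (idx + 1) * top / bins  # right edge of the matched bucket

g = torch.Generator().manual_seed(0)
x = torch.randn(100_000, generator=g)
x[:10] = 300.0                     # a few extreme outliers (long tail)
pm = percentile_max(x, 99.9)
print(pm, x.abs().max().item())    # clipped max vs. outlier-dominated max
```

With a long-tailed input like this one, the 99.9th-percentile threshold lands near the bulk of the distribution instead of at the extreme outliers, which is exactly the behavior recommended above for ultra-long-tailed distributions.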
mse. The only adjustable parameter is stride (default 1). The method steps through the 100 quantiles of the maximum value and selects the candidate with the smallest error (L2 distance) between the tensor before and after quantize-dequantize. This method is time-consuming for large models; increasing stride within a reasonable range reduces the time while preserving accuracy, but too large a stride hurts accuracy. Note that tuning this method's parameters can only reduce time consumption; it cannot significantly improve accuracy.
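The mse search can be sketched as follows (a hypothetical `mse_max` helper, not the plugin's code): try clipping thresholds at fractions of the observed max and keep the one whose quantize-dequantize round trip has the smallest L2 error, with `stride` skipping candidates to trade search time for precision.

```python
import torch

# Linear search for the clipping threshold minimizing quantization MSE.
def mse_max(x, qmax=127, steps=100, stride=1):
    max_abs = x.abs().max().item()
    best_t, best_err = max_abs, float("inf")
    for i in range(1, steps + 1, stride):
        t = max_abs * i / steps                      # candidate threshold
        scale = t / qmax
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = torch.sum((x - q) ** 2).item()         # L2 round-trip error
        if err < best_err:
            best_t, best_err = t, err
    return best_t

g = torch.Generator().manual_seed(0)
x = torch.randn(10_000, generator=g)
t_full = mse_max(x, stride=1)   # exhaustive search over all 100 candidates
t_fast = mse_max(x, stride=5)   # ~5x fewer candidates, similar threshold
print(t_full, t_fast)
```

The inner loop's cost scales with the number of candidates tried, which is why stride only affects time, not the accuracy ceiling of the method.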
kl. There are two adjustable parameters, bin and update_interval. Because this method is very time-consuming, adjusting the default bin is not recommended. update_interval defaults to 1; increasing it reduces the time cost, but you must make sure update_interval is smaller than the total number of calibration steps, otherwise no valid quantization parameters can be obtained.
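A compact sketch of KL (entropy) calibration follows. It is simplified from the well-known TensorRT-style algorithm and glosses over edge cases; it is not the plugin's implementation. Among candidate clipping points, it picks the one whose 128-level quantized histogram has the smallest KL divergence from the original (clipped) histogram:

```python
import torch

def kl_threshold(x, bins=2048, qlevels=128):
    absx = x.abs()
    max_val = absx.max().item()
    hist = torch.histc(absx, bins=bins, min=0.0, max=max_val)
    width = max_val / bins
    best_i, best_kl = bins, float("inf")
    for i in range(qlevels, bins + 1, qlevels):      # candidate clip bins
        p = hist[:i].clone()
        p[-1] += hist[i:].sum()                      # fold tail into last bin
        group = i // qlevels                         # source bins per level
        q = torch.zeros(i)
        for j in range(qlevels):
            chunk = hist[j * group:(j + 1) * group]
            nz = chunk > 0
            if nz.any():                             # spread level mass over
                q[j * group:(j + 1) * group][nz] = chunk.sum() / nz.sum()
        p, q = p / p.sum(), q / q.sum()
        mask = (p > 0) & (q > 0)                     # ignore empty buckets
        kl = torch.sum(p[mask] * torch.log(p[mask] / q[mask])).item()
        if kl < best_kl:
            best_i, best_kl = i, kl
    return best_i * width                            # clipping threshold

g = torch.Generator().manual_seed(0)
x = torch.randn(50_000, generator=g)
print(kl_threshold(x))
```

The many histogram passes per candidate are what make KL the slowest method in the ranking table above.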
mix. This method is a hybrid calibration: at each place where statistics are needed, it tries different parameters of the percentile method and selects the one with the smallest error (L2 distance) between the tensor before and after quantize-dequantize. It is highly automated and has no parameters to tune.
For calibration, more data is generally better. However, due to diminishing returns, once the amount of data reaches a certain level the accuracy improvement becomes very limited. If your training set is small, you can use all of it for calibration; if it is large, select a subset of appropriate size in light of the calibration time cost. It is recommended to calibrate for at least 10 - 100 steps.
The data may be augmented with horizontal flips, but do not use mosaic augmentation. Try to use the inference-stage pre-processing together with training data for calibration.
Batch size should be as large as possible, but can be reduced if the data is noisy or the model has many outliers. This parameter should be determined while trying the min_max method.
average_constant controls the effect of each step on the min/max statistics: the smaller average_constant is, the smaller the effect of the current step and the larger the effect of the historical moving average. This parameter should be tuned between 0.01 and 0.5 depending on the amount of data. With enough data (more than 100 steps), average_constant can be 0.01; with less data, increase it as appropriate; in the extreme case of only 2 steps of data, it can be 0.5. Determine this parameter while trying the min_max method; all other methods then reuse it.
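The update that average_constant (c below) controls can be written out as `new_stat = (1 - c) * old_stat + c * batch_stat`. A plain-Python sketch of how c trades responsiveness against smoothing:

```python
# Exponential moving average of a running statistic (e.g. batch max).
def update(old, batch, c):
    return (1 - c) * old + c * batch

# Few calibration steps: a large c lets the statistic catch up quickly.
stat = 1.0
for batch_max in [2.0, 2.0, 2.0]:
    stat = update(stat, batch_max, c=0.5)
print(stat)  # 1.875, already close to the true max of 2.0

# Many steps: a small c smooths batch-to-batch noise, converging slowly.
stat = 1.0
for batch_max in [2.0] * 300:
    stat = update(stat, batch_max, c=0.01)
print(stat)  # approaches 2.0 gradually
```

This is why the text above recommends c = 0.01 when there are more than 100 steps of data, but values up to 0.5 when only a couple of steps are available.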
When the calibrated model's accuracy is good, fixing the feature-map quantization parameters during QAT can achieve better results; when the accuracy is poor, the parameters obtained from calibration should not be fixed. There is no clear standard for what counts as good accuracy, so experimentation is needed. For example, if a model's floating-point accuracy is 100 and its calibrated accuracy is 50, the accuracy is clearly not good enough; but if the calibrated accuracy is 95, whether the feature-map quantization parameters can be fixed needs to be tested, usually by running comparison experiments with and without fixing.
Try the min_max method first: it is the fastest, and is used to run through the calibration process and to tune and settle the batch size and average_constant parameters. Then try the percentile, kl, mse, and mix methods and select the one with the best result.
KL observer.
KL observer based on histogram. Histogram is calculated online and won’t be saved.
Defines the computation performed at every call. Should be overridden by all subclasses. Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
MSE observer.
Observer module for computing the quantization parameters based on the Mean Square Error (MSE) between the original tensor and the quantized one.
This observer performs a linear search for the quantization scale that minimizes the MSE.
Min max observer.
This observer computes the quantization parameters based on minimums and maximums of the incoming tensors. The module records the moving average minimum and maximum of incoming tensors, and uses this statistic to compute the quantization parameters.
Record the running minimum and maximum of x.
Mix observer.
This observer computes the quantization parameters based on multiple calibration methods and selects the quantization parameters with the smallest quantization error.
Percentile observer.
Percentile observer based on histogram. Histogram is calculated online and won’t be saved. The minimum and maximum are moving averaged to compute the quantization parameters.
MovingAverageMinMax Observer.
Observer module for computing the quantization parameters based on the moving average of the min and max values.
This observer computes the quantization parameters based on the moving averages of minimums and maximums of the incoming tensors. The module records the average minimum and maximum of incoming tensors, and uses this statistic to compute the quantization parameters.
Record the running minimum and maximum of x.
MovingAveragePerChannelMinMax Observer.
Observer module for computing the quantization parameters based on the running per channel min and max values.
This observer uses the tensor min/max statistics to compute the per channel quantization parameters. The module records the running minimum and maximum of incoming tensors, and uses this statistic to compute the quantization parameters.