Calibration Tutorial

In quantization, an important step is determining the quantization parameters. A reasonable initial set of quantization parameters can significantly improve the accuracy of the model and speed up its convergence. Calibration is the process of inserting Observers into the floating-point model, feeding it a small amount of training data, and recording the distribution of the data at various points during the forward pass to determine reasonable quantization parameters. Although quantization-aware training (QAT) is possible without calibration, calibration is generally beneficial and never harmful to QAT, so it is recommended that you treat this step as required.

Process and Example

The overall flow of Calibration and QAT is shown below:

[Figure: calibration_v2_workflow]

Following is a description of each step:

  1. Build and train the floating-point model. Refer to the section Obtain the Floating-point Model in the quick start section of horizon_plugin_pytorch.

  2. Insert Observer nodes into the floating-point model. Refer to the section Calibration in the quick start section of horizon_plugin_pytorch. Before converting the floating-point model using the prepare method, you need to set the qconfig for the model.

    model.qconfig = horizon.quantization.get_default_qconfig()

    get_default_qconfig can set different observers for weights and activations. The observers currently available for calibration are min_max, percentile, mse, kl, and mix. Unless you have special needs, it is recommended to keep the default "min_max" for weight_observer and to use "mse" for activation_observer. Special usage and debugging tips are described in the common algorithms introduction below.

    The fake_quant parameters have no effect on calibration results; just leave them at their defaults.

    def get_default_qconfig(
        activation_fake_quant: Optional[str] = "fake_quant",
        weight_fake_quant: Optional[str] = "fake_quant",
        activation_observer: Optional[str] = "min_max",
        weight_observer: Optional[str] = "min_max",
        activation_qkwargs: Optional[Dict] = None,
        weight_qkwargs: Optional[Dict] = None,
    ):
  3. Set fake quantize state to CALIBRATION.

    horizon.quantization.set_fake_quantize(model, horizon.quantization.FakeQuantState.CALIBRATION)

    Fake quantize has three states, and the model's fake quantize must be set to the corresponding state before QAT, calibration, and validation, respectively. In the CALIBRATION state, only the statistics of each operator's inputs and outputs are observed. In the QAT state, fake quantization is performed in addition to collecting statistics. In the VALIDATION state, no statistics are collected and only fake quantization is performed.

    class FakeQuantState(Enum):
        QAT = "qat"
        CALIBRATION = "calibration"
        VALIDATION = "validation"
  4. Perform calibration. Feed the prepared calibration data to the model; during the forward pass, the observers will collect the relevant statistics.

  5. Set the model state to eval and set the fake quantize state to VALIDATION.

    model.eval()
    horizon.quantization.set_fake_quantize(model, horizon.quantization.FakeQuantState.VALIDATION)
  6. Verify the effect of calibration. If you are satisfied with the result, you can convert the model directly to fixed-point or perform quantization-aware training on top of it; if not, adjust the parameters in the calibration qconfig and calibrate again.

Common Algorithms Introduction

Note

Refer to the API documentation at the end of this section for a description of each observer's parameters.

Algorithm     Speed Rank    Accuracy Rank    Easy-to-use Rank
min_max       1             5                1
percentile    2             4                4
mse           4             1                2
kl            5             2                3
mix           3             2                1

The performance of several popular calibration methods is shown in the table above, where a smaller rank is better. Speed ranks the time taken to calibrate the same data, accuracy ranks how well the method calibrates most models, and easy-to-use ranks the complexity of the method's tuning parameters.

For the same model, the accuracy and speed of different methods with different parameters can differ substantially. Recent research shows that no single method achieves the best accuracy on all models and that parameters must be tuned per model, so it is recommended that you try all of these calibration methods.

  1. min_max. This method only tracks the moving average of the maximum and minimum values, and is mainly used for quickly determining general parameters such as batch size and average_constant.
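As a toy illustration of the moving-average mechanism (the function name and data are invented here; this is not the plugin's implementation), a min_max-style update can be sketched as:

```python
def moving_average_min_max(batches, averaging_constant=0.01):
    """Sketch of a min_max-style observer: track exponential moving
    averages of the per-batch minimum and maximum."""
    running_min = running_max = None
    for batch in batches:
        b_min, b_max = min(batch), max(batch)
        if running_min is None:
            # the first step initializes the statistics directly
            running_min, running_max = b_min, b_max
        else:
            # each step nudges the running value by averaging_constant;
            # a smaller constant gives more weight to the history
            running_min += averaging_constant * (b_min - running_min)
            running_max += averaging_constant * (b_max - running_max)
    return running_min, running_max
```

With few calibration steps a larger averaging_constant reacts faster to new batches; with many steps a small constant smooths out noisy batches, which is exactly the trade-off discussed in the tuning techniques below.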

  2. percentile. This method has the highest accuracy ceiling of all the methods, but it is also the most troublesome to tune. If the accuracy requirement can be met by another method or by this method's default parameters, it is not recommended to spend too much time tuning it. percentile has two adjustable parameters, bins and percentile. More bins means a smaller interval between candidate max values and finer-grained tuning, but also more computation time. It is recommended to determine percentile first and then adjust bins, alternating between the two to narrow down the tuning range until a satisfactory result is achieved. In most cases, 2048 bins provide enough granularity, so there is no need to adjust this parameter separately. The following is the tuning path of one model:

    Order    percentile    bins    Accuracy
    1        99.99         2048    53.75
    2        99.99         4096    54.38
    3        99.995        4096    16.25
    4        99.985        4096    32.67
    5        99.9875       4096    57.06
    6        99.9875       8192    62.84
    7        99.98875      8192    57.62
    8        99.988125     8192    63.15

    In this example, accuracy improved by about 10 points after careful tuning. The inputs and outputs of different ops in the model can differ substantially, so a single set of global percentile parameters may be hard to satisfy all ops. If you need higher accuracy, find a good global parameter with the method above, then use the debug tool to find the ops with the largest errors and set percentile parameters for those ops individually (refer to the qconfig setting). Below are some common data distributions that are prone to large errors:

    [Figure: calibration_percentile_longtail]

    For ultra-long-tailed distributions, the percentile should be set to a smaller value; in the picture, 99.9 is a better choice.

    [Figure: calibration_percentile_bimodal]

    When the value range is too large and the distribution is not concentrated in one place, both keeping the tail and discarding it cause a large accuracy loss. This situation should be avoided during floating-point training by adjusting weight decay and similar parameters.

    [Figure: calibration_percentile_ln]

    The layernorm output distribution shows several regions of very high concentration. In this case, adjusting the percentile in the normal way has no effect on the quantization result; you need to increase the adjustment amplitude of the percentile.
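To make the bins/percentile interplay concrete, here is a self-contained sketch (not the plugin's code; the function name is invented) of picking a clipping threshold from a histogram of absolute values:

```python
def percentile_threshold(values, percentile=99.99, bins=2048):
    """Sketch of a percentile-style observer: estimate the given
    percentile of |x| from a histogram and use it as the clipping max."""
    mags = [abs(v) for v in values]
    top = max(mags)
    if top == 0:
        return 0.0
    width = top / bins  # more bins -> finer-grained candidate thresholds
    hist = [0] * bins
    for m in mags:
        hist[min(bins - 1, int(m / width))] += 1
    target = percentile / 100 * len(mags)
    cum = 0
    for i, count in enumerate(hist):
        cum += count
        if cum >= target:
            # right edge of the bin where the percentile falls
            return (i + 1) * width
    return top
```

The threshold granularity is top / bins, which is why increasing bins refines the tuning at the cost of extra computation.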

  3. mse. The only adjustable parameter is stride. With the default stride of 1, the method tries each of the 100 quantiles of the maximum value and selects the one with the smallest error (L2 distance) between the values before and after quantize-dequantize. This method is time-consuming for large models; increasing stride within a reasonable range reduces the time while preserving accuracy, but too large a stride will hurt accuracy. Note that tuning this method's parameter can only optimize the time consumption, not significantly improve the accuracy.
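The search described above can be sketched as follows (a toy version, not the MSEObserver implementation; the candidate count and names are invented):

```python
def mse_threshold(values, stride=1, num_candidates=100, qmax=127):
    """Sketch of an mse-style observer: linearly search clipping
    thresholds (fractions of the max) and keep the one minimizing the
    L2 error between original and quantize-dequantized values."""
    top = max(abs(v) for v in values)
    best_t, best_err = top, float("inf")
    # a larger stride skips candidates: faster, but possibly less accurate
    for k in range(1, num_candidates + 1, stride):
        t = top * k / num_candidates
        scale = t / qmax
        err = 0.0
        for v in values:
            q = max(-qmax - 1, min(qmax, round(v / scale)))
            err += (v - q * scale) ** 2
        if err < best_err:
            best_err, best_t = err, t
    return best_t
```

Increasing stride only shrinks the candidate set, which matches the note above: it trades time for (potentially) accuracy, and cannot make the result better than the full search.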

  4. kl. There are two adjustable parameters, bin and update_interval. Because this method is very time-consuming, it is not recommended to change the default bin. update_interval is 1 by default; increasing it reduces the time consumed, but you need to make sure update_interval is smaller than the total number of calibration steps, otherwise no valid quantization parameters can be obtained.
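A heavily simplified sketch of the idea (not the KLObserver implementation; the bin and candidate counts are arbitrary): pick the clipping threshold whose quantize-dequantized distribution stays closest, in KL divergence, to the original distribution.

```python
import math

def _histogram(xs, lo, hi, bins):
    # normalized histogram of xs over [lo, hi]
    h = [0] * bins
    width = (hi - lo) / bins
    for x in xs:
        h[min(bins - 1, max(0, int((x - lo) / width)))] += 1
    n = len(xs)
    return [c / n for c in h]

def kl_threshold(xs, bins=64, num_candidates=20, qmax=127):
    """Sketch of a kl-style observer: choose the candidate clipping
    threshold minimizing KL(original distribution || quantized one)."""
    hi = max(abs(x) for x in xs)
    ref = _histogram(xs, -hi, hi, bins)
    best_t, best_kl = hi, float("inf")
    for k in range(1, num_candidates + 1):
        t = hi * k / num_candidates
        scale = t / qmax
        dq = [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in xs]
        cand = _histogram(dq, -hi, hi, bins)
        eps = 1e-12  # avoid log(0) on empty bins
        kl = sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(ref, cand))
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t
```

The KL computation over the full histogram for every candidate is what makes this family of methods slow, which is why increasing update_interval (computing KL less often) saves time without hurting accuracy.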

  5. mix. This method is a hybrid calibration: for each place where statistics are needed, it tries different parameters of the percentile method and selects the one with the smallest error (L2 distance) between the values before and after quantize-dequantize. It is highly automated, with no parameters to tune.
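In the same spirit (a toy sketch, not the MixObserver implementation; the candidate list is invented), mix can be thought of as percentile with automatic parameter selection:

```python
def mix_threshold(values, percentiles=(99.5, 99.9, 99.99, 100.0), qmax=127):
    """Sketch of a mix-style observer: try several percentile settings
    and keep the one with the smallest quantize-dequantize L2 error."""
    mags = sorted(abs(v) for v in values)
    best_t, best_err = mags[-1], float("inf")
    for p in percentiles:
        # threshold at the p-th percentile of the magnitudes
        idx = min(len(mags) - 1, int(p / 100 * len(mags)))
        t = mags[idx]
        if t == 0:
            continue
        scale = t / qmax
        err = sum(
            (v - max(-qmax - 1, min(qmax, round(v / scale))) * scale) ** 2
            for v in values
        )
        if err < best_err:
            best_err, best_t = err, t
    return best_t
```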

Tuning Techniques

  1. When calibrating, the more data the better, but because of diminishing returns, once the amount of data reaches a certain level the accuracy improvement becomes very limited. If your training set is small, you can use all of it for calibration; if it is large, select a subset of suitable size based on the calibration time you can afford, and use at least 10 to 100 calibration steps.

  2. Horizontal flip augmentation is acceptable, but do not use mosaic augmentation. Try to use the inference-stage pre-processing together with the training data for calibration.

  3. The batch size should be as large as possible, but it can be reduced if the data is noisy or the model has many outliers. This parameter should be determined while trying the min_max method.

  4. average_constant controls the effect of each step on the running minimum and maximum: the smaller average_constant is, the smaller the effect of the current step and the larger the effect of the historical moving average. This parameter should be adjusted between 0.01 and 0.5 according to the amount of data. When there is enough data (more than 100 steps), average_constant can be 0.01; when there is not enough data, it can be increased as appropriate; in the extreme case of only 2 steps of data, average_constant can be 0.5. This parameter should be determined while trying the min_max method, and all other methods should then reuse it.

  5. When the accuracy of the calibrated model is good, fixing the quantization parameters of the feature maps for QAT training can give better results; when the accuracy is poor, the quantization parameters obtained from calibration should not be fixed. There is no clear standard for what counts as good; you need to experiment. For example, if the floating-point accuracy of a model is 100 and the calibration accuracy is 50, the accuracy is clearly not good; but if the calibration accuracy is 95, whether it is good enough to fix the feature-map quantization parameters has to be tried. The usual practice is to run experiments comparing fixed and unfixed parameters.

  6. Try the min_max method first: it is the fastest and is used to run through the calibration process and to tune and fix the batch size and average_constant parameters. Then try the percentile, kl, mse, and mix methods and select the one that works best.

Observer Parameters

class horizon_plugin_pytorch.quantization.observer_v2.KLObserver(bins: int = 512, update_interval: int = 1, averaging_constant: float = 0.01, ch_axis: int = -1, dtype: dtype | QuantDType = 'qint8', qscheme: qscheme = torch.per_tensor_symmetric, quant_min: int = None, quant_max: int = None, is_sync_quantize: bool = False, factory_kwargs: Dict = None)

KL observer.

KL observer based on histogram. Histogram is calculated online and won’t be saved.

  • Parameters:
    • bins – Number of histogram bins.
    • update_interval – Interval of computing KL entropy and updating min/max. KLObserver constantly collects histograms of activations but only performs the KL calculation when update_interval is reached. If it is set to 1, KL entropy is computed every forward step. A larger interval costs less time and does no harm to calibration accuracy; setting it to the total number of calibration steps gives the best performance. update_interval must be no greater than the total calibration steps, otherwise no min/max will be computed.
    • averaging_constant – Averaging constant for min/max.
    • ch_axis – Channel axis.
    • dtype – Quantized data type.
    • qscheme – Quantization scheme to be used.
    • quant_min – Min quantization value. Will follow dtype if unspecified.
    • quant_max – Max quantization value. Will follow dtype if unspecified.
    • is_sync_quantize – Whether to sync statistics when training with multiple devices.
    • factory_kwargs – kwargs which are passed to factory functions for min_val and max_val.

forward(x_orig)

Defines the computation performed at every call.

Should be overridden by all subclasses.

NOTE

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class horizon_plugin_pytorch.quantization.observer_v2.MSEObserver(stride: int = 1, averaging_constant: float = 0.01, ch_axis: int = -1, dtype: dtype | QuantDType = 'qint8', qscheme: qscheme = torch.per_tensor_symmetric, quant_min: int = None, quant_max: int = None, is_sync_quantize: bool = False, factory_kwargs: Dict = None)

MSE observer.

Observer module for computing the quantization parameters based on the Mean Square Error (MSE) between the original tensor and the quantized one.

This observer linear searches the quantization scales that minimize MSE.

  • Parameters:
    • stride – Searching stride. A larger value gives a smaller search space, which means less computing time but possibly poorer accuracy. Default is 1; values no greater than 20 are suggested.
    • averaging_constant – Averaging constant for min/max.
    • ch_axis – Channel axis.
    • dtype – Quantized data type.
    • qscheme – Quantization scheme to be used.
    • quant_min – Min quantization value. Will follow dtype if unspecified.
    • quant_max – Max quantization value. Will follow dtype if unspecified.
    • is_sync_quantize – Whether to sync statistics when training with multiple devices.
    • factory_kwargs – kwargs which are passed to factory functions for min_val and max_val.

forward(x_orig)

Defines the computation performed at every call.

Should be overridden by all subclasses.

NOTE

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class horizon_plugin_pytorch.quantization.observer_v2.MinMaxObserver(averaging_constant: float = 0.01, ch_axis: int = -1, dtype: dtype | QuantDType = 'qint8', qscheme: qscheme = torch.per_tensor_symmetric, quant_min: int = None, quant_max: int = None, is_sync_quantize: bool = False, factory_kwargs: Dict = None)

Min max observer.

This observer computes the quantization parameters based on minimums and maximums of the incoming tensors. The module records the moving average minimum and maximum of incoming tensors, and uses this statistic to compute the quantization parameters.

  • Parameters:
    • averaging_constant – Averaging constant for min/max.
    • ch_axis – Channel axis.
    • dtype – Quantized data type.
    • qscheme – Quantization scheme to be used.
    • quant_min – Min quantization value. Will follow dtype if unspecified.
    • quant_max – Max quantization value. Will follow dtype if unspecified.
    • is_sync_quantize – Whether to sync statistics when training with multiple devices.
    • factory_kwargs – kwargs which are passed to factory functions for min_val and max_val.

forward(x_orig)

Record the running minimum and maximum of x.

class horizon_plugin_pytorch.quantization.observer_v2.MixObserver(averaging_constant: float = 0.01, ch_axis: int = -1, dtype: dtype | QuantDType = 'qint8', qscheme: qscheme = torch.per_tensor_symmetric, quant_min: int = None, quant_max: int = None, is_sync_quantize: bool = False, factory_kwargs: Dict = None)

Mix observer.

This observer computes the quantization parameters based on multiple calibration methods and selects the quantization parameters with the smallest quantization error.

  • Parameters:
    • averaging_constant – Averaging constant for min/max.
    • ch_axis – Channel axis.
    • dtype – Quantized data type.
    • qscheme – Quantization scheme to be used.
    • quant_min – Min quantization value. Will follow dtype if unspecified.
    • quant_max – Max quantization value. Will follow dtype if unspecified.
    • is_sync_quantize – Whether to sync statistics when training with multiple devices.
    • factory_kwargs – kwargs which are passed to factory functions for min_val and max_val.

forward(x_orig)

Defines the computation performed at every call.

Should be overridden by all subclasses.

NOTE

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class horizon_plugin_pytorch.quantization.observer_v2.PercentileObserver(percentile: float = 99.99, bins: int = 2048, averaging_constant: float = 0.01, ch_axis: int = -1, dtype: dtype | QuantDType = 'qint8', qscheme: qscheme = torch.per_tensor_symmetric, quant_min: int = None, quant_max: int = None, is_sync_quantize: bool = False, factory_kwargs: Dict = None)

Percentile observer.

Percentile observer based on histogram. Histogram is calculated online and won’t be saved. The minimum and maximum are moving averaged to compute the quantization parameters.

  • Parameters:
    • percentile – Percentile of the histogram.
    • bins – Number of histogram bins.
    • averaging_constant – Averaging constant for min/max.
    • ch_axis – Channel axis.
    • dtype – Quantized data type.
    • qscheme – Quantization scheme to be used.
    • quant_min – Min quantization value. Will follow dtype if unspecified.
    • quant_max – Max quantization value. Will follow dtype if unspecified.
    • is_sync_quantize – Whether to sync statistics when training with multiple devices.
    • factory_kwargs – kwargs which are passed to factory functions for min_val and max_val.

forward(x_orig)

Defines the computation performed at every call.

Should be overridden by all subclasses.

NOTE

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class horizon_plugin_pytorch.quantization.MovingAverageMinMaxObserver(averaging_constant=0.01, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, quant_min=None, quant_max=None, is_sync_quantize=False, factory_kwargs=None)

MovingAverageMinMax Observer.

Observer module for computing the quantization parameters based on the moving average of the min and max values.

This observer computes the quantization parameters based on the moving averages of minimums and maximums of the incoming tensors. The module records the average minimum and maximum of incoming tensors, and uses this statistic to compute the quantization parameters.

  • Parameters:
    • averaging_constant – Averaging constant for min/max.
    • dtype – Quantized data type
    • qscheme – Quantization scheme to be used; only the per_tensor_symmetric scheme is supported
    • reduce_range – Reduces the range of the quantized data type by 1 bit
    • quant_min – Minimum quantization value.
    • quant_max – Maximum quantization value.
    • is_sync_quantize – Whether to use sync quantize
    • factory_kwargs – Arguments for register data buffer

forward(x_orig)

Record the running minimum and maximum of x.

class horizon_plugin_pytorch.quantization.MovingAveragePerChannelMinMaxObserver(averaging_constant=0.01, ch_axis=0, dtype=torch.qint8, qscheme=torch.per_channel_symmetric, quant_min=None, quant_max=None, is_sync_quantize=False, factory_kwargs=None)

MovingAveragePerChannelMinMax Observer.

Observer module for computing the quantization parameters based on the running per channel min and max values.

This observer uses the tensor min/max statistics to compute the per channel quantization parameters. The module records the running minimum and maximum of incoming tensors, and uses this statistic to compute the quantization parameters.

  • Parameters:
    • averaging_constant – Averaging constant for min/max.
    • ch_axis – Channel axis
    • dtype – Quantized data type
    • qscheme – Quantization scheme to be used; only per_channel_symmetric is supported
    • quant_min – Minimum quantization value.
    • quant_max – Maximum quantization value.
    • is_sync_quantize – Whether to use sync quantize
    • factory_kwargs – Arguments for register data buffer

forward(x_orig)

Defines the computation performed at every call.

Should be overridden by all subclasses.

NOTE

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
