QConfig in Detail
Definition of QConfig
The quantization mode of the model is determined by its qconfig, which must be set on the model before preparing the qat / calibration model.
Attention
For historical reasons, the plugin contains several definitions and usages of qconfig. The earlier qconfig variants will be deprecated in the near future; we recommend using only the qconfig usage described in this document.
A QConfig object can set three keywords: input, weight, and output, representing the quantization configuration of the operator's input, weight, and output respectively. When preparing the model, these configurations determine whether FakeQuantize or FakeCast nodes are inserted at the corresponding positions; None means no node is inserted.
import torch
from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.quantization.fake_cast import FakeCast
from horizon_plugin_pytorch.quantization.observer_v2 import MinMaxObserver
from horizon_plugin_pytorch.dtype import qint8
qconfig = QConfig(
input=None,
weight=FakeQuantize.with_args(
observer=MinMaxObserver,
dtype=qint8,
qscheme=torch.per_channel_symmetric,
ch_axis=0,
),
output=FakeCast.with_args(dtype=torch.float16),
# activation=xxx Earlier usage, same as the output keyword. Still compatible, but it's recommended to use the output keyword.
)
Definition of FakeQuantize
FakeQuantize is a fake quantization node that quantizes and then dequantizes its input. Inserting fake quantization nodes simulates, in the forward pass of a floating-point model, the errors that quantization introduces. horizon_plugin_pytorch supports three types of fake quantization: FakeQuantize, PACTFakeQuantize, and _LearnableFakeQuantize. We recommend the statistics-based FakeQuantize. This document does not cover PACTFakeQuantize and _LearnableFakeQuantize; if you need them, please read the corresponding papers first.
# statistic-based FakeQuantize
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
# https://arxiv.org/pdf/1805.06085
from horizon_plugin_pytorch.quantization.pact_fake_quantize import PACTFakeQuantize
# https://arxiv.org/pdf/1902.08153
from horizon_plugin_pytorch.quantization._learnable_fake_quantize import _LearnableFakeQuantize
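The quantize-dequantize round trip that fake quantization performs can be sketched without the plugin. The `fake_quantize` helper and the MinMaxObserver-style scale computation below are illustrative assumptions, not the plugin's implementation:

```python
# Minimal sketch of symmetric int8 fake quantization: map a value onto the
# integer grid, saturate to the dtype range, and map back to float. The
# difference between input and output is the simulated quantization error.

def fake_quantize(x, scale, quant_min=-128, quant_max=127):
    q = round(x / scale)                   # quantize to the integer grid
    q = max(quant_min, min(quant_max, q))  # saturate to the int8 range
    return q * scale                       # dequantize back to float

# MinMaxObserver-style scale: largest absolute value seen in the data,
# divided by the largest representable magnitude of the quantized dtype.
data = [0.3, -1.2, 0.7, 1.0]
scale = max(abs(v) for v in data) / 128

out = [fake_quantize(v, scale) for v in data]
# Within the representable range, the error is bounded by scale / 2.
```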
You can call the with_args method of FakeQuantize to get a constructor and use it to build a qconfig, as shown in the previous section. with_args accepts the parameters supported by both FakeQuantize and the observer, so in theory every parameter declared in the __init__ methods of the FakeQuantize and observer classes can be configured. To avoid unnecessary detail, however, we recommend configuring only the observer-related parameters.
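Conceptually, with_args behaves like functools.partial: it pre-binds arguments and returns a constructor that can be called later, once per insertion point. A stdlib sketch (FakeQuantizeSketch is a hypothetical stand-in, not the real class):

```python
from functools import partial

class FakeQuantizeSketch:
    """Hypothetical stand-in for FakeQuantize, to illustrate with_args."""

    def __init__(self, observer=None, dtype="qint8", averaging_constant=0.01):
        self.observer = observer
        self.dtype = dtype
        self.averaging_constant = averaging_constant

    @classmethod
    def with_args(cls, **kwargs):
        # Return a callable that constructs the node later with these
        # arguments already bound.
        return partial(cls, **kwargs)

ctor = FakeQuantizeSketch.with_args(dtype="qint16", averaging_constant=0)
node_a, node_b = ctor(), ctor()  # each call builds an independent instance
```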
Different observers have different parameters. Below are examples of constructing FakeQuantize with commonly used observers. For the specific usage of other observers, see the calibration chapter.
import torch
from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.quantization.observer_v2 import MinMaxObserver, FixedScaleObserver, MSEObserver
from horizon_plugin_pytorch.dtype import qint8
# The __init__ method of MinMaxObserver includes many parameters, all of which can be controlled through with_args.
# We recommend setting only a few of them, as in the fq_constructor_1 example below.
# def __init__(
# self,
# averaging_constant: float = 0.01,
# ch_axis: int = -1,
# dtype: Union[torch.dtype, QuantDType] = qint8,
# qscheme: torch.qscheme = torch.per_tensor_symmetric,
# quant_min: int = None,
# quant_max: int = None,
# is_sync_quantize: bool = False,
# factory_kwargs: Dict = None,
# ) -> None:
fq_constructor_1 = FakeQuantize.with_args(
observer=MinMaxObserver, # Suitable for input/output/weight in qat and weight in calibration.
averaging_constant=0.01, # When performing qat after calibration, the averaging_constant of input/output can be set to 0 to fix the scale.
dtype=qint8, # Quantization type, set based on the support of the operator.
qscheme=torch.per_channel_symmetric, # Only weight supports per-channel quantization.
ch_axis=0, # Specify the channel for per-channel quantization.
)
# Similarly, you can check the __init__ method of FixedScaleObserver and MSEObserver to learn the configurable parameters.
fq_constructor_2 = FakeQuantize.with_args(
observer=FixedScaleObserver, # Fixed scale; will not change under any circumstances.
dtype=qint8, # Quantization type, set based on the support of the operator.
scale=INPUT_ABS_MAX / 128, # scale value, use maximum absolute value divided by the maximum quantization type value.
)
fq_constructor_3 = FakeQuantize.with_args(
observer=MSEObserver, # Suitable for input/output in calibration.
dtype=qint8, # Quantization type, set based on the support of the operator.
)
qconfig = QConfig(
weight=fq_constructor_x,
...
)
Definition of FakeCast
FakeCast is a fake cast node that converts the input to the float32 data type. If the configured data type is float16, it also simulates the truncation error introduced by converting the value to float16. This node is mainly used to mark operators that require floating-point computation.
Constructing a qconfig with FakeCast is similar to FakeQuantize, but FakeCast takes only one parameter.
import torch
from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_cast import FakeCast
qconfig = QConfig(
input=FakeCast.with_args(dtype=torch.float16), # set based on the support of the operator.
...
)
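The float16 error that FakeCast simulates can be reproduced with the standard library alone: pack a value into the IEEE half-precision format and unpack it again. This is an illustrative sketch, not the plugin's implementation:

```python
import struct

def fake_cast_fp16(x: float) -> float:
    """Round x to the nearest representable float16, then back to float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(fake_cast_fp16(0.1))        # 0.0999755859375: small rounding error
print(fake_cast_fp16(1234.5678))  # 1235.0: float16 spacing is 1.0 near 1234
print(fake_cast_fp16(1e-8))       # 0.0: underflows below float16 subnormals
```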
Construct QConfig
- Construct the QConfig object directly, as introduced above. This method is flexible, allowing the configuration of any configurable parameter, but requires a deep understanding of QConfig.
- Use the get_qconfig interface. This interface is simpler and easier to use than constructing QConfig objects directly, but less flexible, and cannot cover advanced requirements.
import torch
from horizon_plugin_pytorch.quantization import get_qconfig
from horizon_plugin_pytorch.quantization.observer_v2 import MinMaxObserver
from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.dtype import qint8
# qconfig_1 / qconfig_2 / qconfig_3 / qconfig_4 are equivalent.
qconfig_1 = QConfig(
weight=FakeQuantize.with_args(
observer=MinMaxObserver,
averaging_constant=0.01,
dtype=qint8,
qscheme=torch.per_channel_symmetric,
ch_axis=0,
),
output=FakeQuantize.with_args(
observer=MinMaxObserver,
averaging_constant=0,
dtype=qint8,
qscheme=torch.per_tensor_symmetric,
ch_axis=-1,
),
)
qconfig_2 = QConfig(
weight=FakeQuantize.with_args(
observer=MinMaxObserver,
qscheme=torch.per_channel_symmetric,
ch_axis=0,
),
output=FakeQuantize.with_args(
observer=MinMaxObserver,
averaging_constant=0,
),
)
qconfig_3 = get_qconfig(
observer=MinMaxObserver, # Input and output observer types, only supports MinMaxObserver and MSEObserver in horizon_plugin_pytorch.quantization.observer_v2, default is MinMaxObserver.
in_dtype=None, # Input data type, set based on the support of the operator. None means the input keyword of QConfig is None, default is None.
weight_dtype=qint8, # Weight data type, set based on the support of the operator. None means the weight keyword of QConfig is None, default is qint8.
out_dtype=qint8, # Output data type, set based on the support of the operator. None means the output keyword of QConfig is None, default is qint8.
fix_scale=True, # Whether to fix the input and output scales.
)
qconfig_4 = get_qconfig(fix_scale=True)
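The effect of fix_scale=True corresponds to averaging_constant=0 in qconfig_1 / qconfig_2 above: running min/max statistics are updated with an exponential moving average, so a zero constant keeps the calibrated value. A sketch of the update rule (update_min_max is illustrative, not the plugin's API):

```python
def update_min_max(prev, new, averaging_constant):
    # Exponential moving average used to track running statistics; with
    # averaging_constant == 0 the previous (calibrated) value is kept.
    return prev + averaging_constant * (new - prev)

calibrated_max = 4.0
frozen = update_min_max(calibrated_max, 10.0, averaging_constant=0)    # stays 4.0
moving = update_min_max(calibrated_max, 10.0, averaging_constant=0.01) # drifts toward 10.0
```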
Use QConfig
- Directly set the qconfig attribute. This method has the highest priority, and other methods will not override the directly set qconfig.
model.qconfig = QConfig(...)
- QConfig template. Specify the qconfig setter and example_inputs on the prepare interface to automatically set qconfig for the model.
from horizon_plugin_pytorch.quantization import prepare
from horizon_plugin_pytorch.quantization.qconfig_template import (
default_qat_qconfig_setter,
)
qat_model = prepare(
model,
example_inputs=example_inputs,
qconfig_setter=default_qat_qconfig_setter,
)
QConfig Template
QConfig templates obtain the model's graph structure through subclass tracing and automatically set qconfig according to the specified rules. This is the most recommended way to set qconfig. Usage is as follows:
from horizon_plugin_pytorch.quantization import prepare
from horizon_plugin_pytorch.quantization.qconfig_template import (
default_qat_qconfig_setter,
sensitive_op_qat_8bit_weight_16bit_act_qconfig_setter
)
qat_model = prepare(
model,
example_inputs=example_inputs, # used to get model's graph structure.
qconfig_setter=( # qconfig templates, supports multiple templates with priority from high to low.
sensitive_op_qat_8bit_weight_16bit_act_qconfig_setter(table, ratio=0.2),
default_qat_qconfig_setter,
),
)
Attention
The template priority is lower than that of directly setting the model's qconfig attribute. If the model was configured with model.qconfig = xxx before prepare, the template will not take effect there. We do not recommend mixing the two methods unless there is a special need, as doing so easily causes errors; in most cases, one of the two methods is sufficient.
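The priority rules can be pictured as a first-match lookup: a qconfig set directly on a module wins, then the setters are consulted from highest to lowest priority. The resolve_qconfig helper and dict-based setters below are an illustrative model of the documented behavior, not the plugin's internals:

```python
def resolve_qconfig(module_name, direct_qconfigs, setters):
    # A qconfig set directly on the module has the highest priority.
    if module_name in direct_qconfigs:
        return direct_qconfigs[module_name]
    # Otherwise the first setter with an opinion wins.
    for setter in setters:
        qconfig = setter.get(module_name)
        if qconfig is not None:
            return qconfig
    return None

direct = {"head.fc": "manual-int16"}      # model.qconfig = ... style setting
sensitive = {"backbone.conv1": "int16"}   # higher-priority template
default = {"backbone.conv1": "int8", "head.fc": "int8", "neck.up": "int8"}
```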
Templates can be divided into three categories:
- Fixed templates. The calibration / qat / qat_fixed_act_scale variants of the fixed templates differ in the type of observer used and in the scale-updating logic; they are used for calibration, qat training, and fixed-activation-scale qat training, respectively.
The default templates ( default_calibration_qconfig_setter / default_qat_qconfig_setter / default_qat_fixed_act_qconfig_setter ) do three things:
First, they enable high-accuracy output everywhere it can be set, and give a hint for outputs that do not support high accuracy;
Then, they search forward from the grid input of each grid sample operator until the first gemm-like operator or QuantStub and set all the intermediate operators to int16. Experience shows that the grid here usually spans a wide range, so int8 is likely insufficient to meet the accuracy requirements;
Finally, they set the remaining operators to int8.
The int16 templates ( qat_8bit_weight_16bit_act_qconfig_setter / qat_8bit_weight_16bit_fixed_act_qconfig_setter / calibration_8bit_weight_16bit_act_qconfig_setter ) do two things:
First, they enable high-accuracy output everywhere it can be set, and give a hint for outputs that do not support it; second, they set the rest of the operators to int16.
from horizon_plugin_pytorch.quantization.qconfig_template import (
default_calibration_qconfig_setter,
default_qat_qconfig_setter,
default_qat_fixed_act_qconfig_setter,
qat_8bit_weight_16bit_act_qconfig_setter,
qat_8bit_weight_16bit_fixed_act_qconfig_setter,
calibration_8bit_weight_16bit_act_qconfig_setter,
)
- Sensitivity templates. The sensitivity templates are sensitive_op_calibration_8bit_weight_16bit_act_qconfig_setter, sensitive_op_qat_8bit_weight_16bit_act_qconfig_setter, and sensitive_op_qat_8bit_weight_16bit_fixed_act_qconfig_setter. They differ in the same way as the corresponding fixed templates and are likewise used for calibration, qat training, and fixed-activation-scale qat training, respectively.
The first argument of a sensitivity template is the sensitivity result generated by the accuracy debug tool; the second can be given as ratio or topk, and the template sets the topk operators with the highest quantization sensitivity to int16. Combined with a fixed template, this makes mixed-dtype tuning easy.
import torch
from horizon_plugin_pytorch.quantization import prepare
from horizon_plugin_pytorch.quantization.qconfig_template import (
default_calibration_qconfig_setter,
sensitive_op_qat_8bit_weight_16bit_act_qconfig_setter,
sensitive_op_qat_8bit_weight_16bit_fixed_act_qconfig_setter,
sensitive_op_calibration_8bit_weight_16bit_act_qconfig_setter,
)
table = torch.load("output_0-0_dataindex_1_sensitive_ops.pt")
calibration_model = prepare(
model,
example_inputs=example_input,
qconfig_setter=(
sensitive_op_calibration_8bit_weight_16bit_act_qconfig_setter(table, ratio=0.2),
default_calibration_qconfig_setter,
),
)
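What ratio / topk select can be sketched as ranking the sensitivity table and taking the top slice. The table shape and the select_sensitive helper are illustrative assumptions; the real table comes from the accuracy debug tool:

```python
def select_sensitive(table, ratio=None, topk=None):
    # Rank operators by quantization sensitivity, highest first, and take
    # either the top-k entries or the top fraction given by ratio.
    ranked = sorted(table, key=lambda item: item[1], reverse=True)
    k = topk if topk is not None else max(1, int(len(ranked) * ratio))
    return [name for name, _ in ranked[:k]]

# Hypothetical (op_name, sensitivity) pairs.
table = [("op_a", 0.9), ("op_c", 0.1), ("op_b", 0.5), ("op_d", 0.05)]
int16_ops = select_sensitive(table, ratio=0.5)  # these ops get int16 qconfig
```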
- Customized templates. The only customized template is ModuleNameQconfigSetter; you pass it a dictionary mapping module names to the corresponding qconfigs. It is generally used for special needs such as setting a fixed scale, and it can be combined with the fixed and sensitivity templates.
import torch
from horizon_plugin_pytorch.dtype import qint16
from horizon_plugin_pytorch.quantization import get_qconfig, prepare
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.quantization.observer_v2 import FixedScaleObserver
from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.qconfig_template import (
    default_qat_qconfig_setter,
    sensitive_op_qat_8bit_weight_16bit_fixed_act_qconfig_setter,
    ModuleNameQconfigSetter,
)
table = torch.load("output_0-0_dataindex_1_sensitive_ops.pt")
module_name_to_qconfig = {
"op_1": get_qconfig(),
"op_2": QConfig(
output=FakeQuantize.with_args(
observer=FixedScaleObserver,
dtype=qint16,
scale=OP2_MAX/QINT16_MAX,
)
),
}
qat_model = prepare(
model,
example_inputs=example_input,
qconfig_setter=(
ModuleNameQconfigSetter(module_name_to_qconfig),
sensitive_op_qat_8bit_weight_16bit_fixed_act_qconfig_setter(table, ratio=0.2),
default_qat_qconfig_setter,
),
)