Similar to PyTorch's quantization-aware training, our horizon_plugin_pytorch is designed and built on top of fx, and therefore requires a floating-point model that can successfully complete symbolic_trace.
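As a minimal sketch, you can check traceability with plain torch.fx before handing the model to the plugin (horizon_plugin_pytorch layers its own handling on top of fx, so passing this check is necessary but not necessarily sufficient):

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyNet(nn.Module):
    """A hypothetical example model with no data-dependent control flow."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # No Python branching on tensor values here, so fx can trace it.
        return self.relu(self.conv(x))

traced = symbolic_trace(TinyNet())
print(traced.graph)  # inspect the captured ops
```

If symbolic_trace raises an error (typically due to data-dependent control flow or unsupported Python constructs in forward), the model must be rewritten before it can be quantized.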
Since the BPU supports only a subset of operators, horizon_plugin_pytorch supports only the operators in the operator list, plus special operators defined internally to work around BPU limitations.
Converting a floating-point model into a fixed-point model inevitably introduces some accuracy error. The more quantization-friendly the floating-point model is, the easier it is to reach good QAT accuracy, and the higher the accuracy after quantization. In general, the following situations can make a model quantization-unfriendly:
Using operators with accuracy risk, for example softmax, layernorm, etc. (see the op documentation). These operators are usually implemented under the hood via table lookup or by stitching together multiple ops, and are prone to accuracy drops.
Calling the same operator multiple times in one forward. Each call has a different output distribution, but only one set of quantization parameters is collected; when the output distributions of the calls differ too much, the quantization error grows.
Large differences between the value ranges of the inputs of multi-input operators such as add and cat, which may cause large errors.
Unreasonable data distributions. horizon_plugin_pytorch adopts uniform symmetric quantization, so a zero-mean, uniform distribution is ideal; long tails and outliers should be avoided as much as possible. At the same time, the value range must match the quantization bit width: if you use int8 to quantize data distributed uniformly over [-1000, 1000], the precision is clearly insufficient. For example, of the following three distributions, quantization friendliness decreases from left to right; the distribution of most values in the model should look like the middle one. In practice, you can use the debug tool to check whether the distributions of the model weights and feature maps are quantization-friendly. Because of model redundancy, some ops whose distributions look very quantization-unfriendly will not noticeably reduce the final accuracy of the model; this should be judged together with the actual difficulty of QAT training and the final quantization accuracy achieved.
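The [-1000, 1000] example above can be made concrete with a little arithmetic. This is an illustrative sketch of uniform symmetric int8 quantization (the actual observer and rounding logic in horizon_plugin_pytorch may differ in detail):

```python
# int8 symmetric quantization has 256 levels over [qmin, qmax].
qmin, qmax = -128, 127

def quantize(x, scale):
    """Round x onto the int8 grid with the given scale, then dequantize."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

# One scale must cover the whole range, so the grid step is huge:
scale_wide = 1000.0 / qmax   # one quantization step, about 7.87

print(quantize(3.0, scale_wide))  # small values round to 0.0 and vanish

# For data in [-1, 1] the step would be 1 / 127, about 0.0079, so the
# worst-case rounding error (scale / 2) is roughly 1000x smaller.
scale_narrow = 1.0 / qmax
print(quantize(0.5, scale_narrow))
```

The worst-case error per value is half the step size, which is why a value range far larger than what the bit width can resolve directly translates into accuracy loss.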

So how can we make a model more quantization-friendly? Specifically:
Minimize the use of operators with high accuracy risk; see the op documentation for details.
Ensure that the output distributions of multiple calls to a shared operator do not differ too much, or split the shared operator into separate instances.
Avoid large differences in the value ranges of the different inputs of multi-input operators.
Use int16 to quantize ops whose value ranges are very large and whose quantization errors are high; such ops can be found with the debug tool.
Prevent the model from overfitting by increasing weight decay and adding data augmentation. Overfitted models tend to produce large values and are very sensitive to their inputs, so a small error can make the output completely wrong.
Use BN.
Normalize the model inputs so that they are symmetric around zero.
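Suggestion 2 (splitting a shared operator) can be sketched as follows. The module and class names here are hypothetical; the point is that each call site gets its own module instance, and therefore its own set of quantization parameters:

```python
import torch
import torch.nn as nn

class SharedAct(nn.Module):
    """Quantization-unfriendly pattern: one module reused for two branches."""
    def __init__(self):
        super().__init__()
        self.act = nn.ReLU()

    def forward(self, a, b):
        # One observer must cover two potentially different distributions.
        return self.act(a), self.act(b)

class SplitAct(nn.Module):
    """Quantization-friendly rewrite: a separate instance per call site."""
    def __init__(self):
        super().__init__()
        self.act_a = nn.ReLU()
        self.act_b = nn.ReLU()

    def forward(self, a, b):
        # Each instance collects its own quantization parameters.
        return self.act_a(a), self.act_b(b)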
Note that QAT itself has some capacity to adapt: quantization-unfriendly does not mean unquantizable, and in many cases the model can still be quantized well even when the phenomena above occur. Since the suggestions above may also degrade floating-point model accuracy, they should be attempted only when the QAT accuracy target cannot be reached (especially suggestions 1 - 5); in the end you should find a balance between floating-point accuracy and quantized accuracy.