Depending on whether the model's parameters are further adjusted after quantization, quantization methods can be divided into Quantization-aware Training (QAT) and Post-training Quantization (PTQ).
The difference between the two workflows is illustrated in the following diagram (left: PTQ; right: QAT).

PTQ uses a small batch of calibration data to calibrate a trained model, converting the trained FP32 model directly into a fixed-point model without any retraining. Only a few hyperparameters need to be tuned to complete the quantization, so the process is simple and fast. Because no training is required, PTQ is widely used in both edge-side and cloud-side deployment scenarios; we recommend trying PTQ first to see whether it meets your accuracy and performance requirements.
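The calibration step above can be sketched in a few lines. This is a minimal, framework-free illustration (not any particular library's API): it collects min/max statistics from calibration batches, derives an int8 scale and zero-point, and quantizes data with them. The function names are hypothetical.

```python
import numpy as np

def calibrate_minmax(calibration_batches):
    """Collect the observed value range over the calibration data (asymmetric min/max)."""
    lo = min(float(b.min()) for b in calibration_batches)
    hi = max(float(b.max()) for b in calibration_batches)
    return lo, hi

def quant_params(lo, hi, num_bits=8):
    """Derive a scale and zero-point mapping [lo, hi] onto the signed int range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# "Calibration data" here is random tensors standing in for real activations.
rng = np.random.default_rng(0)
batches = [rng.normal(0, 1, size=(32, 64)).astype(np.float32) for _ in range(4)]
lo, hi = calibrate_minmax(batches)
scale, zp = quant_params(lo, hi)

x = batches[0]
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
# For in-range values, the round-trip error stays within one quantization step.
print(float(np.abs(x - x_hat).max()), "<=", scale)
```

No weights are updated anywhere in this flow: the only "tuning" is the choice of range statistics, which is what makes PTQ fast.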
QAT quantizes a trained model and then retrains it. Since fixed-point values cannot be used directly in backward gradient computation, the actual procedure is to insert fake-quantization nodes in front of certain ops, so that the clipped value ranges of the data flowing through each op are recorded during training; these ranges can then be used directly when the nodes are quantized for deployment. The best quantization parameters are obtained by continuously optimizing accuracy during training. Because model training is involved, QAT demands a higher level of skill from developers.
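The fake-quantization node described above can be sketched as follows. This is an illustrative NumPy version, not a specific framework's implementation: the forward pass quantizes and immediately dequantizes (so training sees the rounding error while the tensor stays in floating point), and the backward pass uses the common straight-through estimator (STE), which passes gradients through unchanged for in-range values and blocks them for clipped values. In real QAT the scale and zero-point would be updated from running statistics or learned; here they are fixed for clarity.

```python
import numpy as np

def fake_quantize(x, scale, zero_point, num_bits=8):
    """Forward pass: quantize then dequantize, staying in floating point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(np.float32)

def fake_quantize_grad(x, scale, zero_point, num_bits=8):
    """Backward pass (STE): rounding has zero gradient almost everywhere,
    so the gradient is passed through as-is inside the representable range
    and zeroed where the input was clipped."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo = (qmin - zero_point) * scale
    hi = (qmax - zero_point) * scale
    return ((x >= lo) & (x <= hi)).astype(np.float32)

x = np.array([-3.0, -0.4, 0.0, 0.7, 3.0], dtype=np.float32)
scale, zp = 0.02, 0  # int8 range with this scale covers roughly [-2.56, 2.54]
y = fake_quantize(x, scale, zp)
g = fake_quantize_grad(x, scale, zp)
print(y)  # values snapped to the quantization grid, extremes clipped
print(g)  # gradient mask: zero where the input fell outside the range
```

Because the node outputs floating-point values, it can be dropped into any point of the network during training; at deployment time it is replaced by a true fixed-point quantize op using the ranges it recorded.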