The Horizon J6 algorithm toolchain (hereinafter referred to as the toolchain) is a complete edge-AI algorithm solution that helps you quantize floating-point models into fixed-point models and rapidly deploy self-developed algorithm models on Horizon computing platforms.
Currently, most models trained on GPUs are floating-point models; that is, their parameters are stored in the float data type.
Computing platforms with Horizon's BPU architecture use int8 compute precision (a common precision for computing platforms in the industry) and are capable of running fixed-point quantized models.
Quantization is the process of converting a trained floating-point model into a fixed-point model.
In addition, model quantization effectively reduces model size and accelerates deep-learning inference, so it is widely studied and applied in both academia and industry.
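To make the float-to-fixed-point mapping concrete, here is a minimal sketch of affine (scale/zero-point) quantization in NumPy. It is a generic illustration of the arithmetic, not the toolchain's implementation; the function names are hypothetical:

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map float values to unsigned 8-bit integers with an affine (scale/zero-point) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover approximate float values from the fixed-point representation."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = quantize_affine(x)
x_hat = dequantize_affine(q, scale, zp)
# The round-trip error is bounded by half the quantization step (scale / 2),
# which is the precision loss a fixed-point model trades for smaller size
# and faster integer arithmetic.
```

The same idea scales from this toy array to whole tensors of weights and activations; the per-tensor `scale` and `zero_point` are what a quantization toolchain must determine for every layer.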
Depending on whether model parameters are adjusted through training after quantization, quantization methods fall into two classes: post-training quantization (PTQ) and quantization-aware training (QAT).
The difference in workflow between the two methods is shown in the following diagram (left: PTQ; right: QAT).

PTQ uses a batch of calibration data to calibrate a trained model, converting the trained FP32 model directly into a fixed-point model without any retraining. Because the quantization process is simple and fast, requiring few hyperparameter adjustments and no training, this method is widely used in a large number of edge-side and cloud-side deployment scenarios. We recommend that you try the PTQ method first to see whether it meets your deployment accuracy and performance requirements. For more information about the PTQ scheme, please read Post-training Quantization (PTQ).
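The calibration step at the heart of PTQ can be sketched generically: run representative data through the trained float model, record the value range each tensor takes, and derive fixed-point scales from those ranges. The observer class and stand-in layer below are hypothetical illustrations, not the toolchain's API:

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of every tensor it sees during calibration."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, x):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def scale(self, num_bits=8):
        """Symmetric int8 scale derived from the observed range."""
        max_abs = max(abs(self.lo), abs(self.hi))
        return max_abs / (2 ** (num_bits - 1) - 1)  # maps [-max_abs, max_abs] to [-127, 127]

# Calibration: feed a few representative batches through the trained float
# model (here, a stand-in linear layer) and record activation ranges.
rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8)).astype(np.float32)
observer = MinMaxObserver()
for _ in range(10):                       # a small calibration set suffices
    batch = rng.normal(size=(2, 8)).astype(np.float32)
    observer.observe(batch @ weight.T)    # record the layer's output range

scale = observer.scale()
# `scale` is then baked into the fixed-point model at conversion time;
# no gradient update ever touches the original weights.
```

This is why PTQ is fast: the only data-dependent work is a handful of forward passes to collect statistics.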
QAT quantizes a trained model and then trains it again. Since fixed-point values cannot be used for backward gradient computation, the actual procedure inserts fake-quantization nodes in front of certain operators to record the truncated value ranges of the data flowing through each op during training, so that these ranges can be readily used when quantizing the nodes at deployment time. The quantization parameters are continuously optimized during training to reach the best accuracy. Because model training is involved, QAT demands a higher level of technical skill from developers. For more information about the QAT scheme, please read Quantization-Aware Training (QAT).
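A fake-quantization node can be sketched generically as a quantize-then-dequantize forward pass paired with a straight-through gradient estimator. The functions below are an illustrative sketch of that standard technique, not the toolchain's operators:

```python
import numpy as np

def fake_quantize(x, scale, num_bits=8):
    """Forward pass of a fake-quant node: quantize, then immediately dequantize,
    so training sees the rounding/clipping error while staying in float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def fake_quantize_grad(x, grad_out, scale, num_bits=8):
    """Backward pass via the straight-through estimator: the gradient passes
    through unchanged wherever x fell inside the clipping range, and is
    zeroed where the value was clipped."""
    qmax = 2 ** (num_bits - 1) - 1
    inside = (x / scale >= -qmax - 1) & (x / scale <= qmax)
    return grad_out * inside

x = np.array([-2.0, -0.003, 0.0, 0.004, 2.0], dtype=np.float32)
y = fake_quantize(x, scale=0.01)
# y holds float values that lie exactly on the int8 grid (multiples of 0.01),
# so downstream layers train against the precision they will see at deployment.
g = fake_quantize_grad(x, np.ones_like(x), scale=0.01)
```

The straight-through trick is what lets gradient descent keep optimizing the weights even though rounding itself has zero gradient almost everywhere.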
The toolchain consists of PTQ, QAT, embedded compilation, and other components. The Runtime SDK provides runtime library support for heterogeneous models. The runtime library contains an ARM part and an x86 part, used to execute heterogeneous models on Horizon computing platforms and on an x86 simulation platform, respectively.
For more information about embedded application development, please read Embedded Application Development.
In addition, the toolchain provides rich development tools, samples, and model releases with a large number of built-in algorithm models to help you get started quickly and improve your development efficiency.
The overall workflow of using the toolchain is shown in the figure below. We recommend that you try the PTQ method first to see whether it meets your deployment accuracy and performance requirements.
