This section introduces concepts that appear frequently in the following chapters, along with some commonly used background knowledge.
Original floating-point model
A model trained with a deep learning framework such as TensorFlow or PyTorch. This model is computed at float32 precision.
Hybrid heterogeneous model
A model format suitable for running on the Horizon computing platform. It is called a heterogeneous model because it supports model execution on both the ARM CPU and the BPU. Since operations run much faster on the BPU than on the CPU, operators are computed on the BPU whenever possible; operators not yet supported on the BPU are computed on the CPU.
Operator
Deep learning algorithms are composed of computational units called operators (also known as ops). An operator is a mapping from one function space to another. An operator's name is unique within a model, but multiple operators of the same type can exist; for example, Conv1 and Conv2 are two different operators of the same type.
Model conversion
The process of converting the original floating-point model, or the ONNX model exported from QAT, into a Horizon hybrid heterogeneous model.
Model quantization
Currently one of the most effective model optimization methods in industry. Quantization establishes a mapping between fixed-point and floating-point data to achieve inference performance gains with little precision loss. It can be understood simply as using "low-bit" numbers to represent FP32 or other types of values; e.g., FP32 → INT8 achieves 4× parameter compression, enabling faster computation with reduced memory usage.
The Quantize node is used to quantize the model's input data from the float type to the int8 type, using the following formula:

qx = clamp(round(x / scale) + zero_point, -128, 127)

where:

- round(x) rounds the floating-point number to the nearest integer.
- clamp(x) clamps the result to an integer value between -128 and 127.
- scale is the quantization scale factor.
- zero_point is the zero-point offset for asymmetric quantization; in symmetric quantization, zero_point = 0.

The C++ reference implementation is as follows:
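One possible sketch of this quantize step in C++, assuming a per-tensor scale and zero_point (the function name and signature are illustrative, not the toolchain's actual API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize a float buffer to int8 using:
//   qx = clamp(round(x / scale) + zero_point, -128, 127)
// Illustrative sketch; the platform's actual implementation may differ.
std::vector<int8_t> Quantize(const std::vector<float>& input,
                             float scale, int zero_point) {
  std::vector<int8_t> output;
  output.reserve(input.size());
  for (float x : input) {
    int q = static_cast<int>(std::round(x / scale)) + zero_point;
    q = std::max(-128, std::min(127, q));  // clamp to the int8 range
    output.push_back(static_cast<int8_t>(q));
  }
  return output;
}
```

With scale = 0.5 and zero_point = 0, an input of 1.0 maps to 2, and out-of-range values saturate at -128 or 127.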
The Dequantize node is used to dequantize the model's output data from the int8 or int32 type back to the float or double type, using the following formula:

x = (qx - zero_point) * scale
The C++ reference implementation is as follows.
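A minimal C++ sketch of this dequantize step, assuming per-tensor parameters (the function name and signature are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Dequantize an int8 buffer back to float using:
//   x = (qx - zero_point) * scale
// Illustrative sketch; the platform's actual implementation may differ.
std::vector<float> Dequantize(const std::vector<int8_t>& input,
                              float scale, int zero_point) {
  std::vector<float> output;
  output.reserve(input.size());
  for (int8_t qx : input) {
    output.push_back(static_cast<float>(qx - zero_point) * scale);
  }
  return output;
}
```

Dequantizing the value 2 with scale = 0.5 and zero_point = 0 recovers 1.0, the inverse of the quantize example.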
PTQ
PTQ (post-training quantization) conversion scheme, a quantization method that first trains a floating-point model and then uses calibration images to compute quantization parameters, converting the floating-point model into a quantized model. For more details, refer to the PTQ and QAT Introduction section.
QAT
QAT (quantization-aware training) scheme, which modifies the floating-point model structure during floating-point training so that the model can perceive the loss introduced by quantization and reduce the resulting accuracy loss. For more details, refer to the PTQ and QAT Introduction section.
Tensor
A tensor is a multidimensional array with a uniform data type. It serves as the container for the data computed by operators, holding both input and output data, and carries the tensor's descriptive information: its name, shape, data layout, data type, and so on.
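The information a tensor carries can be pictured as a small data structure; the following is only an illustrative sketch, not the toolchain's actual type:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch of tensor metadata; the field and type names are
// assumptions, not the toolchain's actual definitions.
struct TensorInfo {
  std::string name;            // unique within a model
  std::vector<int64_t> shape;  // e.g. {1, 3, 224, 224}
  std::string layout;          // data layout, e.g. "NCHW" or "NHWC"
  std::string dtype;           // uniform element type, e.g. "float32"
};

// Number of elements implied by the shape.
int64_t NumElements(const TensorInfo& t) {
  int64_t n = 1;
  for (int64_t d : t.shape) n *= d;
  return n;
}
```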
Data layout
In deep learning, multidimensional data is stored in multidimensional arrays (tensors), and neural network feature maps are usually stored in a four-dimensional (4D) format with the following four dimensions:

- N: batch size
- C: number of channels
- H: height
- W: width
However, memory is linear, so the four dimensions must be stored in some order, and different layout formats affect computational performance. The common data storage formats are NCHW and NHWC: NCHW stores all values of one channel contiguously (W varies fastest, then H, then C), while NHWC stores the channel values of each pixel contiguously (C varies fastest).
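The difference between the two layouts shows up in how a 4D index maps to a linear memory offset; the following C++ helpers are an illustrative sketch, not part of any toolchain API:

```cpp
#include <cstdint>

// Linear offset of element (n, c, h, w) in a tensor of logical shape
// N x C x H x W, for each of the two common layouts.

// NCHW: w varies fastest, then h, then c, then n.
int64_t OffsetNCHW(int64_t n, int64_t c, int64_t h, int64_t w,
                   int64_t C, int64_t H, int64_t W) {
  return ((n * C + c) * H + h) * W + w;
}

// NHWC: c varies fastest, then w, then h, then n.
int64_t OffsetNHWC(int64_t n, int64_t c, int64_t h, int64_t w,
                   int64_t H, int64_t W, int64_t C) {
  return ((n * H + h) * W + w) * C + c;
}
```

For a 1×2×2×3 (N×C×H×W) tensor, element (0, 1, 0, 2) sits at offset 8 in NCHW but at offset 5 in NHWC.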
Data type
The image data types commonly used in this document include rgb, bgr, gray, yuv444, nv12, and featuremap.
For more information about abbreviations in the documents, please refer to the section Common Abbreviations.