Model Inference Application Development Guide

Overview

This chapter introduces how to develop model inference applications on the Horizon platform and highlights relevant considerations you need to be aware of.

Attention

Before starting application development, make sure that you have completed the development environment preparations described in Environment Deployment.

The simplest application development can be divided into three stages: project creation, project implementation, and project compilation and running.

However, because real business scenarios are more complex, we also explain multi-model control concepts and offer suggestions on application tuning.

Project Creation

We recommend using CMake to manage your application project.

As described in the previous sections, you should already have CMake installed; this section assumes you know how to use it.

The Horizon Development Library provides relevant project dependencies. The specific dependencies are listed below:

  • The Horizon deployment libraries libdnn.so and libucp.so under ${OE_DIR}/samples/ucp_tutorial/deps_aarch64/ucp/.
  • The aarch64-linux-gnu-gcc C compiler.
  • The aarch64-linux-gnu-g++ C++ compiler.
Note

The ${OE_DIR} above refers to the OE package path provided by Horizon.

To create a new project, you need to write the CMakeLists.txt file.

The CMakeLists.txt file defines some compilation options, as well as the path to the dependency libs and header files, as follows:

cmake_minimum_required(VERSION 2.8)
project(your_project_name)

# libdnn.so depends on system software dynamic link libraries; use
# -Wl,-unresolved-symbols=ignore-in-shared-libs to suppress unresolved symbols at link time
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wl,-unresolved-symbols=ignore-in-shared-libs")
set(CMAKE_CXX_FLAGS_DEBUG " -Wall -Werror -g -O0 ")
set(CMAKE_C_FLAGS_DEBUG " -Wall -Werror -g -O0 ")
set(CMAKE_CXX_FLAGS_RELEASE " -Wall -Werror -O3 ")
set(CMAKE_C_FLAGS_RELEASE " -Wall -Werror -O3 ")

if (NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE Release)
endif ()
message(STATUS "Build type: ${CMAKE_BUILD_TYPE}")

# define dnn lib path
set(DNN_PATH "${OE_DIR}/samples/ucp_tutorial/deps_aarch64/ucp/")
set(DNN_LIB_PATH ${DNN_PATH}/lib)

include_directories(${DNN_PATH}/include)
link_directories(${DNN_LIB_PATH})

add_executable(user_app main.cc)
target_link_libraries(user_app dnn ucp pthread rt dl)
Note

In the above sample, we did not specify the compiler location. We will specify it at the project compilation stage, as described in the section Project Compilation and Running.

Project Implementation

This section explains how to run hbm models on Horizon platforms.

The simplest procedure consists of model loading, input data preparation, output memory preparation, inference, and result parsing. Sample code for a simple model deployment is as follows:

#include <iostream>
#include <vector>

#include "hobot/dnn/hb_dnn.h"
#include "hobot/hb_ucp.h"
#include "hobot/hb_ucp_sys.h"

int main(int argc, char **argv) {
  // Step 1: Load the model
  hbDNNPackedHandle_t packed_dnn_handle;
  const char *model_file_name = "./mobilenetv1/mobilenetv1_224x224_nv12.hbm";
  hbDNNInitializeFromFiles(&packed_dnn_handle, &model_file_name, 1);

  // Step 2: Get model names
  const char **model_name_list;
  int model_count = 0;
  hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle);

  // Step 3: Get dnn_handle
  hbDNNHandle_t dnn_handle;
  hbDNNGetModelHandle(&dnn_handle, packed_dnn_handle, model_name_list[0]);

  // Step 4: Prepare input data
  int input_count = 0;
  hbDNNGetInputCount(&input_count, dnn_handle);
  std::vector<hbDNNTensor> input(input_count);
  for (int i = 0; i < input_count; i++) {
    hbDNNGetInputTensorProperties(&input[i].properties, dnn_handle, i);
    auto &mem = input[i].sysMem[0];
    /* 1. For dynamic input, set the corresponding dynamic parameters in
     *    input[i].properties.
     * 2. Call hbUCPMalloc/hbUCPMallocCached to allocate memory of the size
     *    required by the model input, e.g. hbUCPMallocCached(&mem, size, 0);
     * 3. Determine from the properties whether the input needs quantization
     *    or padding, then fill the input data into mem.
     * 4. If the memory is cacheable, you must explicitly flush it after
     *    writing, e.g. hbUCPMemFlush(&mem, HB_SYS_MEM_CACHE_CLEAN); */
  }

  // Step 5: Prepare storage space for the model's output data
  int output_count = 0;
  hbDNNGetOutputCount(&output_count, dnn_handle);
  std::vector<hbDNNTensor> output(output_count);
  for (int i = 0; i < output_count; i++) {
    hbDNNTensorProperties &output_properties = output[i].properties;
    hbDNNGetOutputTensorProperties(&output_properties, dnn_handle, i);
    int out_aligned_size = output_properties.alignedByteSize;
    hbUCPSysMem &mem = output[i].sysMem[0];
    hbUCPMallocCached(&mem, out_aligned_size, 0);
  }

  // Step 6: Create the asynchronous inference task
  hbUCPTaskHandle_t task_handle{nullptr};
  hbDNNInferV2(&task_handle, output.data(), input.data(), dnn_handle);

  // Step 7: Submit the task
  hbUCPSchedParam infer_sched_param;
  HB_UCP_INITIALIZE_SCHED_PARAM(&infer_sched_param);
  hbUCPSubmitTask(task_handle, &infer_sched_param);

  // Step 8: Wait for the task to finish
  hbUCPWaitTaskDone(task_handle, 0);

  // Step 9: Parse the model output; here we take a classification model as an
  // example and fetch the top-1 result
  for (int i = 0; i < output_count; i++) {
    // Output memory allocated with the cacheable attribute must be explicitly
    // flushed (invalidated) before reading.
    hbUCPMemFlush(&(output[i].sysMem[0]), HB_SYS_MEM_CACHE_INVALIDATE);
    /* 1. Determine whether the output data has padding, whether dequantization
     *    is required, and the quantization parameters needed for it.
     * 2. Parse the results. */
  }

  // Release the task
  hbUCPReleaseTask(task_handle);

  // Release the memory
  for (int i = 0; i < input_count; i++) {
    hbUCPFree(&(input[i].sysMem[0]));
  }
  for (int i = 0; i < output_count; i++) {
    hbUCPFree(&(output[i].sysMem[0]));
  }

  // Release the model
  hbDNNRelease(packed_dnn_handle);
  return 0;
}

To keep it simple, part of the model processing in the above sample is described in the form of comments. More details are explained in subsequent documents, such as:

For dynamic input instructions, please refer to section Dynamic Input Instruction.

For memory alignment rules, please refer to section Alignment Rule.

For more comprehensive instructions on the engineering implementation, refer to sections Model Inference API Instruction and Basic Sample User Guide.

Project Compilation and Running

Building on the CMake project configuration described in Project Creation, you can use the following build script:

# Define gcc path for ARM
LINARO_GCC_ROOT=/usr
DIR=$(cd "$(dirname "$0")"; pwd)
export CC=${LINARO_GCC_ROOT}/bin/aarch64-linux-gnu-gcc
export CXX=${LINARO_GCC_ROOT}/bin/aarch64-linux-gnu-g++
rm -rf build_arm
mkdir build_arm
cd build_arm
cmake ${DIR}
make -j8

Environment Deployment assumes you have already installed the required compiler on your development PC, so here you only need to point the compiler settings in the above script at your installation.

Copy the ARM program to the Horizon board to run it. Note that the program's dependencies must be copied to the board as well, and configured in the startup script. For example, our sample program depends on libucp.so, libdnn.so, and other BSP libraries. These dependencies can be found in the OE package under the path ucp_tutorial/deps_aarch64/ and need to be uploaded to the board's runtime environment. We recommend creating a new lib directory under /userdata on the board and transferring the libraries there. Before running the program on the board, specify the dependency library path as follows:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/userdata/lib

Multi-model Control Strategy

In scenarios involving multiple models, each model must complete its inference with limited resources, so the models inevitably compete for computing resources.

To help you control the execution of multiple models, we provide control strategies for model prioritization.

Model Preemption Control

Attention

This feature is only supported on the dev board side and is not supported by the x86 emulator.

The BPU computing unit of the J6 ASIC has no hardware task-preemption feature. Once an inference task is submitted to the BPU and begins computing, it occupies the BPU until it completes, and other tasks must wait in line. If the BPU is occupied by a large model's inference task, higher-priority model inference tasks cannot be executed.

To address this, we added a software feature called BPU Resource Preemption to the Runtime SDK, based on model priorities.

Pay attention to the following:

  • When inference executes on the BPU, the compiled model is expressed as one or more function calls. A function call is the atomic execution unit of the BPU; multiple function-call tasks are queued in the hardware queue and processed in turn. A model inference task is considered done when all of its function calls have executed.
  • Based on the above, it is natural to use the function call as the preemption granularity of BPU model tasks: when the BPU finishes a function call, it can temporarily suspend the current model, switch to another model, and resume the first once the latter is done. However, there are two problems. First, the compiler may merge a model's function calls into one large function call, which cannot be preempted. Second, the execution time of each function call is long or variable, which makes the preemption timing unpredictable and degrades the preemption results.

To solve these two problems, we provide support in both model conversion and system software. The implementation principles and operation methods are as follows:

  • Firstly, if you process the model using the QAT scheme, then at the model compilation stage you need to add the max-time-per-fc option to the extra parameter configurations of the compilation interface to set the execution time limit (in microseconds) for each function call. The default value is 0 (no limit). By setting this option, you can control the execution time of individual large function calls when they run on-board. Suppose the execution time of a function call is 10 ms and max-time-per-fc is set to 500 during model compilation; the function call will then be split into 20 function calls. If you process the model using the PTQ scheme instead, add the max_time_per_fc parameter to the compiler-related parameters (compiler_parameters) in the model's YAML configuration file at the model conversion stage.
  • Secondly, set the hbUCPSchedParam.priority parameter when submitting the inference task. Preemption nesting is supported according to priority. A task with priority less than 254 is a normal task and cannot preempt other tasks. A task with priority equal to 254 is a high-priority preemptive task, which can preempt normal tasks. A task with priority equal to HB_DNN_PRIORITY_PREEMP (255) is an urgent preemptive task, which can preempt both normal and high-priority preemptive tasks.

Suggestions on Application Optimization

Horizon's suggested application optimization strategies include Engineering Task Scheduling and Algorithm Task Integration.

For Engineering Task Scheduling, we recommend some workflow scheduling management tools to fully utilize the parallel-processing capabilities at different task stages.

In general, an application can be divided into three stages: pre-processing, model inference, and post-processing.

A simplified workflow is as follows:

app_optimization_1

After making full use of the workflow management to achieve the parallel execution of different task stages, the ideal task processing workflow can be as follows:

app_optimization_2

For Algorithm Task Integration, we recommend multi-task models.

On one hand, this avoids, to some extent, the difficulty of managing multi-model scheduling.

On the other hand, because a multi-task model can fully share the computation of the backbone, it significantly reduces the total computation at the application level compared to using independent models, and thereby achieves higher overall performance.

Multitasking is also a common application-level optimization strategy within Horizon Robotics and in the business practices of many collaborating customers.

Other Dev Tools

The hrt_model_exec tool is a model execution tool that can evaluate a model's inference performance and retrieve model information directly on the dev board. On one hand, it gives the user a realistic view of the model's actual performance; on the other hand, it reveals the performance ceiling the model can reach, which is useful information for application tuning.

The hrt_model_exec tool provides two functions: model inference (infer) and viewing model information (model_info). For how to use the tool, please refer to hrt_model_exec Tool Introduction.

UCP also provides performance analysis tools to help you locate application performance bottlenecks. Among them, UCP Trace is used to analyze the application's pipeline scheduling behavior, and hrt_ucp_monitor is used to monitor the occupancy of the hardware backends.

Please refer to the sections UCP Trace Instructions and The hrt_ucp_monitor Tool Introduction for how to use these tools.