StereoNet Binocular Depth Estimation Model Training

This tutorial shows you how to use HAT to train a StereoNet model from scratch on the SceneFlow dataset, covering the floating-point, quantized, and fixed-point models.

Dataset Preparation

Before starting to train the model, the first step is to prepare the dataset, which can be downloaded from the SceneFlow dataset page. You also need the file lists for the training and validation splits: SceneFlow_finalpass_train.txt and SceneFlow_finalpass_test.txt, which can be downloaded from here.

After downloading, unzip and organize the folder structure as follows:

data
|-- SceneFlow
    |-- Driving
    |   |-- disparity
    |   |-- frames_finalpass
    |-- FlyingThings3D
    |   |-- disparity
    |   |-- frames_finalpass
    |-- Monkaa
    |   |-- disparity
    |   |-- frames_finalpass
    |-- SceneFlow_finalpass_test.txt
    |-- SceneFlow_finalpass_train.txt

To speed up training, we pack the dataset information files and convert the data into LMDB format. Run the scripts below to perform the conversion:

python3 tools/datasets/sceneflow_packer.py --src-data-dir ${data-dir} --split-name train --pack-type lmdb --num-workers 10 --target-data-dir ${target-data-dir}
python3 tools/datasets/sceneflow_packer.py --src-data-dir ${data-dir} --split-name test --pack-type lmdb --num-workers 10 --target-data-dir ${target-data-dir}

The two commands above pack the training dataset and the validation dataset, respectively. After packaging, the file structure under ${target-data-dir} should be as follows:

${target-data-dir}
|-- train_lmdb
|-- test_lmdb

train_lmdb and test_lmdb are the packaged training and validation datasets; you can now start training the model.

Model Training

Before training starts, you can use the following command to calculate the network's computational cost and parameter count:

python3 tools/calops.py --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

The next step is to start training, which is launched with the scripts below. Before training, confirm that the dataset paths in the config have been switched to the packaged dataset paths.

python3 tools/train.py --stage "float" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py
python3 tools/train.py --stage "calibration" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py
python3 tools/train.py --stage "qat" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

Since the HAT algorithm package uses a registration mechanism, every training task can be started as train.py plus a config file. train.py is a uniform, task-independent training script; the task to train, the datasets to use, and the training-related hyperparameters are all specified in the config file.
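The registration mechanism can be sketched in a few lines. The names below (REGISTRY, register, build) are illustrative, not HAT's actual API: classes register under a string "type", and configs are plain dicts naming that type.

```python
# Minimal sketch of a config-driven registry (illustrative, not HAT's API).
REGISTRY = {}

def register(cls):
    # Register a class under its own name so configs can refer to it.
    REGISTRY[cls.__name__] = cls
    return cls

def build(cfg):
    # Instantiate an object from a {"type": ..., **kwargs} config dict.
    cfg = dict(cfg)                    # don't mutate the caller's config
    cls = REGISTRY[cfg.pop("type")]
    return cls(**cfg)

@register
class SmoothL1Loss:
    def __init__(self, beta=1.0):
        self.beta = beta

# A config dict is all that is needed to build the object.
loss = build(dict(type="SmoothL1Loss", beta=0.5))
```

In the same spirit, train.py builds the whole training pipeline by recursively instantiating the nested dicts found in the config file.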

The parameter after --stage in the commands above can be "float", "calibration", or "qat", which select floating-point training, calibration of quantization parameters, and quantization-aware training, respectively. Each stage depends on the model produced by the previous stage; in particular, quantization-aware training starts from the floating-point model produced by the floating-point training.

Export Fixed-Point Model

Once quantization-aware training is complete, you can export the fixed-point model with the following command:

python3 tools/export_hbir.py --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

Model Verification

After completing the training, we have trained floating-point, quantized, or fixed-point models. Using the same stage mechanism as training, we can run metrics validation on each trained model and obtain the Float, Calibration, QAT, and Quantized metrics, i.e. the floating-point, calibrated, quantization-aware-trained, and fully fixed-point accuracies, respectively.

python3 tools/predict.py --stage "float" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py
python3 tools/predict.py --stage "calibration" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py
python3 tools/predict.py --stage "qat" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

As with training, --stage followed by "float", "calibration", or "qat" selects which trained model (floating-point, calibrated, or quantized) to validate.
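The EndPointError metric reported by validation is the mean absolute disparity error over valid pixels. A pure-Python sketch of the formula follows; HAT's real implementation operates on tensors and may differ in detail, but the masking of invalid ground-truth disparities (outside (0, maxdisp)) is the idea behind use_mask:

```python
def end_point_error(pred, gt, maxdisp=192):
    # Mean |pred - gt| over pixels whose ground-truth disparity is valid,
    # i.e. inside (0, maxdisp); invalid pixels are masked out.
    errs = [abs(p - g) for p, g in zip(pred, gt) if 0 < g < maxdisp]
    return sum(errs) / len(errs) if errs else 0.0

# Only the first pixel is valid: gt=300 exceeds maxdisp, gt=0 is invalid,
# so the EPE here is |10 - 12| = 2.0.
epe = end_point_error([10.0, 50.0, 3.0], [12.0, 300.0, 0.0])
```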

The following command verifies the accuracy of the fixed-point model; note that the hbir model must be exported first:

python3 tools/predict.py --stage "int_infer" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

Model Inference

HAT provides the infer_hbir.py script to visualize the inference results for the fixed-point model:

python3 tools/infer_hbir.py --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py --model-inputs imgl:${img1-path},imgr:${img2-path},baseline:${baseline},f:${f} --save-path ${save_path}
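The baseline and f inputs let the script convert predicted disparity into depth via the standard stereo relation depth = baseline * f / disparity. A minimal sketch of that conversion (the values below are illustrative):

```python
def disparity_to_depth(disparity, baseline, focal_length):
    # Standard pinhole stereo relation: depth = baseline * f / disparity.
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return baseline * focal_length / disparity

# e.g. a 0.54 m baseline, 1000 px focal length, and 27 px disparity
# give a depth of 0.54 * 1000 / 27 = 20 m.
depth = disparity_to_depth(27.0, baseline=0.54, focal_length=1000.0)
```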

Simulation Board Accuracy Verification

In addition to the model validation above, we provide an accuracy-validation mode identical to the on-board environment, which can be run with:

python3 tools/validation_hbir.py --stage "align_bpu" --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

Fixed-point Model Checking and Compilation

Since the quantization training toolchain integrated in HAT mainly targets Horizon's processors, the quantized model must be checked and compiled.

HAT provides a model-checking interface that lets the user define a quantized model and first check whether it can run properly on the BPU.

python3 tools/model_checker.py --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

After the model is trained, you can use the compile_perf_hbir script to compile the quantized model into an HBM file that can run on board. The tool also estimates the model's performance on the BPU.

python3 tools/compile_perf_hbir.py --config configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py

This completes the whole process from data preparation to the generation of a quantized, deployable model.

Training Details

This section explains points to consider during model training, mainly config-related settings.

Model Construction

The network structure of StereoNet is described in the paper and is not repeated in detail here.

We can easily define and modify the model by declaring a dict variable named model in the config file.

from torch import nn

loss_weights = [0.3, 0.3, 0.5, 0.5, 1.0]
maxdisp = 192
use_bn = True
bias = False
bn_kwargs = {}
refine_levels = 4
out_channels = [32, 32, 64, 128, 128, 16]

model = dict(
    type="StereoNet",
    backbone=dict(
        type="StereoNetNeck",
        out_channels=out_channels,
        use_bn=use_bn,
        bias=bias,
        bn_kwargs=bn_kwargs,
        act_type=nn.ReLU(),
    ),
    head=dict(
        type="StereoNetHead",
        maxdisp=maxdisp,
        bn_kwargs=bn_kwargs,
        refine_levels=refine_levels,
    ),
    post_process=dict(
        type="StereoNetPostProcess",
        maxdisp=maxdisp,
    ),
    loss=dict(type="SmoothL1Loss"),
    loss_weights=loss_weights,
)

In addition to the backbone, the model has head, post_process, and loss modules. The backbone extracts image features; the head uses those features to predict disparity values; post_process handles post-processing; and the loss module uses the SmoothL1Loss from the paper as the training loss, with loss_weights giving the weight of each loss term.

Data Augmentation

Like the model definition, data augmentation is implemented by defining data_loader and val_data_loader in the config file, which handle the training and validation sets, respectively. Taking data_loader as an example, the augmentation pipeline uses RandomCrop, ToTensor, and Normalize to increase the diversity of the training data and improve the model's generalization.

Since the final model running on the BPU takes YUV444 image input, while training images are generally in RGB format, HAT provides the BgrToYuv444 transform to convert RGB to YUV444.
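As a per-pixel sketch, an RGB to YUV444 conversion using BT.601 full-range coefficients looks as follows; the exact coefficients used by BgrToYuv444 are not shown here, so treat this as an illustration of the color-space change rather than HAT's implementation:

```python
def rgb_to_yuv444(r, g, b):
    # BT.601 full-range RGB -> YUV; chroma channels are offset by 128 so
    # all three channels stay in [0, 255] for 8-bit input.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.500 * b + 128
    v = 0.500 * r - 0.419 * g - 0.081 * b + 128
    return y, u, v

# A mid-gray pixel maps to roughly (128, 128, 128); the Normalize
# transform with mean=128, std=128 then maps each channel into [-1, 1].
y, u, v = rgb_to_yuv444(128, 128, 128)
```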

data_loader = dict(
    type=torch.utils.data.DataLoader,
    dataset=dict(
        type="SceneFlow",
        data_path="./tmp_data/SceneFlow/train_lmdb",
        transforms=[
            dict(
                type="RandomCrop",
                size=(256, 512),
            ),
            dict(
                type="ToTensor",
                to_yuv=False,
                use_yuv_v2=False,
            ),
            dict(type="BgrToYuv444", rgb_input=True),
            dict(
                type="TorchVisionAdapter",
                interface="Normalize",
                mean=128.0,
                std=128.0,
            ),
        ],
    ),
    sampler=dict(type=torch.utils.data.DistributedSampler),
    batch_size=train_batch_size_per_gpu,
    pin_memory=True,
    shuffle=False,
    num_workers=data_num_workers,
    collate_fn=collate_2d,
)

A loss_collector function is passed to batch_processor to extract the loss for the current batch, as follows:

def loss_collector(outputs: dict):
    return outputs["losses"]

batch_processor = dict(
    type="MultiBatchProcessor",
    need_grad_update=True,
    loss_collector=loss_collector,
)
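Conceptually, the batch processor runs the model on each batch and applies loss_collector to the model's output dict to get the value to backpropagate. An illustrative sketch (not HAT's real MultiBatchProcessor; the fake model and its output keys are stand-ins):

```python
def loss_collector(outputs):
    # Pick the entry of the model's output dict used for backprop.
    return outputs["losses"]

def process_batch(model_fn, batch, need_grad_update=True):
    outputs = model_fn(batch)
    # During validation, need_grad_update is False and no loss is collected.
    return loss_collector(outputs) if need_grad_update else None

# A stand-in "model" returning a dict of outputs, as StereoNet's forward does.
fake_model = lambda batch: {"pred_disp": [0.0] * len(batch), "losses": 0.25}
loss = process_batch(fake_model, batch=[1, 2, 3])
```

This is why the validation-side val_batch_processor sets need_grad_update=False and loss_collector=None.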

The validation-set transforms are simpler, as follows:

val_data_loader = dict(
    type=torch.utils.data.DataLoader,
    dataset=dict(
        type="SceneFlow",
        data_path="./tmp_data/SceneFlow/test_lmdb",
        transforms=[
            dict(
                type="ToTensor",
                to_yuv=False,
                use_yuv_v2=False,
            ),
            dict(type="BgrToYuv444", rgb_input=True),
            dict(
                type="TorchVisionAdapter",
                interface="Normalize",
                mean=128.0,
                std=128.0,
            ),
        ],
    ),
    sampler=dict(type=torch.utils.data.DistributedSampler),
    batch_size=test_batch_size_per_gpu,
    pin_memory=True,
    shuffle=False,
    num_workers=data_num_workers,
    collate_fn=collate_2d,
)

val_batch_processor = dict(
    type="MultiBatchProcessor",
    need_grad_update=False,
    loss_collector=None,
)

Training Strategy

Training the floating-point model on the SceneFlow dataset uses a cosine learning-rate schedule with warmup and applies L2 regularization (weight decay) to the weight parameters.
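The schedule can be sketched as follows: a linear warmup over the first warmup_len epochs, then cosine decay toward zero over the remaining epochs. The warmup_len=10 matches the CosLrUpdater setting in the float_trainer config; the exact decay formula is an assumption of this sketch:

```python
import math

def cosine_warmup_lr(epoch, base_lr, num_epochs, warmup_len=10):
    # Linear warmup: ramp the lr from base_lr/warmup_len up to base_lr.
    if epoch < warmup_len:
        return base_lr * (epoch + 1) / warmup_len
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_len) / (num_epochs - warmup_len)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

lrs = [cosine_warmup_lr(e, base_lr=1e-3, num_epochs=100) for e in range(100)]
```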

The float_trainer, calibration_trainer, qat_trainer, and int_trainer in the configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py file correspond to the training strategies of the floating-point, calibration, quantization-aware, and fixed-point stages, respectively.

The following is an example of the float_trainer training strategy:

float_trainer = dict(
    type="distributed_data_parallel_trainer",
    model=model,
    data_loader=data_loader,
    optimizer=dict(
        type=torch.optim.Adam,
        params={"weight": dict(weight_decay=4e-5)},
        lr=base_lr,
    ),
    batch_processor=train_batch_processor,
    stop_by="epoch",
    num_epochs=num_epochs,
    device=None,
    sync_bn=True,
    callbacks=[
        stat_callback,
        loss_show_callback,
        dict(
            type="CosLrUpdater",
            warmup_by="epoch",
            warmup_len=10,
            step_log_interval=1000,
        ),
        val_callback,
        ckpt_callback,
    ],
    train_metrics=[
        dict(type="LossShow"),
        dict(
            type="EndPointError",
            use_mask=True,
        ),
    ],
    val_metrics=[
        dict(
            type="EndPointError",
            use_mask=True,
        ),
    ],
)

Quantization Training

For the key steps in quantization training, such as preparing the floating-point model, operator substitution, inserting quantize and dequantize nodes, setting quantization parameters, and operator fusion, please read the Quantization-Aware Training (QAT) section. Here we focus on how the quantized model is defined and used in HAT's binocular depth estimation.

Once the floating-point model is ready and the relevant modules support quantization, HAT uses the following code in the training script to convert the floating-point model into its quantized counterpart:

model.fuse_model()
model.set_qconfig()
horizon.quantization.prepare_qat(model, inplace=True)

The overall quantization-training strategy can directly follow the floating-point strategy, but the learning rate and training length need appropriate adjustment. Because a floating-point pre-trained model exists, the learning rate for quantization training can be very small: generally, start from 0.001 or 0.0001, and apply one or two scale=0.1 decays with StepLrUpdater without extending the training time. In addition, weight decay also has some effect on the training results.
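As a sketch, the suggested QAT schedule looks like this; the decay epochs chosen below are illustrative, not values from the config:

```python
def qat_step_lr(epoch, base_lr=1e-4, decay_epochs=(20, 25), scale=0.1):
    # StepLrUpdater-style schedule: multiply the lr by `scale` each time
    # a decay epoch is passed; one or two decays are usually enough.
    lr = base_lr
    for d in decay_epochs:
        if epoch >= d:
            lr *= scale
    return lr

# lr starts at 1e-4, drops to 1e-5 at epoch 20 and to 1e-6 at epoch 25.
schedule = [qat_step_lr(e) for e in (0, 20, 25)]
```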

The quantization training strategy for the StereoNet example model can be found in the configs/disparity_pred/stereonet/stereonet_stereonetneck_sceneflow.py file.