MapTR Training

The MapTR reference algorithm is developed on top of the Horizon Algorithm Toolkit (HAT), Horizon's in-house deep learning algorithm toolkit. The training configs for MapTR are located under the configs/map/ path. The following sections take configs/map/maptrv2_resnet50_bevformer_nuscenes.py as an example to describe how to configure and train a MapTR model.

Training Process

If you simply want to train the MapTR model, you can read this section first.

As with other tasks, HAT performs all training and evaluation tasks in the form of tools + config.

After preparing the original dataset, follow the process below to complete the whole training workflow.

Dataset Preparation

Here we take the nuScenes dataset as an example; it can be downloaded from https://www.nuscenes.org/nuscenes . To speed up training, we pack the original jpg-format dataset into lmdb format. Running the following script completes the conversion.

python3 tools/datasets/nuscenes_packer.py --src-data-dir WORKSPACE/datasets/nuscenes/ --pack-type lmdb --target-data-dir . --version v1.0-trainval --split-name val
python3 tools/datasets/nuscenes_packer.py --src-data-dir WORKSPACE/datasets/nuscenes/ --pack-type lmdb --target-data-dir . --version v1.0-trainval --split-name train

The two commands above pack the validation dataset and the training dataset, respectively. After packing is complete, the file structure in the data directory should look as follows.

tmp_data
|-- nuscenes
    |-- metas
    |-- v1.0-trainval
    |-- train_lmdb
    |-- val_lmdb

train_lmdb and val_lmdb are the packed training and validation datasets, and are the datasets that the network eventually reads. metas contains the map information used during model training and validation.
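After packing, a quick way to confirm the layout is a small directory check. The helper below is a hypothetical sketch (not part of HAT) that only verifies the expected entries from the tree above exist:

```python
import os

# Expected entries under tmp_data/nuscenes after packing (from the tree above).
EXPECTED = ("metas", "v1.0-trainval", "train_lmdb", "val_lmdb")

def check_packed_layout(root: str) -> list:
    """Return the expected entries that are missing under <root>/nuscenes."""
    base = os.path.join(root, "nuscenes")
    return [name for name in EXPECTED if not os.path.isdir(os.path.join(base, name))]
```

An empty return value means all expected directories are in place.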

Model Training

Before starting training, you can first compute the model's number of operations and parameters with the following command:

python3 tools/calops.py --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

The next step is to start training, which can be done with the following scripts. Before training, make sure that the dataset path specified in the config has been changed to the path of the packed dataset.

python3 tools/train.py --stage float --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py
python3 tools/train.py --stage calibration --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py
python3 tools/train.py --stage qat --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

Since the HAT algorithm package uses a registration mechanism, every training task can be started as train.py plus a config file. train.py is a uniform training script that is independent of the task; the task to train, the datasets to use, and the training-related hyperparameters are all specified in the config file.

The parameter after --stage in the commands above can be float, calibration, or qat, which trains the float model, the calibration model, and the qat model, respectively. Calibration training depends on the float model checkpoint, and qat training depends on the calibration model checkpoint. For details, please read the Quantization Aware Training (QAT) section.
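The float → calibration → qat dependency chain can be sketched as a small driver that launches the three stages in order. This is a hypothetical wrapper around tools/train.py for illustration only; HAT does not ship such a script:

```python
import subprocess

# Each stage consumes the checkpoint produced by the previous one.
STAGES = ("float", "calibration", "qat")

def build_train_commands(config: str):
    """Build the tools/train.py command line for each training stage, in order."""
    return [
        ["python3", "tools/train.py", "--stage", stage, "--config", config]
        for stage in STAGES
    ]

def run_all_stages(config: str):
    """Run the stages sequentially; a failing stage aborts the chain."""
    for cmd in build_train_commands(config):
        subprocess.run(cmd, check=True)
```

Running the stages out of order fails because the expected upstream checkpoint does not exist yet.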

Export Quantization Model

Once qat training is complete, you can export the quantized model with the following command:

python3 tools/export_hbir.py --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

Model Validation

After completing the training, we have the trained float, calibration, and qat models. Using the same approach as training, we can run metric validation on each trained model and obtain the metrics of the float, calibration, and qat models, respectively.

python3 tools/predict.py --stage float --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py
python3 tools/predict.py --stage calibration --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py
python3 tools/predict.py --stage qat --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

As with training, --stage followed by float, calibration, or qat validates the corresponding model.

The following command verifies the accuracy of the quantized model; note that the hbir model must be exported first:

python3 tools/predict.py --stage int_infer --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

Model Inference

HAT provides the infer_hbir.py script to visualize the inference results for the quantization model:

python3 tools/infer_hbir.py --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py --model-inputs img:${img-path} --save-path ${save_path}

Simulation Board Accuracy Verification

In addition to the model validation above, we provide an accuracy validation method identical to the on-board environment, which can be run as follows:

python3 tools/validation_hbir.py --stage align_bpu --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

Quantization Model Checking and Compilation

Since the quantization training toolchain integrated in HAT mainly targets Horizon's processors, the quantized model must be checked and compiled.

HAT provides an interface for model checking, which allows the user to define a quantized model and then first check whether it can run properly on the BPU.

python3 tools/model_checker.py --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

After the model is trained, you can use the compile_perf_hbir script to compile the quantized model into an HBM file that can run on board.

python3 tools/compile_perf_hbir.py --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

The above is the whole workflow from data preparation to the generation of a quantized, deployable model.

Training Details

This section explains some considerations for model training, mainly the config-related settings.

Model Construction

model = dict(
    type="MapTRv2",
    out_indices=(-1,),
    backbone=dict(
        type="ResNet50",
        num_classes=1000,
        bn_kwargs=bn_kwargs,
        include_top=False,
    ),
    neck=dict(
        type="FPN",
        in_strides=[32],
        in_channels=[2048],
        out_strides=[32],
        out_channels=[_dim_],
        bn_kwargs=dict(eps=1e-5, momentum=0.1),
    ),
    view_transformer=dict(
        type="SingleBevFormerViewTransformer",
        bev_h=bev_h_,
        bev_w=bev_w_,
        pc_range=point_cloud_range,
        num_points_in_pillar=4,
        embed_dims=_dim_,
        queue_length=queue_length,
        in_indices=(-1,),
        single_bev=single_bev,
        use_lidar2img=use_lidar2img,
        positional_encoding=dict(
            type="LearnedPositionalEncoding",
            num_feats=_pos_dim_,
            row_num_embed=bev_h_,
            col_num_embed=bev_w_,
        ),
        encoder=dict(
            type="SingleBEVFormerEncoder",
            num_layers=1,
            return_intermediate=False,
            bev_h=bev_h_,
            bev_w=bev_w_,
            embed_dims=_dim_,
            encoder_layer=dict(
                type="SingleBEVFormerEncoderLayer",
                embed_dims=_dim_,
                selfattention=dict(
                    type="HorizonMSDeformableAttention",
                    embed_dims=_dim_,
                    num_levels=1,
                    view_gird_in=4,
                    view_gird_out=4,
                    batch_first=True,
                    feats_size=[[bev_w_, bev_h_]],
                ),
                crossattention=dict(
                    type="HorizonSpatialCrossAttention",
                    view_num=8,
                    deformable_attention=dict(
                        type="HorizonMSDeformableAttention3D",
                        embed_dims=_dim_,
                        num_points=8,
                        num_levels=_num_levels_,
                        view_gird_in=160,
                        view_gird_out=100,
                        feats_size=[[25, 15]],
                    ),
                    embed_dims=_dim_,
                ),
                dropout=0.1,
            ),
        ),
    ),
    bev_decoders=[
        dict(
            type="MapTRPerceptionDecoderv2",
            bev_h=bev_h_,
            bev_w=bev_w_,
            embed_dims=_dim_,
            pc_range=point_cloud_range,
            queue_length=queue_length,
            numcam=6,
            num_vec_one2one=50,
            num_vec_one2many=300,
            k_one2many=6,
            num_pts_per_vec=fixed_ptsnum_per_pred_line,
            num_pts_per_gt_vec=fixed_ptsnum_per_gt_line,
            dir_interval=1,
            query_embed_type="instance_pts",
            transform_method="minmax",
            gt_shift_pts_pattern="v2",
            num_classes=num_map_classes,
            code_size=2,
            aux_seg=aux_seg_cfg,
            decoder=dict(
                type="MapTRDecoder",
                num_layers=6,
                return_intermediate=True,
                decoder_layer=dict(
                    type="DecoupledDetrTransformerDecoderLayer",
                    embed_dims=_dim_,
                    crossattention=dict(
                        type="HorizonMSDeformableAttention",
                        embed_dims=_dim_,
                        num_levels=1,
                        view_gird_out=4,
                        view_gird_in=4,
                        feats_size=[[bev_w_, bev_h_]],
                    ),
                    dropout=0.1,
                ),
            ),
            criterion=dict(
                type="MapTRCriterion",
                dir_interval=1,
                num_classes=num_map_classes,
                code_weights=[1.0, 1.0, 1.0, 1.0],
                sync_cls_avg_factor=True,
                pc_range=point_cloud_range,
                num_pts_per_vec=fixed_ptsnum_per_pred_line,  # one bbox
                num_pts_per_gt_vec=fixed_ptsnum_per_gt_line,
                gt_shift_pts_pattern="v2",
                aux_seg=aux_seg_cfg,
                assigner=dict(
                    type="MapTRAssigner",
                    cls_cost=dict(type="FocalLossCost", weight=2.0),
                    pts_cost=dict(type="OrderedPtsL1Cost", weight=5),
                    pc_range=point_cloud_range,
                ),
                loss_cls=dict(
                    type="FocalLoss",
                    loss_name="cls",
                    num_classes=num_map_classes + 1,
                    alpha=0.25,
                    gamma=2.0,
                    loss_weight=2.0,
                    reduction="mean",
                ),
                loss_pts=dict(type="PtsL1Loss", loss_weight=5.0),
                loss_dir=dict(type="PtsDirCosLoss", loss_weight=0.005),
                loss_seg=dict(type="SimpleLoss", pos_weight=4.0, loss_weight=1.0),
                loss_pv_seg=dict(type="SimpleLoss", pos_weight=1.0, loss_weight=2.0),
            ),
            post_process=dict(
                type="MapTRPostProcess",
                post_center_range=post_center_range,
                pc_range=point_cloud_range,
                max_num=50,
                num_classes=num_map_classes,
            ),
        ),
    ],
)

Here, type under model indicates the name of the defined model, and the remaining fields define the model's other components. The advantage of defining the model this way is that we can easily swap in the structures we want. For example, to train a model with a ResNet18 backbone, we only need to replace backbone under model.
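The registration mechanism behind these config dicts can be illustrated with a minimal sketch (a simplified illustration; HAT's actual registry and builder have more features): a registry maps the string in type to a class, and a builder recursively instantiates nested dicts.

```python
REGISTRY = {}

def register(cls):
    """Map a class name to the class so configs can refer to it by string."""
    REGISTRY[cls.__name__] = cls
    return cls

def build(cfg: dict):
    """Recursively instantiate a config dict: 'type' names the class, the rest are kwargs."""
    kwargs = {
        k: build(v) if isinstance(v, dict) and "type" in v else v
        for k, v in cfg.items()
        if k != "type"
    }
    return REGISTRY[cfg["type"]](**kwargs)

# Toy components standing in for real backbones/models.
@register
class ResNet18:
    def __init__(self, include_top=False):
        self.include_top = include_top

@register
class Detector:
    def __init__(self, backbone):
        self.backbone = backbone

# Swapping the backbone is just editing the inner dict:
model = build(dict(type="Detector", backbone=dict(type="ResNet18", include_top=False)))
```

Under this scheme, replacing ResNet50 with ResNet18 never touches the model code, only the config.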

Data Augmentation

Like the model definition, the data augmentation pipeline is implemented by defining two dicts, data_loader and val_data_loader, in the config file as follows, corresponding to the processing flows of the training set and the validation set.

data_loader = dict(
    type=torch.utils.data.DataLoader,
    dataset=dict(
        type="NuscenesMapDataset",
        data_path=os.path.join(data_rootdir, "train_lmdb"),
        map_path=meta_rootdir,
        pc_range=point_cloud_range,
        test_mode=False,
        bev_size=(bev_h_, bev_w_),
        fixed_ptsnum_per_line=fixed_ptsnum_per_gt_line,
        padding_value=-10000,
        map_classes=map_classes,
        queue_length=queue_length,
        aux_seg=aux_seg_cfg,
        with_bev_bboxes=False,
        with_ego_bboxes=False,
        with_bev_mask=False,
        use_lidar_gt=use_lidar_gt,
        transforms=[
            dict(type="MultiViewsImgResize", size=(450, 800)),
            dict(
                type="MultiViewsImgTransformWrapper",
                transforms=[
                    dict(
                        type="TorchVisionAdapter",
                        interface="ColorJitter",
                        brightness=0.4,
                        contrast=0.4,
                        saturation=0.4,
                        hue=0.1,
                    ),
                    dict(type="PILToNumpy"),
                    dict(
                        type="GridMask",
                        use_h=True,
                        use_w=True,
                        rotate=1,
                        offset=False,
                        ratio=0.5,
                        mode=1,
                        prob=0.7,
                    ),
                    dict(type="ToTensor", to_yuv=False),
                    dict(type="Pad", divisor=32),
                    dict(type="BgrToYuv444", rgb_input=True),
                    dict(type="Normalize", mean=128.0, std=128.0),
                ],
            ),
        ],
    ),
    sampler=dict(type=torch.utils.data.DistributedSampler),
    batch_size=batch_size_per_gpu,
    shuffle=False,
    num_workers=2,
    pin_memory=True,
    collate_fn=collate_nuscenes_sequencev2,
)

val_data_loader = dict(
    type=torch.utils.data.DataLoader,
    dataset=dict(
        type="NuscenesMapDataset",
        data_path=os.path.join(data_rootdir, "val_lmdb"),
        map_path=meta_rootdir,
        pc_range=point_cloud_range,
        test_mode=True,
        bev_size=(bev_h_, bev_w_),
        fixed_ptsnum_per_line=fixed_ptsnum_per_gt_line,
        padding_value=-10000,
        map_classes=map_classes,
        queue_length=test_queue_length,
        with_bev_bboxes=False,
        with_ego_bboxes=False,
        with_bev_mask=False,
        use_lidar_gt=use_lidar_gt,
        transforms=[
            dict(type="MultiViewsImgResize", size=(450, 800)),
            dict(
                type="MultiViewsImgTransformWrapper",
                transforms=[
                    dict(type="PILToTensor"),
                    dict(type="Pad", divisor=32),
                    dict(type="BgrToYuv444", rgb_input=True),
                    dict(type="Normalize", mean=128.0, std=128.0),
                ],
            ),
        ],
    ),
    sampler=None,
    batch_size=1,
    shuffle=False,
    num_workers=2,
    pin_memory=True,
    collate_fn=collate_nuscenes_sequencev2,
)

Here type directly uses PyTorch's torch.utils.data.DataLoader interface, which batches batch_size images together. The main field to pay attention to is dataset: data_path is the path of the packed lmdb dataset, and map_path is the map-information path mentioned in the dataset preparation part. transforms contains a series of data augmentations; you can also add your own augmentation by inserting a new dict into transforms.
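Inserting a custom augmentation is then just list surgery on the config. The helper below is a hypothetical sketch (not a HAT API), using transform names from the config above; the inserted "MyAug" dict stands in for whatever registered transform you want to add:

```python
def insert_transform(transforms: list, new_transform: dict, before_type: str) -> list:
    """Return a copy of the transform list with new_transform inserted
    immediately before the first entry whose type matches before_type."""
    out = list(transforms)
    for i, t in enumerate(out):
        if t.get("type") == before_type:
            out.insert(i, new_transform)
            return out
    out.append(new_transform)  # fall back to appending at the end
    return out

transforms = [
    dict(type="MultiViewsImgResize", size=(450, 800)),
    dict(type="MultiViewsImgTransformWrapper", transforms=[]),
]
transforms = insert_transform(
    transforms, dict(type="MyAug"), before_type="MultiViewsImgTransformWrapper"
)
```

Order matters: transforms run in list order, so an augmentation that works on PIL images must come before the conversion steps inside the wrapper.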

Training Strategies

To train a high-accuracy model, a good training strategy is essential. For each training task, the corresponding strategy is likewise defined in the config file, as can be seen from the float_trainer variable.

float_trainer = dict(
    type="distributed_data_parallel_trainer",
    model=model,
    model_convert_pipeline=dict(
        type="ModelConvertPipeline",
        converters=[
            dict(
                type="LoadCheckpoint",
                checkpoint_path=(
                    "./tmp_pretrained_models/resnet50_imagenet/float-checkpoint-best.pth.tar"  # noqa: E501
                ),
                allow_miss=True,
                ignore_extra=True,
                verbose=False,
            ),
        ],
    ),
    data_loader=data_loader,
    optimizer=dict(
        type=torch.optim.AdamW,
        params={
            "backbone": dict(lr_mult=0.1),
        },
        lr=float_lr,
        weight_decay=0.01,
    ),
    batch_processor=batch_processor,
    device=None,
    num_epochs=24,
    callbacks=[
        stat_callback,
        loss_show_update,
        grad_callback,
        dict(
            type="CosineAnnealingLrUpdater",
            warmup_len=500,
            warmup_by="step",
            warmup_lr_ratio=1.0 / 3,
            step_log_interval=500,
            stop_lr=1e-3 * float_lr,
        ),
        val_callback,
        ckpt_callback,
    ],
    sync_bn=True,
    train_metrics=dict(
        type="LossShow",
    ),
    val_metrics=[
        val_map_metric,
    ],
)

float_trainer defines the overall training approach, including the use of distributed_data_parallel_trainer, the number of training epochs, and the choice of optimizer. The callbacks reflect the finer strategies used during training and the operations the user wants to run, including the learning-rate schedule (CosineAnnealingLrUpdater), validation during training (Validation), and model saving (Checkpoint). If there are other operations you want to run during training, you can add them as dicts in the same way.
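The CosineAnnealingLrUpdater schedule can be approximated by the following standalone sketch of the math (an illustration, not HAT's implementation): a linear warmup by step from base_lr * warmup_lr_ratio up to base_lr, followed by cosine decay down to stop_lr.

```python
import math

def lr_at_step(step, total_steps, base_lr, stop_lr,
               warmup_len=500, warmup_lr_ratio=1.0 / 3):
    """Learning rate at a given optimizer step: linear warmup, then cosine annealing."""
    if step < warmup_len:
        start = base_lr * warmup_lr_ratio
        return start + (base_lr - start) * step / warmup_len
    # Cosine decay from base_lr down to stop_lr over the remaining steps.
    progress = (step - warmup_len) / max(1, total_steps - warmup_len)
    return stop_lr + 0.5 * (base_lr - stop_lr) * (1 + math.cos(math.pi * progress))
```

With the config values above, stop_lr = 1e-3 * float_lr, so the final learning rate is a thousandth of the base rate.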

Note

If reproducible accuracy is needed, it is best not to modify the training strategy in the config; otherwise, unexpected training behavior may occur.

Quantization Model Training

With float_trainer we can obtain a high-accuracy float model, and we can then train the corresponding calibration and qat models. The corresponding training strategies are defined as follows.

calibration_trainer = dict(
    type="Calibrator",
    model=model,
    model_convert_pipeline=dict(
        type="ModelConvertPipeline",
        qat_mode=qat_mode,
        converters=[
            dict(
                type="LoadCheckpoint",
                checkpoint_path=os.path.join(
                    ckpt_dir, "float-checkpoint-best.pth.tar"
                ),
                ignore_extra=True,
                verbose=True,
                check_hash=False,
            ),
            dict(
                type="Float2Calibration",
                convert_mode=convert_mode,
                example_data_loader=calibration_example_data_loader,
                qconfig_setter=cali_qconfig_setter,
            ),
        ],
    ),
    data_loader=calibration_data_loader,
    batch_processor=calibration_batch_processor,
    num_steps=calibration_step,
    device=None,
    callbacks=[
        stat_callback,
        calibration_ckpt_callback,
        calibration_val_callback,
    ],
    val_metrics=[
        val_map_metric,
    ],
    log_interval=calibration_step / 10,
)

qat_trainer = dict(
    type="distributed_data_parallel_trainer",
    model=model,
    model_convert_pipeline=dict(
        type="ModelConvertPipeline",
        qat_mode=qat_mode,
        converters=[
            dict(
                type="Float2QAT",
                convert_mode=convert_mode,
                example_data_loader=copy.deepcopy(
                    calibration_example_data_loader
                ),
                qconfig_setter=qat_qconfig_setter,
            ),
            dict(
                type="LoadCheckpoint",
                checkpoint_path=os.path.join(
                    ckpt_dir, "calibration-checkpoint-best.pth.tar"
                ),
                ignore_extra=True,
                verbose=True,
            ),
        ],
    ),
    data_loader=data_loader,
    optimizer=dict(
        type=torch.optim.AdamW,
        lr=qat_lr,
        weight_decay=0.01,
    ),
    batch_processor=qat_batch_processor,
    device=None,
    num_epochs=2,
    callbacks=[
        stat_callback,
        loss_show_update,
        grad_callback,
        qat_val_callback,
        qat_ckpt_callback,
    ],
    sync_bn=True,
    train_metrics=dict(
        type="LossShow",
    ),
    val_metrics=[
        val_map_metric,
    ],
)

Quantization training is in fact finetuning based on float training, so during quantization training the qat learning rate is much smaller than the float learning rate, and the number of training epochs is greatly reduced. Most importantly, when defining the model, pretrained needs to be set to the path of the previous stage's model checkpoint.
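The allow_miss/ignore_extra behavior of LoadCheckpoint can be sketched as a plain state-dict filter (a simplified illustration, not HAT's code): entries present in both the checkpoint and the model are kept, and the two flags decide whether mismatches are tolerated or raise an error.

```python
def filter_state_dict(checkpoint: dict, model_keys,
                      allow_miss=True, ignore_extra=True) -> dict:
    """Keep only checkpoint entries whose keys exist in the model.

    allow_miss:   tolerate model keys absent from the checkpoint.
    ignore_extra: tolerate checkpoint keys absent from the model.
    """
    model_keys = set(model_keys)
    extra = set(checkpoint) - model_keys
    missing = model_keys - set(checkpoint)
    if extra and not ignore_extra:
        raise KeyError(f"unexpected keys in checkpoint: {sorted(extra)}")
    if missing and not allow_miss:
        raise KeyError(f"keys missing from checkpoint: {sorted(missing)}")
    return {k: v for k, v in checkpoint.items() if k in model_keys}
```

This is why the float trainer can load an ImageNet backbone checkpoint into a full MapTR model: the backbone weights match, and everything else is tolerated.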

The Settings of Qconfig

Before quantization training begins, we need to set the model's qconfig. The float model is converted into a quantized model according to the qconfig settings below.

cali_qconfig_setter = (
    sensitive_op_calibration_8bit_weight_16bit_act_qconfig_setter(
        pts_table,
        topk=30,
        ratio=None,
    ),
    default_calibration_qconfig_setter,
)
qat_qconfig_setter = (
    sensitive_op_qat_8bit_weight_16bit_fixed_act_qconfig_setter(
        pts_table,
        topk=30,
        ratio=None,
    ),
    default_qat_fixed_act_qconfig_setter,
)

cali_qconfig_setter and qat_qconfig_setter are the qconfig settings for the calibration model and the qat model, respectively. For qconfig settings such as default templates and sensitivity templates, please read the Qconfig in Detail part of the Quantization Aware Training (QAT) section.
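The topk=30 argument above keeps the 30 most quantization-sensitive operators on 16-bit activations. The selection itself can be sketched as follows (a hypothetical helper for illustration; HAT reads the real sensitivity table produced by the analysis tool):

```python
def pick_int16_ops(sensitivity_table, topk=30, ratio=None):
    """Pick operators to run with int16 activations.

    sensitivity_table: iterable of (op_name, sensitivity) pairs, higher = more sensitive.
    topk/ratio: keep a fixed count, or (if ratio is given) a fraction of all ops.
    """
    ranked = sorted(sensitivity_table, key=lambda kv: kv[1], reverse=True)
    n = topk if ratio is None else int(len(ranked) * ratio)
    return [name for name, _ in ranked[:n]]
```

Everything not selected falls through to the default 8-bit template in the second element of the setter tuple.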

Quantization-Sensitive Operator Ranking

During quantization training, certain quantization-sensitive operators need to be set to int16 to meet the model's quantization accuracy requirements. The ranking of quantization-sensitive operators can be obtained by running the following command.

python3 tools/quant_analysis.py --config configs/map/maptrv2_resnet50_bevformer_nuscenes.py

The key steps can be found in the Accuracy Tuning Tool Guide part of the Quantization Aware Training (QAT) section.