/usr/local/lib/python3.10/dist-packages/horizon_plugin_pytorch/torch_patch.py:13: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(slice, _slice_flatten, _slice_unflatten)
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
INFO:hat.engine.ddp_trainer:Launch with rank: 7 world_size: None hostname: OE-J6-GPU-3-0-22 dist_url: tcp://localhost:10763 num_devices: 8 num_processes: 8
INFO:hat.engine.ddp_trainer:Launch with rank: 5 world_size: None hostname: OE-J6-GPU-3-0-22 dist_url: tcp://localhost:10763 num_devices: 8 num_processes: 8
INFO:hat.engine.ddp_trainer:Launch with rank: 0 world_size: None hostname: OE-J6-GPU-3-0-22 dist_url: tcp://localhost:10763 num_devices: 8 num_processes: 8
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
W1213 15:51:45.703000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 173 via signal SIGTERM
W1213 15:51:45.704000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 174 via signal SIGTERM
W1213 15:51:45.704000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 175 via signal SIGTERM
W1213 15:51:45.704000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 176 via signal SIGTERM
W1213 15:51:45.704000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 177 via signal SIGTERM
W1213 15:51:45.704000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 179 via signal SIGTERM
W1213 15:51:45.704000 140551106629632 torch/multiprocessing/spawn.py:145] Terminating process 180 via signal SIGTERM
ERROR:__main__:train failed!
-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.10/dist-packages/hat/engine/ddp_trainer.py", line 505, in _main_func
    torch.cuda.set_device(local_rank % num_devices)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 399, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/open_explorer/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 287, in
    raise e
  File "/open_explorer/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 273, in
    train(
  File "/open_explorer/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 254, in train
    launch(
  File "/usr/local/lib/python3.10/dist-packages/hat/engine/ddp_trainer.py", line 426, in launch
    mp.spawn(
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.10/dist-packages/hat/engine/ddp_trainer.py", line 505, in _main_func
    torch.cuda.set_device(local_rank % num_devices)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 399, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.