NCU error message ERR_NVGPUCTRPERM: Permission issue with Performance Counters

Environment
Ubuntu 20.04 on a Dell rack server with an NVIDIA A100 GPU

This error popped up while I was using NCU (Nsight Compute) to profile the BERT environment inside an NGC container:

==ERROR== ERR_NVGPUCTRPERM – The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

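The solution is actually already in the official documentation behind that link.
The main thing to watch out for: the official BERT tutorial example runs inside a docker container, but this fix has to be applied on the host, which is where the permission is opened up. What surprised me a little is that after applying the fix on the host, I needed neither a reboot nor a container restart.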
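Before changing anything, it can help to confirm on the host whether the restriction is actually in effect. The sketch below is a minimal helper, not part of the official instructions. It assumes the driver exposes its settings in /proc/driver/nvidia/params and that the relevant line reads `RmProfilingAdminOnly: 1` (restricted) or `RmProfilingAdminOnly: 0` (open to all users). The function name `profiling_restricted` is made up for illustration.

```shell
#!/usr/bin/env bash
# Report whether GPU profiling is restricted to admin users.
# Run this on the HOST, not inside the container.
# Assumption: the driver lists its settings in /proc/driver/nvidia/params
# with a line of the form "RmProfilingAdminOnly: <0|1>".

profiling_restricted() {
    # $1: path to the params file (hypothetical parameter, defaults to the live file)
    local params_file="${1:-/proc/driver/nvidia/params}"
    local value
    # Split each line on ": " and print the value of the RmProfilingAdminOnly key.
    value="$(awk -F': *' '$1 == "RmProfilingAdminOnly" { print $2 }' "$params_file" 2>/dev/null)"
    case "$value" in
        0) echo "unrestricted" ;;
        1) echo "admin-only" ;;
        *) echo "unknown" ;;   # file missing or key not found
    esac
}

# Usage on the host:
#   profiling_restricted
```

If this prints `admin-only`, a non-root `ncu` run is expected to hit ERR_NVGPUCTRPERM. `unknown` usually just means the file or key was not found where this sketch expects it.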

Enable access temporarily
Before you can insert the kernel module with the required key set/unset, you first need to stop the window manager and unload all NVIDIA kernel modules. As root, or with sudo:
Stop the window manager with systemctl isolate multi-user (or your system-specific solution).
Unload modules with modprobe -rf nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia
To allow access for any user, run modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0
To restrict access to admin users (CAP_SYS_ADMIN capability set), run modprobe nvidia NVreg_RestrictProfilingToAdminUsers=1
If desired, restart the window manager with systemctl isolate graphical (or your system-specific solution).
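The steps above can be collected into a small script to run on the host. This is only a sketch under a few assumptions: the `run` helper, the `enable_profiling_temporarily` function and the `DRY_RUN` variable are names I made up, and the commands themselves are copied as-is from the quoted instructions. By default it only prints what it would do; set `DRY_RUN=0` and run it as root (or via sudo) to actually execute the commands.

```shell
#!/usr/bin/env bash
# Sketch of the "enable access temporarily" steps, meant for the HOST.
# DRY_RUN is a hypothetical switch: 1 (default) prints each command,
# 0 executes it. Executing requires root.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"      # show the command instead of running it
    else
        "$@"
    fi
}

enable_profiling_temporarily() {
    # $1: 0 = allow any user to profile, 1 = restrict to admin users
    local restrict="${1:-0}"

    # Stop the window manager (adapt to your system if needed).
    run systemctl isolate multi-user
    # Unload all NVIDIA kernel modules.
    run modprobe -rf nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia
    # Reload the driver with the desired profiling restriction.
    run modprobe nvidia "NVreg_RestrictProfilingToAdminUsers=${restrict}"
    # Optionally bring the window manager back.
    run systemctl isolate graphical
}

# Usage:
#   enable_profiling_temporarily 0              # dry run, prints the commands
#   DRY_RUN=0 enable_profiling_temporarily 0    # really apply (as root)
```

Keeping the dry run as the default is a deliberate choice: unloading the driver on a shared server interrupts anything using the GPU, so it seems safer to review the command list first.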
