NCU error message ERR_NVGPUCTRPERM: Permission issue with Performance Counters
Environment
Ubuntu 20.04 on a Dell rack server with an A100 GPU
This error appeared when I used NCU to profile the BERT environment inside an NGC container:
==ERROR== ERR_NVGPUCTRPERM – The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
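The fix is actually already in the official documentation linked in the error message.
The main thing to watch out for: the official BERT tutorial example runs inside a docker container, but the permission has to be opened up on the host, not in the container. What surprised me a little was that after applying the fix on the host, I needed neither a reboot nor a container restart.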
Enable access temporarily
Before you can insert the kernel module with the required key set/unset, you first need to stop the window manager and unload all NVIDIA kernel modules. As root, or with sudo:
Stop the window manager with systemctl isolate multi-user (or your system-specific solution).
Unload modules with modprobe -rf nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia
To allow access for any user, run modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0
To restrict access to admin users (CAP_SYS_ADMIN capability set), run modprobe nvidia NVreg_RestrictProfilingToAdminUsers=1
If desired, restart the window manager with systemctl isolate graphical (or your system-specific solution).
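After working through the steps above on the host, it can help to confirm which mode the driver is actually in before going back to NCU. The following is a minimal sketch, not something from the NVIDIA instructions: it assumes the loaded driver reports the setting as a `RmProfilingAdminOnly: <0|1>` line in /proc/driver/nvidia/params, and the function name `profiling_restriction` is a hypothetical helper made up for this example.

```shell
#!/bin/sh
# Report whether GPU performance counters are restricted to admin users.
# Assumption: the driver lists the setting as "RmProfilingAdminOnly: <0|1>"
# in /proc/driver/nvidia/params. The optional first argument overrides the
# path, which is mainly useful for trying the function on a sample file.
profiling_restriction() {
    params_file="${1:-/proc/driver/nvidia/params}"

    # Driver not loaded, or file not readable in this environment.
    if [ ! -r "$params_file" ]; then
        echo "unknown"
        return 0
    fi

    # Split each line on ": " and pick the value of the key we care about.
    value=$(awk -F': *' '$1 == "RmProfilingAdminOnly" { print $2 }' "$params_file")

    case "$value" in
        0) echo "all-users" ;;   # matches NVreg_RestrictProfilingToAdminUsers=0
        1) echo "admin-only" ;;  # matches NVreg_RestrictProfilingToAdminUsers=1
        *) echo "unknown" ;;
    esac
}
```

Calling `profiling_restriction` with no argument on the host before and after reloading the module should show whether the modprobe option took effect. If it prints `unknown`, the driver is probably not loaded or does not expose the key under this name.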