GPU energy - NVIDIA NVML - Component

What it does

This metric provider gets the current GPU power draw from the NVIDIA SMI software.

Classname

  • GpuEnergyNvidiaNvmlComponentProvider

Metric Name

  • gpu_energy_nvidia_nvml_component

Prerequisites & Installation

We assume that the NVIDIA graphics card and the associated drivers are installed on your system.

Please resort to NVIDIA Docs for installation if you still need to install.

GMT will try to install the needed C header and development files for the Metrics Provider to compile.

You can trigger this by adding --nvidia-gpu to the install script. If the installation fails, please resort to your OS documentation. e.g.: NVIDIA Linux docs

Running your code on our hosted service

Please check on our Measurement Cluster page which CUDA version is installed. You must use the same CUDA version if you have compiled artifacts in your containers.

Debugging

If you cannot generate any output you should first check if your GPU is supported by NVIDIA CUDA on their list for CUDA support.

Then you should check if the kernel module was corretly loaded with dmesg. Sometimes a message like this appears:

The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:1081)
NVRM: installed in this system is not supported by open
NVRM: nvidia.ko because it does not include the required GPU
NVRM: System Processor (GSP).

In this case you should switch to the legacy kernel module

Check in sudo dmesg if the kernel module could correctly be lodaded and then verify through cat /proc/driver/nvidia/version. See also details on the NVIDIA support page

Input Parameters

  • args
    • -i: interval in milliseconds

By default the measurement interval is 100 ms.

./metric-provider-binary -i 100

Output

This metric provider prints to Stdout a continuous stream of data. The format of the data is as follows:

TIMESTAMP READING

Where:

  • TIMESTAMP: Unix timestamp, in microseconds
  • READING: The energy used by the GPU in milliWatts (Ex: 12230 for 12.23 Watts)
  • CARD NAME: The name of the graphics card as reported by the driver

Any errors are printed to Stderr.

Example:

1748166115636640 17757 "NVIDIA GeForce GTX 1080-0"

How it works

The provider uses the NVIDIA native C libraries to read directly from a syscall.