AMD GPU SPM (Streaming Performance Monitor) Technical Reference
2026/4/20 12:57:16

Part 1: User-Space Perspective

1. What is SPM

SPM (Streaming Performance Monitor) is a hardware mechanism for streaming performance counter collection, provided by the RLC (Run List Controller) unit of AMD GPUs.

Unlike traditional "start -> stop -> read" snapshot-style performance counters, SPM continuously streams microarchitecture counter data (CU occupancy, cache hit rate, VALU/SALU utilization, etc.) into a user-provided memory buffer at hardware clock granularity with near-zero overhead, without stopping the GPU workload.

2. User-Space API Layering

    +------------------------------------------------------------------+
    |  Application / Tool (rocprofiler, custom profiler)               |
    |    hsa_amd_spm_acquire()                                         |
    |    hsa_amd_spm_set_dest_buffer()   <- double-buffer pattern      |
    |    hsa_amd_spm_release()                                         |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    |  HSA Runtime (libhsa-runtime64.so)                               |
    |    hsa_ext_amd.cpp:                                              |
    |      hsa_amd_spm_acquire(agent)                                  |
    |        -> agent->driver().SPMAcquire(node_id)                    |
    |    amd_kfd_driver.cpp:                                           |
    |      KfdDriver::SPMAcquire(node_id)                              |
    |        -> HSAKMT_CALL(hsaKmtSPMAcquire(node_id))                 |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    |  Thunk Layer (libhsakmt.so)                                      |
    |    spm.c:                                                        |
    |      hsaKmtSPMAcquire(PreferredNode)                             |
    |        -> validate_nodeid -> gpu_id                              |
    |        -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=ACQUIRE, gpu_id})    |
    |      hsaKmtSPMSetDestBuffer(node, size, timeout, ...)            |
    |        -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=SET_DEST_BUF, ...})  |
    |      hsaKmtSPMRelease(PreferredNode)                             |
    |        -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=RELEASE, gpu_id})    |
    +------------------------------------------------------------------+
                                |
                                v
                  KFD ioctl (0x84) AMDKFD_IOC_RLC_SPM

3. Three API Semantics

| API | KFD Op | Semantics |
|-----|--------|-----------|
| hsa_amd_spm_acquire(agent) | SPM_OP_ACQUIRE | Acquire exclusive SPM access on the GPU for the calling process. Only one owner at a time. |
| hsa_amd_spm_set_dest_buffer() | SPM_OP_SET_DEST_BUF | Set or replace the destination buffer. KFD starts DMA-ing RLC SPM ring data into the user buffer. Supports a timeout-based wait for the previous buffer to fill. |
| hsa_amd_spm_release(agent) | SPM_OP_RELEASE | Release exclusive SPM access. Stops data streaming; remaining data is available upon return. |

4. Typical Usage Flow (Double-Buffer Pattern)

    User Process                                 KFD / Hardware
        |                                             |
        |  1. hsa_amd_spm_acquire(gpu)                |
        |  -----ioctl(ACQUIRE)------------------->    |
        |                                             |  Lock SPM for this process
        |                                             |  Set spm_pasid = caller PASID
        |                                             |  Program RLC_SPM_MC_CNTL.VMID
        |                                             |
        |  2. Allocate buf_A, buf_B (user memory)     |
        |                                             |
        |  3. set_dest_buffer(buf_A, size, ...)       |
        |  -----ioctl(SET_DEST_BUF)-------------->    |
        |                                             |  Start RLC SPM streaming
        |                                             |  DMA counters -> buf_A
        |                                             |
        |  4. set_dest_buffer(buf_B, size, t=500)     |
        |  -----ioctl(SET_DEST_BUF)-------------->    |
        |  (blocks up to 500ms for buf_A fill)        |  Switch to buf_B
        |  returns: size_copied for buf_A             |
        |                                             |
        |  5. Parse buf_A while HW fills buf_B        |
        |     ... repeat ping-pong ...                |
        |                                             |
        |  6. hsa_amd_spm_release(gpu)                |
        |  -----ioctl(RELEASE)------------------->    |
        |                                             |  Stop SPM, release lock
        |  7. Parse final buffer                      |
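The ping-pong flow above can be sketched without any GPU by replacing the hardware and KFD with a small in-memory mock. Everything named `mock_spm_*`, `parse_buffer`, and `run_ping_pong` below is hypothetical and exists only for illustration; the real calls are the `hsa_amd_spm_*` functions, whose blocking and timeout behavior this sequential model does not reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SPM hardware plus KFD: it remembers the
 * active destination buffer and how many bytes have been written to it. */
struct mock_spm {
    uint8_t *active;       /* buffer currently being filled (NULL = none) */
    size_t   active_size;  /* capacity of the active buffer in bytes */
    size_t   filled;       /* bytes written into the active buffer */
    uint8_t  next_sample;  /* rolling fake counter value */
};

/* Mock of the hardware producing n bytes; bytes that do not fit in the
 * active buffer are dropped (the model has no backing ring). */
void mock_spm_produce(struct mock_spm *dev, size_t n)
{
    while (n-- > 0 && dev->active != NULL && dev->filled < dev->active_size)
        dev->active[dev->filled++] = dev->next_sample++;
}

/* Mock of set_dest_buffer: install `next` as the new destination and
 * return how many bytes landed in the previous buffer. */
size_t mock_spm_set_dest_buffer(struct mock_spm *dev, uint8_t *next, size_t size)
{
    size_t copied = dev->filled;

    dev->active = next;
    dev->active_size = size;
    dev->filled = 0;
    return copied;
}

/* Consumer side: sum the bytes of a completed buffer (stand-in for parsing). */
unsigned long parse_buffer(const uint8_t *buf, size_t len)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Ping-pong loop: swap buffers, then parse the one that was just retired. */
unsigned long run_ping_pong(size_t rounds, size_t bytes_per_round)
{
    uint8_t buf_a[64], buf_b[64];
    uint8_t *bufs[2] = { buf_a, buf_b };
    struct mock_spm dev = { 0 };
    unsigned long total = 0;

    mock_spm_set_dest_buffer(&dev, bufs[0], sizeof(buf_a));      /* step 3 */
    for (size_t r = 0; r < rounds; r++) {
        mock_spm_produce(&dev, bytes_per_round);                 /* HW fills */
        uint8_t *done = bufs[r % 2];
        size_t copied = mock_spm_set_dest_buffer(&dev, bufs[(r + 1) % 2],
                                                 sizeof(buf_b)); /* step 4 */
        total += parse_buffer(done, copied);                     /* step 5 */
    }
    mock_spm_set_dest_buffer(&dev, NULL, 0);  /* stand-in for release */
    return total;
}
```

In a real profiler the parse step would run while the hardware fills the other buffer; the mock runs the two steps one after the other only to keep the example deterministic.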

5. KFD ioctl Data Structures

    enum kfd_ioctl_spm_op {
        KFD_IOCTL_SPM_OP_ACQUIRE,       // Acquire exclusive access
        KFD_IOCTL_SPM_OP_RELEASE,       // Release exclusive access
        KFD_IOCTL_SPM_OP_SET_DEST_BUF   // Set/replace destination buffer
    };

    struct kfd_ioctl_spm_args {
        __u64 dest_buf;        // User-space destination buffer address
        __u32 buf_size;        // Buffer size in bytes
        __u32 op;              // Operation (enum kfd_ioctl_spm_op)
        __u32 timeout;         // [in/out] Timeout in ms; updated with remaining
        __u32 gpu_id;          // Target GPU ID
        __u32 bytes_copied;    // [out] Bytes copied to previous buffer
        __u32 has_data_loss;   // [out] Nonzero if ring overflowed
    };

    struct kfd_ioctl_spm_buffer_header {
        __u32 version;         // 0-23: minor, 24-31: major
        __u32 bytes_copied;    // Per-sub-block data amount
        __u32 has_data_loss;   // Per-sub-block data loss indicator
        __u32 reserved[5];
    };
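As a rough illustration of how these structures might be handled, the sketch below mirrors them locally with `<stdint.h>` types so it compiles without the out-of-tree header. The helpers `spm_make_set_dest_buf`, `spm_version_major`, and `spm_version_minor` are hypothetical names introduced here; the version split follows the bit ranges given in the comment on `version` above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the definitions listed above (assumption: same field
 * order and widths as the out-of-tree kfd_ioctl.h). */
enum kfd_ioctl_spm_op {
    KFD_IOCTL_SPM_OP_ACQUIRE,
    KFD_IOCTL_SPM_OP_RELEASE,
    KFD_IOCTL_SPM_OP_SET_DEST_BUF
};

struct kfd_ioctl_spm_args {
    uint64_t dest_buf;
    uint32_t buf_size;
    uint32_t op;
    uint32_t timeout;
    uint32_t gpu_id;
    uint32_t bytes_copied;
    uint32_t has_data_loss;
};

/* Hypothetical helper: build the argument block for a SET_DEST_BUF call.
 * Output fields (bytes_copied, has_data_loss) start at zero. */
struct kfd_ioctl_spm_args
spm_make_set_dest_buf(uint32_t gpu_id, void *buf, uint32_t size,
                      uint32_t timeout_ms)
{
    struct kfd_ioctl_spm_args args;

    assert(buf != NULL || size == 0);
    memset(&args, 0, sizeof(args));
    args.op = KFD_IOCTL_SPM_OP_SET_DEST_BUF;
    args.gpu_id = gpu_id;
    args.dest_buf = (uint64_t)(uintptr_t)buf;
    args.buf_size = size;
    args.timeout = timeout_ms;
    return args;
}

/* Buffer header version: bits 24-31 hold the major number. */
uint32_t spm_version_major(uint32_t version)
{
    return version >> 24;
}

/* Buffer header version: bits 0-23 hold the minor number. */
uint32_t spm_version_minor(uint32_t version)
{
    return version & 0x00FFFFFFu;
}
```

A caller would pass the filled structure to the ioctl and then read `timeout`, `bytes_copied`, and `has_data_loss` back from it; that step is left out because it needs the out-of-tree driver.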

6. Consumer: rocprofiler

rocprofiler (projects/rocprofiler/src/core/session/spm/spm.cpp) is the
primary user-space consumer:

    rocprofiler_spm_session
      |
      +-- startSpm()
      |     +-- hsa_amd_spm_acquire(gpu_agent)
      |     +-- Submit AQL start packet (configure HW counters)
      |     +-- Allocate 3 x 32MB buffers (triple-buffer)
      |     +-- set_dest_buffer(buf[0])
      |     +-- spmBufferSetup() thread: ping-pong set_dest_buffer
      |     +-- spmDataParse() thread: decode counter samples
      |
      +-- stopSpm()
            +-- Submit AQL stop packet
            +-- hsa_amd_spm_release(gpu_agent)

7. User-Space File Inventory

| Layer | File | Role |
|-------|------|------|
| HSA Public API | hsa_ext_amd.h | Declare hsa_amd_spm_{acquire,release,set_dest_buffer} |
| HSA Runtime | hsa_ext_amd.cpp | API entry, validate agent, dispatch to driver |
| HSA Runtime | amd_kfd_driver.cpp | KfdDriver::SPM{Acquire,Release,SetDestBuffer} |
| HSA Runtime | thunk_loader.h/thunk_loader.cpp | HSAKMT_DEF/HSAKMT_PFN dynamic symbol load |
| HSA Runtime | hsa_api_trace.cpp | API trace hook registration |
| HSA Runtime | hsa_table_interface.cpp | HSA table dispatch |
| Thunk | spm.c | Three ioctl wrapper functions |
| Thunk | hsakmt.h | Declare hsaKmtSPM* |
| Thunk | kfd_ioctl.h | kfd_ioctl_spm_args, ioctl cmd definition |
| Thunk | libhsakmt.ver | Exported symbol table |
| DXG backend | dxg/spm.cpp | Windows DXG backend stub |
| VirtIO backend | virtio/hsakmt_virtio_topology.c | VirtIO backend implementation |
| Consumer | rocprofiler/spm/spm.cpp | rocprofiler SPM session management |

Part 2: Kernel-Space Perspective

1. SPM Hardware Architecture

    +---GPU-Die--------------------------------------------------------+
    |                                                                  |
    |  +--------+  +--------+  +--------+                              |
    |  | CU 0   |  | CU 1   |  | CU N   |   Shader Engines             |
    |  +---+----+  +---+----+  +---+----+                              |
    |      |           |           |                                   |
    |    +------+------+------+------+                                 |
    |                  |                                               |
    |      Performance Counter Muxes                                   |
    |                  |                                               |
    |           +------v------+                                        |
    |           |     RLC     |   Run List Controller                  |
    |           |  +-------+  |                                        |
    |           |  |  SPM  |  |   Streaming Performance Monitor        |
    |           |  | Engine|  |                                        |
    |           |  +---+---+  |   - Configurable sample interval       |
    |           |      |      |   - Ring buffer in GPU-visible memory  |
    |           +------+------+   - Per-VMID access control            |
    |                  |                                               |
    |           +------v------+                                        |
    |           | MC / MMHUB  |   Memory Controller                    |
    |           | (VRAM /     |                                        |
    |           |  GART)      |   SPM data -> ring buffer in memory    |
    |           +-------------+                                        |
    |                                                                  |
    +------------------------------------------------------------------+

Key hardware registers:

| Register | Role |
|----------|------|
| RLC_SPM_MC_CNTL | SPM engine master control; RLC_SPM_VMID field selects owning VMID |
| RLC_SPM_RING_RDPTR | SPM ring buffer read pointer |
| RLC_SPM_RING_WRPTR | SPM ring buffer write pointer |
| RLC_SPM_PERFMON_* | Performance counter selection and sample interval |
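The read and write pointers imply ordinary ring-buffer arithmetic. The sketch below shows that arithmetic in isolation, assuming both pointers are byte offsets into a ring of `ring_size` bytes and that equal pointers mean "empty"; the units and conventions the hardware actually uses are not stated here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes available to read, given byte-offset read and write pointers.
 * Assumption: rdptr == wrptr means the ring is empty. */
uint32_t spm_ring_bytes_available(uint32_t rdptr, uint32_t wrptr,
                                  uint32_t ring_size)
{
    assert(ring_size > 0 && rdptr < ring_size && wrptr < ring_size);
    if (wrptr >= rdptr)
        return wrptr - rdptr;           /* no wraparound */
    return ring_size - rdptr + wrptr;   /* writer has wrapped past the end */
}

/* New read pointer after consuming n bytes, wrapping at ring_size. */
uint32_t spm_ring_advance(uint32_t rdptr, uint32_t n, uint32_t ring_size)
{
    assert(ring_size > 0 && rdptr < ring_size && n <= ring_size);
    return (uint32_t)(((uint64_t)rdptr + n) % ring_size);
}
```

For a 64-byte ring with the reader at offset 48 and the writer at offset 16, 32 bytes are pending and consuming them moves the reader to offset 16.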

2. Kernel Driver Layers

    +------------------------------------------------------------------+
    |  KFD chardev ioctl handler                                       |
    |    AMDKFD_IOC_RLC_SPM (cmd 0x84)                                 |
    |    +-- kfd_ioctl_spm()  [out-of-tree / ROCK kernel]              |
    |        |                                                         |
    |        +-- ACQUIRE:                                              |
    |        |     mutex_lock(spm_mutex)                               |
    |        |     if (dev->spm_pasid != 0) return -EBUSY              |
    |        |     dev->spm_pasid = current->pasid                     |
    |        |     update_spm_vmid(adev, vmid)                         |
    |        |                                                         |
    |        +-- SET_DEST_BUF:                                         |
    |        |     if (dev->spm_pasid != current->pasid) -EINVAL       |
    |        |     configure ring buffer base/size                     |
    |        |     if (timeout) wait_for_completion_timeout()          |
    |        |     copy bytes_copied, has_data_loss to user            |
    |        |                                                         |
    |        +-- RELEASE:                                              |
    |              if (dev->spm_pasid != current->pasid) -EINVAL       |
    |              stop SPM engine                                     |
    |              dev->spm_pasid = 0                                  |
    |              update_spm_vmid(adev, 0xf)  // reset to default     |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    |  amdgpu GFX IP callbacks (per-generation)                        |
    |    gfx_v9_0.c / gfx_v10_0.c / gfx_v11_0.c / gfx_v12_1.c:         |
    |      .update_spm_vmid = gfx_vN_0_update_spm_vmid                 |
    |        -> WREG32(RLC_SPM_MC_CNTL, vmid)                          |
    |      .init_spm_golden = gfx_vN_0_init_spm_golden                 |
    |        -> Program golden settings for SPM engine                 |
    +------------------------------------------------------------------+

3. update_spm_vmid Implementation (GFX9 Example)

    static void gfx_v9_0_update_spm_vmid(struct amdgpu_device *adev, int xcc_id,
                                         struct amdgpu_ring *ring,
                                         unsigned int vmid)
    {
        u32 data;

        amdgpu_gfx_off_ctrl(adev, false);   // Disable GFXOFF power-save

        // Read-modify-write RLC_SPM_MC_CNTL register
        data = RREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL);
        data &= ~RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK;
        data |= vmid << RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT;
        WREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL, data);

        amdgpu_gfx_off_ctrl(adev, true);    // Re-enable GFXOFF
    }

Key points:

  • Must disable GFXOFF before accessing RLC registers
  • VMID field determines which process context triggers SPM data capture
  • SRIOV uses the NO_KIQ variant to avoid KIQ ring deadlock
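The read-modify-write step can be tried out on a plain 32-bit value, without any register access. The mask and shift below (a 4-bit field at bit 0) are an illustrative assumption, not values taken from the register headers, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative field layout (assumption): 4-bit VMID field at bit 0. */
#define SPM_VMID_SHIFT 0u
#define SPM_VMID_MASK  (0xFu << SPM_VMID_SHIFT)

/* Return `reg` with only the VMID field replaced; other bits are kept. */
uint32_t spm_mc_cntl_set_vmid(uint32_t reg, uint32_t vmid)
{
    assert(vmid <= (SPM_VMID_MASK >> SPM_VMID_SHIFT));
    reg &= ~SPM_VMID_MASK;                          /* clear the old field */
    reg |= (vmid << SPM_VMID_SHIFT) & SPM_VMID_MASK; /* insert the new one */
    return reg;
}

/* Extract the VMID field from a register value. */
uint32_t spm_mc_cntl_get_vmid(uint32_t reg)
{
    return (reg & SPM_VMID_MASK) >> SPM_VMID_SHIFT;
}
```

Clearing the field before OR-ing in the new value is what keeps a stale VMID bit from surviving the update, which is why the driver code masks first rather than writing the VMID directly.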

4. spm_pasid Mutual Exclusion Model

    kfd_dev (per-GPU):
      +-- spm_pasid: unsigned int   // 0 = no owner, nonzero = owning PASID
      +-- spm_mutex: mutex          // protects spm_pasid and HW state

    ACQUIRE:
      lock(spm_mutex)
      if spm_pasid != 0 -> -EBUSY   (another process owns it)
      spm_pasid = caller_pasid
      program HW VMID
      unlock(spm_mutex)

    RELEASE:
      lock(spm_mutex)
      if spm_pasid != caller_pasid -> -EINVAL
      stop HW, spm_pasid = 0
      update_spm_vmid(adev, 0xf)    // reset
      unlock(spm_mutex)

SPM is an exclusive, per-GPU resource: each GPU can have only one process
holding SPM at a time. This is a hardware limitation, because the RLC SPM
engine has only one ring buffer and one VMID slot.
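The ownership rules can be modeled in user space with a mutex-protected struct. This is a minimal sketch under stated assumptions: `mock_kfd_dev`, `mock_spm_acquire`, and `mock_spm_release` are hypothetical names, a pthread mutex stands in for the kernel mutex, and treating PASID 0 as invalid is a choice made for the model rather than something the driver is known to do.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

#define SPM_VMID_DEFAULT 0xfu   /* reset value used in the text above */

/* Hypothetical user-space model of the per-GPU SPM ownership state. */
struct mock_kfd_dev {
    pthread_mutex_t spm_mutex;  /* protects spm_pasid and spm_vmid */
    uint32_t spm_pasid;         /* 0 = no owner */
    uint32_t spm_vmid;          /* stand-in for the hardware VMID field */
};

#define MOCK_KFD_DEV_INIT { PTHREAD_MUTEX_INITIALIZER, 0u, SPM_VMID_DEFAULT }

/* Claim SPM for `pasid`; fails with -EBUSY if another owner exists. */
int mock_spm_acquire(struct mock_kfd_dev *dev, uint32_t pasid, uint32_t vmid)
{
    int ret = 0;

    if (pasid == 0)
        return -EINVAL;  /* model choice: 0 is reserved for "no owner" */

    pthread_mutex_lock(&dev->spm_mutex);
    if (dev->spm_pasid != 0) {
        ret = -EBUSY;
    } else {
        dev->spm_pasid = pasid;
        dev->spm_vmid = vmid;   /* "program HW VMID" */
    }
    pthread_mutex_unlock(&dev->spm_mutex);
    return ret;
}

/* Give SPM back; only the current owner may release it. */
int mock_spm_release(struct mock_kfd_dev *dev, uint32_t pasid)
{
    int ret = 0;

    pthread_mutex_lock(&dev->spm_mutex);
    if (pasid == 0 || dev->spm_pasid != pasid) {
        ret = -EINVAL;
    } else {
        dev->spm_pasid = 0;
        dev->spm_vmid = SPM_VMID_DEFAULT;   /* reset, as in RELEASE above */
    }
    pthread_mutex_unlock(&dev->spm_mutex);
    return ret;
}
```

With this model, a second process that calls acquire while the first still owns SPM gets `-EBUSY`, and a release attempted by a non-owner gets `-EINVAL` and leaves the owner's VMID in place.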

5. Data Flow: Hardware to User-Space

    Hardware                 Kernel                          User
    --------                 ------                          ----
    CU perf counters --> RLC SPM engine
                             |
                             |  (HW auto-sample at configured interval)
                             v
                         RLC SPM Ring Buffer
                         (GPU-visible memory, kernel-managed)
                             |
                             |  (KFD copies via CPU or SDMA on SET_DEST_BUF)
                             v
                         kfd_ioctl_spm:
                           wait_for_completion_timeout()
                           copy_to_user(bytes_copied, has_data_loss)
                             |
                             v
                         User dest_buf (user-allocated, CPU-accessible)
                             |
                             v
                         rocprofiler parses SPM samples
                           -> per-counter time-series data

6. Upstream vs. Out-of-Tree Status

| Component | Upstream (drm-next) | ROCK / DKMS |
|-----------|---------------------|-------------|
| RLC_SPM_MC_CNTL register defs | Yes | Yes |
| update_spm_vmid() GFX callbacks | Yes | Yes |
| init_spm_golden() golden regs | Yes | Yes |
| spm_pasid in kfd_priv.h | Yes (field only) | Yes |
| AMDKFD_IOC_RLC_SPM ioctl handler | No | Yes |
| kfd_ioctl_spm_args UAPI header | No | Yes (libhsakmt ships its own) |

Note: The SPM ioctl (cmd 0x84) currently exists only in AMD's out-of-tree ROCK/amdgpu-dkms kernel. It has not been upstreamed to mainline Linux. The upstream kernel only has the low-level hardware register interfaces (update_spm_vmid, init_spm_golden), not the user-space ioctl entry point.

7. SPM vs. Traditional Performance Counters

| Feature | Traditional PMC (Snapshot) | SPM (Streaming) |
|---------|----------------------------|-----------------|
| Sampling mode | start -> stop -> read | Continuous HW auto-sample |
| Overhead | CP/RLC interaction per read | Near-zero; HW auto-DMA |
| Time resolution | Per-dispatch granularity | Configurable sample period (us-level) |
| Data volume | Tens of counter values per read | Continuous time-series stream |
| Exclusivity | Multiple processes can read different counters | Single-process exclusive |
| Typical use case | rocprof counter mode | rocprof SPM mode, temporal analysis |
