Part 1: User-Space Perspective
1. What is SPM
SPM (Streaming Performance Monitor) is a hardware mechanism for streaming performance counter collection, provided by the RLC (Run List Controller) unit of AMD GPUs.
Unlike traditional “start -> stop -> read” snapshot-style performance counters, SPM continuously streams microarchitecture counter data (CU occupancy, cache hit rate, VALU/SALU utilization, etc.) into a user-provided memory buffer at hardware clock granularity with near-zero overhead, without stopping the GPU workload.
2. User-Space API Layering

    +------------------------------------------------------------------+
    |  Application / Tool (rocprofiler, custom profiler)               |
    |    hsa_amd_spm_acquire()                                         |
    |    hsa_amd_spm_set_dest_buffer()   <- double-buffer pattern      |
    |    hsa_amd_spm_release()                                         |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    |  HSA Runtime (libhsa-runtime64.so)                               |
    |    hsa_ext_amd.cpp:                                              |
    |      hsa_amd_spm_acquire(agent)                                  |
    |        -> agent->driver().SPMAcquire(node_id)                    |
    |    amd_kfd_driver.cpp:                                           |
    |      KfdDriver::SPMAcquire(node_id)                              |
    |        -> HSAKMT_CALL(hsaKmtSPMAcquire(node_id))                 |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    |  Thunk Layer (libhsakmt.so)                                      |
    |    spm.c:                                                        |
    |      hsaKmtSPMAcquire(PreferredNode)                             |
    |        -> validate_nodeid -> gpu_id                              |
    |        -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=ACQUIRE, gpu_id})    |
    |      hsaKmtSPMSetDestBuffer(node, size, timeout, ...)            |
    |        -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=SET_DEST_BUF, ...})  |
    |      hsaKmtSPMRelease(PreferredNode)                             |
    |        -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=RELEASE, gpu_id})    |
    +------------------------------------------------------------------+
                                |
                                v
               KFD ioctl (0x84) AMDKFD_IOC_RLC_SPM

3. Three API Semantics
| API | KFD Op | Semantics |
|---|---|---|
| hsa_amd_spm_acquire(agent) | SPM_OP_ACQUIRE | Acquire exclusive SPM access on the GPU for the calling process. Only one owner at a time. |
| hsa_amd_spm_set_dest_buffer() | SPM_OP_SET_DEST_BUF | Set/replace the destination buffer. KFD starts DMA-ing RLC SPM ring data into the user buffer. Supports a timeout-based wait for the previous buffer to fill. |
| hsa_amd_spm_release(agent) | SPM_OP_RELEASE | Release exclusive SPM access. Stops data streaming; remaining data is available upon return. |
4. Typical Usage Flow (Double-Buffer Pattern)
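As a rough illustration of the ping-pong pattern in this section, the C sketch below swaps two buffers in a loop. It does not call the real HSA API: `mock_set_dest_buffer()` and `mock_spm` are hypothetical stand-ins for `hsa_amd_spm_set_dest_buffer()` and the hardware, and the mock simply pretends that a fixed number of bytes arrives between swaps. The only behavior taken from the text is that each call hands over a new buffer and reports how much of the previous one was filled, together with a data-loss flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mock of the SPM engine: NOT the real HSA/KFD interface. */
typedef struct {
    uint8_t *current;        /* buffer the mock "hardware" is filling */
    uint32_t current_size;   /* size of that buffer in bytes */
    uint32_t fill_per_swap;  /* bytes the mock produces between swaps */
} mock_spm;

/* Stand-in for set_dest_buffer: arms `next` and returns the number of
 * bytes written to the PREVIOUS buffer. A real call could block up to
 * a timeout; the mock returns immediately. */
static uint32_t mock_set_dest_buffer(mock_spm *spm, uint8_t *next,
                                     uint32_t size, bool *data_loss)
{
    uint32_t copied = 0;

    *data_loss = false;
    if (spm->current != NULL) {
        copied = spm->fill_per_swap;
        if (copied > spm->current_size) {    /* more data than fits */
            copied = spm->current_size;
            *data_loss = true;
        }
        memset(spm->current, 0xAB, copied);  /* pretend these are samples */
    }
    spm->current = next;
    spm->current_size = size;
    return copied;
}

/* Runs `swaps` ping-pong iterations; returns total bytes handed back. */
static uint64_t run_ping_pong(mock_spm *spm, uint8_t *buf_a, uint8_t *buf_b,
                              uint32_t size, int swaps, int *loss_events)
{
    uint8_t *bufs[2] = { buf_a, buf_b };
    uint64_t total = 0;
    bool loss;

    assert(size > 0 && swaps >= 0);
    *loss_events = 0;

    /* Step 3: the first call only arms buf_a; nothing to collect yet. */
    (void)mock_set_dest_buffer(spm, bufs[0], size, &loss);

    for (int i = 1; i <= swaps; i++) {
        /* Step 4: arm the other buffer and learn how full the last one is. */
        uint32_t copied = mock_set_dest_buffer(spm, bufs[i % 2], size, &loss);

        /* Step 5: bufs[(i - 1) % 2] now belongs to the parser. */
        total += copied;
        if (loss)
            (*loss_events)++;
    }
    return total;
}
```

A real consumer would parse the previous buffer at step 5 instead of only counting bytes, and would treat a nonzero data-loss flag as a hint to enlarge the buffers or swap more often.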

    User Process                              KFD / Hardware
        |                                           |
        |  1. hsa_amd_spm_acquire(gpu)              |
        |  -----ioctl(ACQUIRE)------------------->  |
        |                                           |  Lock SPM for this process
        |                                           |  Set spm_pasid = caller PASID
        |                                           |  Program RLC_SPM_MC_CNTL.VMID
        |                                           |
        |  2. Allocate buf_A, buf_B (user memory)   |
        |                                           |
        |  3. set_dest_buffer(buf_A, size, ...)     |
        |  -----ioctl(SET_DEST_BUF)-------------->  |
        |                                           |  Start RLC SPM streaming
        |                                           |  DMA counters -> buf_A
        |                                           |
        |  4. set_dest_buffer(buf_B, size, t=500)   |
        |  -----ioctl(SET_DEST_BUF)-------------->  |
        |  (blocks up to 500ms for buf_A fill)      |  Switch to buf_B
        |  returns: size_copied for buf_A           |
        |                                           |
        |  5. Parse buf_A while HW fills buf_B      |
        |  ... repeat ping-pong ...                 |
        |                                           |
        |  6. hsa_amd_spm_release(gpu)              |
        |  -----ioctl(RELEASE)------------------->  |
        |                                           |  Stop SPM, release lock
        |  7. Parse final buffer                    |

5. KFD ioctl Data Structures
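The `version` field of `kfd_ioctl_spm_buffer_header`, defined in this section, packs a minor number into bits 0-23 and a major number into bits 24-31. The sketch below shows one way to unpack it. The struct is a local mirror that uses `uint32_t` in place of `__u32`, and every helper name is hypothetical; only the bit layout comes from the header comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the header layout (uint32_t stands in for __u32). */
struct spm_buffer_header {
    uint32_t version;        /* bits 0-23: minor, bits 24-31: major */
    uint32_t bytes_copied;   /* per-sub-block data amount */
    uint32_t has_data_loss;  /* per-sub-block data loss indicator */
    uint32_t reserved[5];
};

/* Upper 8 bits hold the major version. */
static uint32_t spm_version_major(uint32_t version)
{
    return version >> 24;
}

/* Lower 24 bits hold the minor version. */
static uint32_t spm_version_minor(uint32_t version)
{
    return version & 0x00FFFFFFu;
}

/* Inverse of the two accessors above; useful for building test data. */
static uint32_t spm_version_pack(uint32_t major, uint32_t minor)
{
    assert(major <= 0xFFu && minor <= 0x00FFFFFFu);
    return (major << 24) | minor;
}

/* Returns 1 when the header's major version equals the one a parser
 * was written for, 0 otherwise. */
static int spm_header_major_matches(const struct spm_buffer_header *h,
                                    uint32_t expected_major)
{
    assert(h != NULL);
    return spm_version_major(h->version) == expected_major;
}
```

Checking only the major number before parsing is a common convention for versioned headers; whether SPM consumers are expected to do so is an assumption here, not something the text states.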

    enum kfd_ioctl_spm_op {
        KFD_IOCTL_SPM_OP_ACQUIRE,       // Acquire exclusive access
        KFD_IOCTL_SPM_OP_RELEASE,       // Release exclusive access
        KFD_IOCTL_SPM_OP_SET_DEST_BUF   // Set/replace destination buffer
    };

    struct kfd_ioctl_spm_args {
        __u64 dest_buf;        // User-space destination buffer address
        __u32 buf_size;        // Buffer size in bytes
        __u32 op;              // Operation (enum kfd_ioctl_spm_op)
        __u32 timeout;         // [in/out] Timeout in ms; updated with remaining
        __u32 gpu_id;          // Target GPU ID
        __u32 bytes_copied;    // [out] Bytes copied to previous buffer
        __u32 has_data_loss;   // [out] Nonzero if ring overflowed
    };

    struct kfd_ioctl_spm_buffer_header {
        __u32 version;         // 0-23: minor, 24-31: major
        __u32 bytes_copied;    // Per-sub-block data amount
        __u32 has_data_loss;   // Per-sub-block data loss indicator
        __u32 reserved[5];
    };

6. Consumer: rocprofiler
rocprofiler (projects/rocprofiler/src/core/session/spm/spm.cpp) is the
primary user-space consumer:

    rocprofiler_spm_session
     |
     +-- startSpm()
     |    +-- hsa_amd_spm_acquire(gpu_agent)
     |    +-- Submit AQL start packet (configure HW counters)
     |    +-- Allocate 3 x 32MB buffers (triple-buffer)
     |    +-- set_dest_buffer(buf[0])
     |    +-- spmBufferSetup() thread: ping-pong set_dest_buffer
     |    +-- spmDataParse() thread: decode counter samples
     |
     +-- stopSpm()
          +-- Submit AQL stop packet
          +-- hsa_amd_spm_release(gpu_agent)

7. User-Space File Inventory
| Layer | File | Role |
|---|---|---|
| HSA Public API | hsa_ext_amd.h | Declare hsa_amd_spm_{acquire,release,set_dest_buffer} |
| HSA Runtime | hsa_ext_amd.cpp | API entry, validate agent, dispatch to driver |
| HSA Runtime | amd_kfd_driver.cpp | KfdDriver::SPM{Acquire,Release,SetDestBuffer} |
| HSA Runtime | thunk_loader.h / thunk_loader.cpp | HSAKMT_DEF / HSAKMT_PFN dynamic symbol load |
| HSA Runtime | hsa_api_trace.cpp | API trace hook registration |
| HSA Runtime | hsa_table_interface.cpp | HSA table dispatch |
| Thunk | spm.c | Three ioctl wrapper functions |
| Thunk | hsakmt.h | Declare hsaKmtSPM* |
| Thunk | kfd_ioctl.h | kfd_ioctl_spm_args, ioctl cmd definition |
| Thunk | libhsakmt.ver | Exported symbol table |
| DXG backend | dxg/spm.cpp | Windows DXG backend stub |
| VirtIO backend | virtio/hsakmt_virtio_topology.c | VirtIO backend implementation |
| Consumer | rocprofiler/spm/spm.cpp | rocprofiler SPM session management |
Part 2: Kernel-Space Perspective
1. SPM Hardware Architecture

    +---GPU-Die-------------------------------------------------------+
    |                                                                 |
    |  +--------+ +--------+ +--------+                               |
    |  |  CU 0  | |  CU 1  | |  CU N  |   Shader Engines              |
    |  +---+----+ +---+----+ +---+----+                               |
    |      |          |          |                                    |
    |      +----------+----------+                                    |
    |                 |                                               |
    |                 |  Performance Counter Muxes                    |
    |                 |                                               |
    |          +------v------+                                        |
    |          |     RLC     |   Run List Controller                  |
    |          |  +-------+  |                                        |
    |          |  |  SPM  |  |   Streaming Performance Monitor        |
    |          |  | Engine|  |                                        |
    |          |  +---+---+  |   - Configurable sample interval       |
    |          |      |      |   - Ring buffer in GPU-visible memory  |
    |          +------+------+   - Per-VMID access control            |
    |                 |                                               |
    |          +------v------+                                        |
    |          | MC / MMHUB  |   Memory Controller                    |
    |          | (VRAM /     |                                        |
    |          |  GART)      |   SPM data -> ring buffer in memory    |
    |          +-------------+                                        |
    +-----------------------------------------------------------------+

Key hardware registers:
| Register | Role |
|---|---|
| RLC_SPM_MC_CNTL | SPM engine master control; RLC_SPM_VMID field selects owning VMID |
| RLC_SPM_RING_RDPTR | SPM ring buffer read pointer |
| RLC_SPM_RING_WRPTR | SPM ring buffer write pointer |
| RLC_SPM_PERFMON_* | Performance counter selection and sample interval |
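The read and write pointers imply the usual ring-buffer arithmetic for working out how much unread data is pending. The sketch below is generic producer/consumer math, not the documented behavior of RLC_SPM_RING_RDPTR and RLC_SPM_RING_WRPTR: it assumes both pointers are byte offsets into a ring of `ring_size` bytes and that equal pointers mean "empty". The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the consumer may read. Assumes rdptr == wrptr means empty and
 * that both values are byte offsets smaller than ring_size. */
static uint32_t ring_bytes_available(uint32_t rdptr, uint32_t wrptr,
                                     uint32_t ring_size)
{
    assert(ring_size > 0 && rdptr < ring_size && wrptr < ring_size);

    if (wrptr >= rdptr)
        return wrptr - rdptr;             /* no wrap-around */
    return ring_size - rdptr + wrptr;     /* writer wrapped past the end */
}

/* Advances a pointer by n bytes, wrapping at ring_size. The 64-bit
 * intermediate avoids overflow when ptr + n exceeds UINT32_MAX. */
static uint32_t ring_advance(uint32_t ptr, uint32_t n, uint32_t ring_size)
{
    assert(ring_size > 0);
    return (uint32_t)(((uint64_t)ptr + n) % ring_size);
}
```

With this convention a ring can hold at most `ring_size - 1` unread bytes, because a completely full ring would be indistinguishable from an empty one.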
2. Kernel Driver Layers

    +------------------------------------------------------------------+
    |  KFD chardev ioctl handler                                       |
    |    AMDKFD_IOC_RLC_SPM (cmd 0x84)                                 |
    |    +-- kfd_ioctl_spm()  [out-of-tree / ROCK kernel]              |
    |        |                                                         |
    |        +-- ACQUIRE:                                              |
    |        |     mutex_lock(spm_mutex)                               |
    |        |     if (dev->spm_pasid != 0) return -EBUSY              |
    |        |     dev->spm_pasid = current->pasid                     |
    |        |     update_spm_vmid(adev, vmid)                         |
    |        |                                                         |
    |        +-- SET_DEST_BUF:                                         |
    |        |     if (dev->spm_pasid != current->pasid) -EINVAL       |
    |        |     configure ring buffer base/size                     |
    |        |     if (timeout) wait_for_completion_timeout()          |
    |        |     copy bytes_copied, has_data_loss to user            |
    |        |                                                         |
    |        +-- RELEASE:                                              |
    |              if (dev->spm_pasid != current->pasid) -EINVAL       |
    |              stop SPM engine                                     |
    |              dev->spm_pasid = 0                                  |
    |              update_spm_vmid(adev, 0xf)  // reset to default     |
    +------------------------------------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    |  amdgpu GFX IP callbacks (per-generation)                        |
    |    gfx_v9_0.c / gfx_v10_0.c / gfx_v11_0.c / gfx_v12_1.c:         |
    |      .update_spm_vmid = gfx_vN_0_update_spm_vmid                 |
    |        -> WREG32(RLC_SPM_MC_CNTL, vmid)                          |
    |      .init_spm_golden = gfx_vN_0_init_spm_golden                 |
    |        -> Program golden settings for SPM engine                 |
    +------------------------------------------------------------------+

3. update_spm_vmid Implementation (GFX9 Example)

    static void gfx_v9_0_update_spm_vmid(struct amdgpu_device *adev, int xcc_id,
                                         struct amdgpu_ring *ring,
                                         unsigned int vmid)
    {
        u32 data;

        amdgpu_gfx_off_ctrl(adev, false);  // Disable GFXOFF power-save

        // Read-modify-write RLC_SPM_MC_CNTL register
        data = RREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL);
        data &= ~RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK;
        data |= vmid << RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT;
        WREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL, data);

        amdgpu_gfx_off_ctrl(adev, true);   // Re-enable GFXOFF
    }

Key points:
- Must disable GFXOFF before accessing RLC registers
- VMID field determines which process context triggers SPM data capture
- SRIOV uses the NO_KIQ variant to avoid KIQ ring deadlock
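The read-modify-write step can be tried outside the kernel with plain integers. In the sketch below, `DEMO_VMID_SHIFT` and `DEMO_VMID_MASK` are hypothetical values chosen for illustration; the real ones are the `RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT` and `RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK` macros from the amdgpu register headers and may differ. The point is only that the other bits of the register survive the update.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout for illustration only; NOT the values
 * from the amdgpu register headers. */
#define DEMO_VMID_SHIFT 4u
#define DEMO_VMID_MASK  (0xFu << DEMO_VMID_SHIFT)

/* Returns `reg` with the VMID field replaced and all other bits kept. */
static uint32_t demo_set_vmid_field(uint32_t reg, uint32_t vmid)
{
    assert(vmid <= (DEMO_VMID_MASK >> DEMO_VMID_SHIFT));

    reg &= ~DEMO_VMID_MASK;                               /* clear old field */
    reg |= (vmid << DEMO_VMID_SHIFT) & DEMO_VMID_MASK;    /* insert new one */
    return reg;
}

/* Extracts the VMID field from a register value. */
static uint32_t demo_get_vmid_field(uint32_t reg)
{
    return (reg & DEMO_VMID_MASK) >> DEMO_VMID_SHIFT;
}
```

Masking the shifted value before OR-ing it in is a defensive extra: an out-of-range VMID then cannot spill into neighboring fields even when assertions are compiled out.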
4. spm_pasid Mutual Exclusion Model

    kfd_dev (per-GPU):
      +-- spm_pasid: unsigned int   // 0 = no owner, nonzero = owning PASID
      +-- spm_mutex: mutex          // protects spm_pasid and HW state

    ACQUIRE:
      lock(spm_mutex)
      if spm_pasid != 0 -> -EBUSY   (another process owns it)
      spm_pasid = caller_pasid
      program HW VMID
      unlock(spm_mutex)

    RELEASE:
      lock(spm_mutex)
      if spm_pasid != caller_pasid -> -EINVAL
      stop HW, spm_pasid = 0
      update_spm_vmid(adev, 0xf)    // reset
      unlock(spm_mutex)

SPM is exclusive per GPU: only one process can hold it on a given GPU at a
time. This is a hardware limitation, since the RLC SPM engine has
only one ring buffer and one VMID slot.
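The ownership rules above can be modeled in a few lines of user-space C. This is a single-threaded sketch under stated assumptions: `demo_spm_dev`, `demo_spm_acquire()` and `demo_spm_release()` are hypothetical names, the `hw_vmid` member stands in for the hardware register, and the mutex that the kernel holds around each operation is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_DEFAULT_VMID 0xFu  /* reset value used on release in the text */

/* Hypothetical per-GPU state; the real driver also holds spm_mutex. */
struct demo_spm_dev {
    uint32_t spm_pasid;  /* 0 = no owner, nonzero = owning PASID */
    uint32_t hw_vmid;    /* stand-in for the VMID programmed in hardware */
};

/* Claims SPM for `pasid`; fails with -EBUSY if anyone already owns it. */
static int demo_spm_acquire(struct demo_spm_dev *dev, uint32_t pasid,
                            uint32_t vmid)
{
    assert(dev != 0 && pasid != 0);

    if (dev->spm_pasid != 0)
        return -EBUSY;

    dev->spm_pasid = pasid;
    dev->hw_vmid = vmid;
    return 0;
}

/* Releases SPM; only the current owner may do so. */
static int demo_spm_release(struct demo_spm_dev *dev, uint32_t pasid)
{
    assert(dev != 0);

    /* pasid 0 is rejected so "no owner" can never match a caller. */
    if (pasid == 0 || dev->spm_pasid != pasid)
        return -EINVAL;

    dev->spm_pasid = 0;
    dev->hw_vmid = DEMO_DEFAULT_VMID;
    return 0;
}
```

As in the pseudocode, an owner that calls acquire a second time also receives `-EBUSY`, because the check does not distinguish the current owner from other processes.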
5. Data Flow: Hardware to User-Space

    Hardware                  Kernel                      User
    --------                  ------                      ----
    CU perf counters --> RLC SPM engine
                            |
                            | (HW auto-sample at configured interval)
                            v
                     RLC SPM Ring Buffer
                     (GPU-visible memory, kernel-managed)
                            |
                            | (KFD copies via CPU or SDMA on SET_DEST_BUF)
                            v
                     kfd_ioctl_spm:
                       wait_for_completion_timeout()
                       copy_to_user(bytes_copied, has_data_loss)
                            |
                            v
                     User dest_buf (user-allocated, CPU-accessible)
                            |
                            v
                     rocprofiler parses SPM samples
                       -> per-counter time-series data

6. Upstream vs. Out-of-Tree Status
| Component | Upstream (drm-next) | ROCK / DKMS |
|---|---|---|
| RLC_SPM_MC_CNTL register defs | Yes | Yes |
| update_spm_vmid() GFX callbacks | Yes | Yes |
| init_spm_golden() golden regs | Yes | Yes |
| spm_pasid in kfd_priv.h | Yes (field only) | Yes |
| AMDKFD_IOC_RLC_SPM ioctl handler | No | Yes |
| kfd_ioctl_spm_args UAPI header | No | Yes (libhsakmt ships its own) |
Note: The SPM ioctl (cmd 0x84) currently exists only in AMD’s out-of-tree ROCK/amdgpu-dkms kernel. It has not been upstreamed to mainline Linux. The upstream kernel only has the low-level hardware register interfaces (update_spm_vmid, init_spm_golden), not the user-space ioctl entry point.
7. SPM vs. Traditional Performance Counters
| Feature | Traditional PMC (Snapshot) | SPM (Streaming) |
|---|---|---|
| Sampling mode | start -> stop -> read | Continuous HW auto-sample |
| Overhead | CP/RLC interaction per read | Near-zero; HW auto-DMA |
| Time resolution | Per-dispatch granularity | Configurable sample period (us-level) |
| Data volume | Tens of counter values per read | Continuous time-series stream |
| Exclusivity | Multiple processes can read different counters | Single-process exclusive |
| Typical use case | rocprof counter mode | rocprof SPM mode, temporal analysis |