AMD GPU SPM (Streaming Performance Monitor) Technical Reference

Part 1: User-Space Perspective

1. What is SPM

SPM (Streaming Performance Monitor) is a hardware streaming performance counter collection mechanism provided by the AMD GPU RLC (Run List Controller) unit.

Unlike traditional “start -> stop -> read” snapshot-style performance counters, SPM continuously streams microarchitecture counter data (CU occupancy, cache hit rate, VALU/SALU utilization, etc.) into a user-provided memory buffer at hardware clock granularity with near-zero overhead, without stopping the GPU workload.

2. User-Space API Layering

    +------------------------------------------------------------------+
    | Application / Tool (rocprofiler, custom profiler)                |
    |   hsa_amd_spm_acquire()                                          |
    |   hsa_amd_spm_set_dest_buffer()   <- double-buffer pattern       |
    |   hsa_amd_spm_release()                                          |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    | HSA Runtime (libhsa-runtime64.so)                                |
    |   hsa_ext_amd.cpp:                                               |
    |     hsa_amd_spm_acquire(agent)                                   |
    |       -> agent->driver().SPMAcquire(node_id)                     |
    |   amd_kfd_driver.cpp:                                            |
    |     KfdDriver::SPMAcquire(node_id)                               |
    |       -> HSAKMT_CALL(hsaKmtSPMAcquire(node_id))                  |
    +---------------------------+--------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    | Thunk Layer (libhsakmt.so)                                       |
    |   spm.c:                                                         |
    |     hsaKmtSPMAcquire(PreferredNode)                              |
    |       -> validate_nodeid -> gpu_id                               |
    |       -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=ACQUIRE, gpu_id})     |
    |     hsaKmtSPMSetDestBuffer(node, size, timeout, ...)             |
    |       -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=SET_DEST_BUF, ...})   |
    |     hsaKmtSPMRelease(PreferredNode)                              |
    |       -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=RELEASE, gpu_id})     |
    +------------------------------------------------------------------+
                                |
                                v
               KFD ioctl (0x84) AMDKFD_IOC_RLC_SPM

3. Three API Semantics

| API | KFD Op | Semantics |
|---|---|---|
| `hsa_amd_spm_acquire(agent)` | `KFD_IOCTL_SPM_OP_ACQUIRE` | Acquire exclusive SPM access on the GPU for the calling process. Only one owner at a time. |
| `hsa_amd_spm_set_dest_buffer()` | `KFD_IOCTL_SPM_OP_SET_DEST_BUF` | Set or replace the destination buffer. KFD starts moving RLC SPM ring data into the user buffer. Supports a timeout-based wait for the previous buffer to fill. |
| `hsa_amd_spm_release(agent)` | `KFD_IOCTL_SPM_OP_RELEASE` | Release exclusive SPM access. Stops data streaming; remaining data is available upon return. |

4. Typical Usage Flow (Double-Buffer Pattern)

    User Process                                  KFD / Hardware
    |                                             |
    | 1. hsa_amd_spm_acquire(gpu)                 |
    | -----ioctl(ACQUIRE)------------------->     |
    |                                             | Lock SPM for this process
    |                                             | Set spm_pasid = caller PASID
    |                                             | Program RLC_SPM_MC_CNTL.VMID
    |                                             |
    | 2. Allocate buf_A, buf_B (user memory)      |
    |                                             |
    | 3. set_dest_buffer(buf_A, size, ...)        |
    | -----ioctl(SET_DEST_BUF)-------------->     |
    |                                             | Start RLC SPM streaming
    |                                             | DMA counters -> buf_A
    |                                             |
    | 4. set_dest_buffer(buf_B, size, t=500)      |
    | -----ioctl(SET_DEST_BUF)-------------->     |
    | (blocks up to 500 ms for buf_A fill)        | Switch to buf_B
    | returns: size_copied for buf_A              |
    |                                             |
    | 5. Parse buf_A while HW fills buf_B         |
    | ... repeat ping-pong ...                    |
    |                                             |
    | 6. hsa_amd_spm_release(gpu)                 |
    | -----ioctl(RELEASE)------------------->     |
    |                                             | Stop SPM, release lock
    | 7. Parse final buffer                       |
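The ping-pong part of this flow (steps 3 to 5) can be sketched in plain C. This is a minimal sketch under stated assumptions, not the rocprofiler or HSA implementation: `spm_set_dest_fn` is a hypothetical stand-in whose parameters loosely mirror `hsa_amd_spm_set_dest_buffer()` without the agent argument, `spm_parse_fn` is a hypothetical consumer hook, and `mock_set_dest` is a fake backend that pretends every armed buffer fills completely. Acquire and release are left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for hsa_amd_spm_set_dest_buffer(). */
typedef int (*spm_set_dest_fn)(void *ctx, size_t size, uint32_t *timeout_ms,
                               uint32_t *size_copied, void *dest,
                               bool *data_loss);

/* Hypothetical consumer hook, called once per completed buffer. */
typedef void (*spm_parse_fn)(void *user, const void *buf, uint32_t bytes,
                             bool data_loss);

/* Alternate between two buffers for `rounds` swaps. Returns the number of
 * bytes handed to the parser, or -1 if the backend reports an error. */
static long spm_ping_pong(spm_set_dest_fn set_dest, void *ctx, void *buf[2],
                          size_t size, unsigned rounds, uint32_t timeout_ms,
                          spm_parse_fn parse, void *user)
{
    uint32_t copied = 0, t = 0;
    bool loss = false;
    long total = 0;
    unsigned cur = 0;

    /* Step 3: arm the first buffer; there is no previous buffer to wait on. */
    if (set_dest(ctx, size, &t, &copied, buf[cur], &loss) != 0)
        return -1;

    for (unsigned i = 0; i < rounds; i++) {
        unsigned next = cur ^ 1u;

        /* Step 4: swap buffers; `copied` and `loss` describe buf[cur]. */
        t = timeout_ms;
        if (set_dest(ctx, size, &t, &copied, buf[next], &loss) != 0)
            return -1;

        /* Step 5: parse the buffer the hardware has just left. */
        parse(user, buf[cur], copied, loss);
        total += copied;
        cur = next;
    }
    return total;
}

/* Mock backend, for illustration only: every armed buffer "fills up". */
struct mock_spm {
    void  *armed;
    size_t armed_size;
};

static int mock_set_dest(void *ctx, size_t size, uint32_t *timeout_ms,
                         uint32_t *size_copied, void *dest, bool *data_loss)
{
    struct mock_spm *m = ctx;

    (void)timeout_ms;
    *data_loss = false;
    *size_copied = 0;
    if (m->armed) {
        memset(m->armed, 0xAB, m->armed_size);
        *size_copied = (uint32_t)m->armed_size;
    }
    m->armed = dest;
    m->armed_size = size;
    return 0;
}

static void count_bytes(void *user, const void *buf, uint32_t bytes,
                        bool data_loss)
{
    (void)buf;
    (void)data_loss;
    *(unsigned *)user += bytes;
}

/* Three swaps over two 64-byte buffers against the mock backend. */
static long spm_ping_pong_demo(void)
{
    unsigned char a[64], b[64];
    void *bufs[2] = { a, b };
    struct mock_spm m = { NULL, 0 };
    unsigned seen = 0;
    long total = spm_ping_pong(mock_set_dest, &m, bufs, sizeof a, 3, 500,
                               count_bytes, &seen);

    return (total == (long)seen) ? total : -1;
}
```

Passing the backend as a function pointer keeps the loop testable without a GPU; a real tool would presumably call the HSA entry point directly and run the parser on a separate thread.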

5. KFD ioctl Data Structures

    enum kfd_ioctl_spm_op {
        KFD_IOCTL_SPM_OP_ACQUIRE,       // Acquire exclusive access
        KFD_IOCTL_SPM_OP_RELEASE,       // Release exclusive access
        KFD_IOCTL_SPM_OP_SET_DEST_BUF   // Set/replace destination buffer
    };

    struct kfd_ioctl_spm_args {
        __u64 dest_buf;        // User-space destination buffer address
        __u32 buf_size;        // Buffer size in bytes
        __u32 op;              // Operation (enum kfd_ioctl_spm_op)
        __u32 timeout;         // [in/out] Timeout in ms; updated with remaining
        __u32 gpu_id;          // Target GPU ID
        __u32 bytes_copied;    // [out] Bytes copied to previous buffer
        __u32 has_data_loss;   // [out] Nonzero if ring overflowed
    };

    struct kfd_ioctl_spm_buffer_header {
        __u32 version;         // 0-23: minor, 24-31: major
        __u32 bytes_copied;    // Per-sub-block data amount
        __u32 has_data_loss;   // Per-sub-block data loss indicator
        __u32 reserved[5];
    };
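The `version` field packs two numbers into one word (bits 0-23 minor, bits 24-31 major). A minimal sketch of unpacking it follows. The struct is a local mirror of the header above using `uint32_t` so that it compiles without kernel headers, and the helper names (`spm_version_major`, `spm_version_minor`, `spm_make_version`) are hypothetical, not part of any AMD header.

```c
#include <stdint.h>

/* Local mirror of kfd_ioctl_spm_buffer_header as listed above. */
struct spm_buffer_header {
    uint32_t version;        /* bits 0-23: minor, bits 24-31: major */
    uint32_t bytes_copied;
    uint32_t has_data_loss;
    uint32_t reserved[5];
};

/* Major version lives in the top 8 bits. */
static uint32_t spm_version_major(uint32_t version)
{
    return version >> 24;
}

/* Minor version lives in the low 24 bits. */
static uint32_t spm_version_minor(uint32_t version)
{
    return version & 0x00FFFFFFu;
}

/* Inverse of the two helpers above, handy for building test data. */
static uint32_t spm_make_version(uint32_t major, uint32_t minor)
{
    return ((major & 0xFFu) << 24) | (minor & 0x00FFFFFFu);
}
```

With eight 32-bit fields the mirrored header is 32 bytes, which a parser can use as a sanity check before trusting `bytes_copied`.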

6. Consumer: rocprofiler

rocprofiler (projects/rocprofiler/src/core/session/spm/spm.cpp) is the
primary user-space consumer:

    rocprofiler_spm_session
     |
     +-- startSpm()
     |    +-- hsa_amd_spm_acquire(gpu_agent)
     |    +-- Submit AQL start packet (configure HW counters)
     |    +-- Allocate 3 x 32MB buffers (triple-buffer)
     |    +-- set_dest_buffer(buf[0])
     |    +-- spmBufferSetup() thread: ping-pong set_dest_buffer
     |    +-- spmDataParse() thread: decode counter samples
     |
     +-- stopSpm()
          +-- Submit AQL stop packet
          +-- hsa_amd_spm_release(gpu_agent)

7. User-Space File Inventory

| Layer | File | Role |
|---|---|---|
| HSA Public API | `hsa_ext_amd.h` | Declare `hsa_amd_spm_{acquire,release,set_dest_buffer}` |
| HSA Runtime | `hsa_ext_amd.cpp` | API entry, validate agent, dispatch to driver |
| HSA Runtime | `amd_kfd_driver.cpp` | `KfdDriver::SPM{Acquire,Release,SetDestBuffer}` |
| HSA Runtime | `thunk_loader.h` / `thunk_loader.cpp` | `HSAKMT_DEF` / `HSAKMT_PFN` dynamic symbol load |
| HSA Runtime | `hsa_api_trace.cpp` | API trace hook registration |
| HSA Runtime | `hsa_table_interface.cpp` | HSA table dispatch |
| Thunk | `spm.c` | Three ioctl wrapper functions |
| Thunk | `hsakmt.h` | Declare `hsaKmtSPM*` |
| Thunk | `kfd_ioctl.h` | `kfd_ioctl_spm_args`, ioctl cmd definition |
| Thunk | `libhsakmt.ver` | Exported symbol table |
| DXG backend | `dxg/spm.cpp` | Windows DXG backend stub |
| VirtIO backend | `virtio/hsakmt_virtio_topology.c` | VirtIO backend implementation |
| Consumer | `rocprofiler/spm/spm.cpp` | rocprofiler SPM session management |

Part 2: Kernel-Space Perspective

1. SPM Hardware Architecture

    +---GPU-Die---------------------------------------------------+
    |                                                             |
    |  +--------+  +--------+  +--------+                         |
    |  |  CU 0  |  |  CU 1  |  |  CU N  |   Shader Engines        |
    |  +---+----+  +---+----+  +---+----+                         |
    |      |           |           |                              |
    |      +------+------+------+------+                          |
    |             |                                               |
    |     Performance Counter Muxes                               |
    |             |                                               |
    |      +------v------+                                        |
    |      |     RLC     |   Run List Controller                  |
    |      |  +-------+  |                                        |
    |      |  |  SPM  |  |   Streaming Performance Monitor        |
    |      |  | Engine|  |                                        |
    |      |  +---+---+  |   - Configurable sample interval       |
    |      |      |      |   - Ring buffer in GPU-visible memory  |
    |      +------+------+   - Per-VMID access control            |
    |             |                                               |
    |      +------v------+                                        |
    |      |  MC / MMHUB |   Memory Controller                    |
    |      |  (VRAM /    |                                        |
    |      |   GART)     |   SPM data -> ring buffer in memory    |
    |      +-------------+                                        |
    +-------------------------------------------------------------+

Key hardware registers:

| Register | Role |
|---|---|
| `RLC_SPM_MC_CNTL` | SPM engine master control; `RLC_SPM_VMID` field selects owning VMID |
| `RLC_SPM_RING_RDPTR` | SPM ring buffer read pointer |
| `RLC_SPM_RING_WRPTR` | SPM ring buffer write pointer |
| `RLC_SPM_PERFMON_*` | Performance counter selection and sample interval |

2. Kernel Driver Layers

    +------------------------------------------------------------------+
    | KFD chardev ioctl handler                                        |
    |   AMDKFD_IOC_RLC_SPM (cmd 0x84)                                  |
    |   +-- kfd_ioctl_spm()  [out-of-tree / ROCK kernel]               |
    |       |                                                          |
    |       +-- ACQUIRE:                                               |
    |       |     mutex_lock(spm_mutex)                                |
    |       |     if (dev->spm_pasid != 0) return -EBUSY               |
    |       |     dev->spm_pasid = current->pasid                      |
    |       |     update_spm_vmid(adev, vmid)                          |
    |       |                                                          |
    |       +-- SET_DEST_BUF:                                          |
    |       |     if (dev->spm_pasid != current->pasid) -EINVAL        |
    |       |     configure ring buffer base/size                      |
    |       |     if (timeout) wait_for_completion_timeout()           |
    |       |     copy bytes_copied, has_data_loss to user             |
    |       |                                                          |
    |       +-- RELEASE:                                               |
    |             if (dev->spm_pasid != current->pasid) -EINVAL        |
    |             stop SPM engine                                      |
    |             dev->spm_pasid = 0                                   |
    |             update_spm_vmid(adev, 0xf)  // reset to default      |
    +------------------------------------------------------------------+
                                |
                                v
    +------------------------------------------------------------------+
    | amdgpu GFX IP callbacks (per-generation)                         |
    |   gfx_v9_0.c / gfx_v10_0.c / gfx_v11_0.c / gfx_v12_1.c:          |
    |     .update_spm_vmid = gfx_vN_0_update_spm_vmid                  |
    |       -> WREG32(RLC_SPM_MC_CNTL, vmid)                           |
    |     .init_spm_golden = gfx_vN_0_init_spm_golden                  |
    |       -> Program golden settings for SPM engine                  |
    +------------------------------------------------------------------+

3. update_spm_vmid Implementation (GFX9 Example)

    static void gfx_v9_0_update_spm_vmid(struct amdgpu_device *adev, int xcc_id,
                                         struct amdgpu_ring *ring,
                                         unsigned int vmid)
    {
        u32 data;

        amdgpu_gfx_off_ctrl(adev, false);   // Disable GFXOFF power-save

        // Read-modify-write RLC_SPM_MC_CNTL register
        data = RREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL);
        data &= ~RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK;
        data |= vmid << RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT;
        WREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL, data);

        amdgpu_gfx_off_ctrl(adev, true);    // Re-enable GFXOFF
    }

Key points:

- GFXOFF must be disabled before accessing RLC registers.
- The VMID field determines which process context triggers SPM data capture.
- SR-IOV uses the `NO_KIQ` register access variant to avoid a KIQ ring deadlock.
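The read-modify-write step can be exercised on its own, without MMIO. The sketch below is illustrative only: `SPM_VMID_SHIFT` and `SPM_VMID_MASK` are assumed values standing in for the real `RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT` and `..._MASK` definitions in the GC register headers, and `spm_mc_cntl_set_vmid` is a hypothetical helper that operates on a plain integer instead of a register.

```c
#include <stdint.h>

/* Assumed field layout, for illustration; the authoritative values are in
 * the per-generation GC register headers. */
#define SPM_VMID_SHIFT 0u
#define SPM_VMID_MASK  0x0000000Fu

/* Return `reg` with its VMID field replaced by `vmid`; all other bits are
 * preserved. The shifted value is masked so stray high bits in `vmid`
 * cannot spill into neighbouring fields. */
static uint32_t spm_mc_cntl_set_vmid(uint32_t reg, uint32_t vmid)
{
    reg &= ~SPM_VMID_MASK;
    reg |= (vmid << SPM_VMID_SHIFT) & SPM_VMID_MASK;
    return reg;
}
```

Under these assumed values, writing VMID 3 into `0xFFFFFFF0` yields `0xFFFFFFF3`: only the low four bits change.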

4. spm_pasid Mutual Exclusion Model

    kfd_dev (per-GPU):
      +-- spm_pasid: unsigned int   // 0 = no owner, nonzero = owning PASID
      +-- spm_mutex: mutex          // protects spm_pasid and HW state

    ACQUIRE:
      lock(spm_mutex)
      if spm_pasid != 0 -> -EBUSY   (another process owns it)
      spm_pasid = caller_pasid
      program HW VMID
      unlock(spm_mutex)

    RELEASE:
      lock(spm_mutex)
      if spm_pasid != caller_pasid -> -EINVAL
      stop HW, spm_pasid = 0
      update_spm_vmid(adev, 0xf)    // reset
      unlock(spm_mutex)

SPM is an exclusive per-GPU resource: each GPU can have only one process
holding SPM at a time. This is a hardware limitation: the RLC SPM engine has
only one ring buffer and one VMID slot.
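The ownership rules above can be modelled in user space. This is a sketch of the model only, not the KFD code: `struct spm_dev`, `spm_acquire`, and `spm_release` are hypothetical names, a pthread mutex stands in for the kernel mutex, and a plain `vmid` field stands in for the hardware register.

```c
#include <errno.h>
#include <pthread.h>

#define SPM_VMID_DEFAULT 0xFu   /* reset value used on release, as above */

struct spm_dev {
    pthread_mutex_t lock;       /* protects spm_pasid and vmid */
    unsigned int    spm_pasid;  /* 0 = no owner, nonzero = owning PASID */
    unsigned int    vmid;       /* stand-in for the hardware VMID field */
};

/* Take ownership for `pasid`; fails with -EBUSY if anyone holds SPM. */
static int spm_acquire(struct spm_dev *d, unsigned int pasid,
                       unsigned int vmid)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->spm_pasid != 0) {
        ret = -EBUSY;
    } else {
        d->spm_pasid = pasid;
        d->vmid = vmid;         /* "program HW VMID" */
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Drop ownership; fails with -EINVAL if `pasid` is not the owner. */
static int spm_release(struct spm_dev *d, unsigned int pasid)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->spm_pasid != pasid) {
        ret = -EINVAL;
    } else {
        d->spm_pasid = 0;
        d->vmid = SPM_VMID_DEFAULT;
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```

Checking and updating the owner under the same lock is what makes the second acquirer see `-EBUSY` instead of racing with the first.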

5. Data Flow: Hardware to User-Space

    Hardware                 Kernel                              User
    --------                 ------                              ----
    CU perf counters --> RLC SPM engine
                              |
                              | (HW auto-sample at configured interval)
                              v
                         RLC SPM Ring Buffer
                         (GPU-visible memory, kernel-managed)
                              |
                              | (KFD copies via CPU or SDMA on SET_DEST_BUF)
                              v
                         kfd_ioctl_spm:
                           wait_for_completion_timeout()
                           copy_to_user(bytes_copied, has_data_loss)
                              |
                              v
                                                    User dest_buf
                                                    (user-allocated, CPU-accessible)
                                                         |
                                                         v
                                                    rocprofiler parses SPM samples
                                                    -> per-counter time-series data
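The ring-to-destination copy in the middle of this flow has to cope with data that wraps past the end of the ring. The driver's actual copy path is not shown here, so the following is a generic ring-buffer drain under stated assumptions: `ring_drain` is a hypothetical helper, pointers are byte offsets smaller than `ring_size`, and `rdptr == wrptr` means the ring is empty.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy up to dst_size unread bytes from the ring into dst and advance
 * *rdptr past them. Returns the number of bytes copied. */
static size_t ring_drain(const uint8_t *ring, size_t ring_size,
                         size_t *rdptr, size_t wrptr,
                         uint8_t *dst, size_t dst_size)
{
    /* Unread bytes between the read and write pointers, modulo the ring. */
    size_t avail = (wrptr + ring_size - *rdptr) % ring_size;
    size_t n = avail < dst_size ? avail : dst_size;

    /* First chunk runs from rdptr to the end of the ring (or fewer bytes). */
    size_t first = ring_size - *rdptr;
    if (first > n)
        first = n;

    memcpy(dst, ring + *rdptr, first);
    /* Second chunk, if any, continues from the start of the ring. */
    memcpy(dst + first, ring, n - first);

    *rdptr = (*rdptr + n) % ring_size;
    return n;
}
```

For an 8-byte ring with the read pointer at 6 and the write pointer at 2, the drain copies bytes 6, 7, 0, 1 in that order and leaves the read pointer at 2.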

6. Upstream vs. Out-of-Tree Status

| Component | Upstream (drm-next) | ROCK / DKMS |
|---|---|---|
| `RLC_SPM_MC_CNTL` register defs | Yes | Yes |
| `update_spm_vmid()` GFX callbacks | Yes | Yes |
| `init_spm_golden()` golden regs | Yes | Yes |
| `spm_pasid` in `kfd_priv.h` | Yes (field only) | Yes |
| `AMDKFD_IOC_RLC_SPM` ioctl handler | No | Yes |
| `kfd_ioctl_spm_args` UAPI header | No | Yes (libhsakmt ships its own) |

Note: The SPM ioctl (cmd 0x84) currently exists only in AMD’s out-of-tree ROCK/amdgpu-dkms kernel. It has not been upstreamed to mainline Linux. The upstream kernel only has low-level hardware register interfaces (`update_spm_vmid`, `init_spm_golden`), not the user-space ioctl entry point.

7. SPM vs. Traditional Performance Counters

| Feature | Traditional PMC (Snapshot) | SPM (Streaming) |
|---|---|---|
| Sampling mode | start -> stop -> read | Continuous HW auto-sample |
| Overhead | CP/RLC interaction per read | Near-zero; HW auto-DMA |
| Time resolution | Per-dispatch granularity | Configurable sample period (microsecond-level) |
| Data volume | Tens of counter values per read | Continuous time-series stream |
| Exclusivity | Multiple processes can read different counters | Single-process exclusive |
| Typical use case | `rocprof` counter mode | `rocprof` SPM mode, temporal analysis |
