CANN/EasyAsc DSL Reduction Constraints
2026/6/3 21:48:30

Reduction Constraints

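[Free download link] cannbot-skills: CANNBot is a family of agents that improve development efficiency for CANN; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills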

Read this file when a kernel needs row-wise reductions, normalization, softmax, or quantization statistics inside a @vf stage.

Goal

Choose the right reduction idiom so that:

  • the reduction stays in registers when possible to reduce UB traffic
  • multi-pass flows keep intermediate state persistent across tile passes
  • streaming stats follow the correct update order for numerical stability

1. Row sum normalization

Single-pass pattern:

  • sum = RegList.cadd()
  • dup(sum), then divide row registers
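
The two steps above can be modelled on the host in plain Python. This is only a reference sketch of the arithmetic, not DSL code: `row_sum_normalize` is a hypothetical helper, `sum(row)` stands in for `RegList.cadd()`, and the list replication stands in for `dup(sum)`.

```python
def row_sum_normalize(rows):
    """Divide every element of each row by that row's sum (host-side model)."""
    out = []
    for row in rows:
        total = sum(row)                    # models sum = RegList.cadd()
        broadcast = [total] * len(row)      # models dup(sum): scalar copied to every lane
        out.append([x / s for x, s in zip(row, broadcast)])
    return out


normalized = row_sum_normalize([[1.0, 3.0], [2.0, 2.0, 4.0]])
```

A reference like this is mainly useful for checking kernel output row by row.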

Files to study:

  • agent/example/kernels/a5/matmul_rowwise_norm.py

2. Tiled softmax (three-pass)

Use when the full S dimension does not fit in one tile.

  • pass 1: store float logits and track row max with cmax() + dup()
  • pass 2: reload logits, subtract duplicated row max, exp(), accumulate row sum, and store exponentials
  • pass 3: reload exponentials and divide by duplicated row sum before the final cast
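
The three passes can be checked against a small host-side model. The sketch below is plain Python for a single row, not DSL code; `tiled_softmax_row` and its `tile` parameter are hypothetical names, and the Python lists stand in for the logits and exponentials that the kernel stores and reloads between passes.

```python
import math


def tiled_softmax_row(logits, tile):
    """Three-pass softmax over one row, visiting `tile` elements at a time."""
    n = len(logits)
    stored = list(logits)            # pass-1 output: float logits kept for reload
    row_max = -math.inf              # host-side model only
    # Pass 1: track the row max tile by tile.
    for start in range(0, n, tile):
        row_max = max(row_max, max(stored[start:start + tile]))
    # Pass 2: reload logits, subtract the row max, exp(), accumulate the row sum.
    row_sum = 0.0
    exps = [0.0] * n
    for start in range(0, n, tile):
        for i in range(start, min(start + tile, n)):
            exps[i] = math.exp(stored[i] - row_max)
            row_sum += exps[i]
    # Pass 3: reload exponentials and divide by the row sum.
    return [e / row_sum for e in exps]


probs = tiled_softmax_row([1.0, 2.0, 3.0, 4.0, 5.0], tile=2)
```

The result should not depend on the tile size, which makes a convenient sanity check when changing tiling.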

3. Streaming softmax-style MLA stats

Single-pass online update per tile:

  • curr_max = max(prev_max, qk_tile.amax(-1))
  • p_tile = exp(qk_tile - curr_max)
  • row_sum = prev_sum * exp(prev_max - curr_max) + p_tile.sum(-1)
  • if the tile must be materialized in fp8, update row_sum from the float tile first and cast only after the float reduction is complete
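
The update order above can be exercised with a plain-Python model of one row. This is a sketch under assumptions: `stream_row_stats` is a hypothetical helper, each tile is a list of floats, and the fp8 cast is not modelled at all.

```python
import math


def stream_row_stats(tiles):
    """Online max and sum over the tiles of one row, in the order given above."""
    prev_max, prev_sum = -math.inf, 0.0     # host-side starting state
    for tile in tiles:
        curr_max = max(prev_max, max(tile))
        p_tile = [math.exp(x - curr_max) for x in tile]
        # Rescale the old sum to the new max before adding this tile's sum.
        row_sum = prev_sum * math.exp(prev_max - curr_max) + sum(p_tile)
        prev_max, prev_sum = curr_max, row_sum
    return prev_max, prev_sum


final_max, final_sum = stream_row_stats([[1.0, 2.0], [5.0, 0.5], [3.0]])
```

On the first tile `exp(prev_max - curr_max)` evaluates to 0.0 in this model, so the empty running sum contributes nothing.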

Streaming MLA with final normalized output of shape [B,H,Dn]:

  • rescale the running numerator by exp(prev_max - curr_max) before adding the current block contribution
  • keep the numerator in float across all blocks
  • apply output /= row_sum only once after the loop
  • if the value path intentionally uses p.half().float(), update row_sum from the float exp(...) tile before the cast, then cast only for the downstream cube consume path
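
A host-side sketch of these rules for a single query row follows. It is plain Python rather than DSL code; `stream_attention_row` is a hypothetical helper, `score_blocks[i][j]` is assumed to be a logit and `value_blocks[i][j]` a list of Dn floats, and the half-precision cast on the value path is not modelled.

```python
import math


def stream_attention_row(score_blocks, value_blocks):
    """Streaming softmax-weighted sum for one row, normalized once at the end."""
    dn = len(value_blocks[0][0])
    run_max, row_sum = -math.inf, 0.0
    numer = [0.0] * dn                          # float numerator kept across all blocks
    for scores, values in zip(score_blocks, value_blocks):
        curr_max = max(run_max, max(scores))
        scale = math.exp(run_max - curr_max)    # 0.0 on the first block
        p = [math.exp(s - curr_max) for s in scores]
        row_sum = row_sum * scale + sum(p)
        numer = [acc * scale for acc in numer]  # rescale BEFORE adding the new block
        for w, v in zip(p, values):
            numer = [acc + w * x for acc, x in zip(numer, v)]
        run_max = curr_max
    return [acc / row_sum for acc in numer]     # output /= row_sum, once, after the loop


scores = [[0.0, 1.0], [3.0, 2.0]]
values = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [4.0, -1.0]]]
out = stream_attention_row(scores, values)
```

The result should match a one-shot softmax over all scores followed by a weighted average of the values.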

Files to study:

  • agent/example/kernels/a5/test_mla_entire.py
  • agent/example/kernels/a2/flash_attn_full.py

4. Absmax scaling and quantization

  • abs() then cmax() for row/block scalar
  • divide row by duplicated scalar
  • optionally emit scale via duplicated scalar to 64-lane row
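
As a reference for the arithmetic, the steps above can be sketched in plain Python. `absmax_normalize` is a hypothetical helper, and the guard for all-zero rows is an assumption added for this sketch, not something the source specifies.

```python
def absmax_normalize(row):
    """Scale one row by its absolute maximum and return the row plus the scale."""
    amax = max(abs(x) for x in row)           # models abs() followed by cmax()
    scale = amax if amax > 0.0 else 1.0       # assumption: avoid dividing by zero
    broadcast = [scale] * len(row)            # models the duplicated scalar
    scaled = [x / s for x, s in zip(row, broadcast)]
    return scaled, scale


scaled_row, row_scale = absmax_normalize([2.0, -4.0, 1.0])
```

The returned scale corresponds to the value that would optionally be emitted alongside the scaled row.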

Files to study:

  • agent/example/kernels/a5/matmul_chunk_absmax_norm128.py
  • agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py

5. Two-pass rowwise normalization

Use when the N dimension is too large for single-pass normalization.

  • pass 1: matmul + row-stat accumulation into persistent UB buffer + temporary output write
  • pass 2: reload temporary output and normalize by accumulated stats

The row-stat buffer must persist in UB across all N-tile passes within one M tile.
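
The two passes and the persistence requirement can be illustrated with a plain-Python model of one M tile. This is a sketch under assumptions: `two_pass_row_normalize` is a hypothetical helper, the matmul results are passed in as precomputed N-tiles, and a row sum is used as the example statistic.

```python
def two_pass_row_normalize(tiles):
    """tiles: list of N-tiles for one M tile; each N-tile is a list of M rows."""
    m = len(tiles[0])
    row_stat = [0.0] * m      # persistent across all N-tiles of this M tile
    temp = []                 # models the temporary output written in pass 1
    # Pass 1: accumulate row stats and write the temporary output.
    for tile in tiles:
        for r in range(m):
            row_stat[r] += sum(tile[r])
        temp.append([list(row) for row in tile])
    # Pass 2: reload the temporary output and normalize by the accumulated stats.
    return [[[x / row_stat[r] for x in tile[r]] for r in range(m)] for tile in temp]


result = two_pass_row_normalize([
    [[1.0, 1.0], [2.0, 0.0]],   # N-tile 0: rows 0 and 1
    [[2.0, 4.0], [1.0, 1.0]],   # N-tile 1: rows 0 and 1
])
```

Note that `row_stat` is created once per M tile and outlives the pass-1 loop, mirroring the buffer lifetime described above.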

Files to study:

  • agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py
  • agent/example/kernels/a5/matmul_rowwise_l2_norm.py

6. Running state across inner-loop iterations (a2 pattern)

When a kernel must accumulate a statistic (such as the running max for online softmax) across tiles in the inner loop, the first iteration cannot be special-cased, because the DSL has no conditional logic (no if first_iteration).

Solution: identity-element initialization + unconditional update.

Initialize the running buffer to the identity element of the accumulation operation before the inner loop using dup, then apply the update unconditionally every iteration:

| Accumulation | Identity element | Update operation | Example |
| --- | --- | --- | --- |
| running max | neg_large (finite negative sentinel) | vmax(running, running, tile) | online softmax max tracking |
| running sum | 0.0 | add(running, running, tile) | online softmax sum accumulation |
| running product | 1.0 | mul(running, running, tile) | decay chain products |

Why this works: with neg_large chosen below every valid tile value, max(neg_large, x) = x, 0 + x = x, and 1 × x = x, so the first iteration naturally produces the correct initial value without special-casing.
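
The identity-element argument can be checked with a few lines of plain Python. This is a host-side sketch, not DSL code: `run_unconditional` is a hypothetical helper, and the value of `NEG_LARGE` is an assumed finite sentinel chosen below the sample data.

```python
NEG_LARGE = -3.0e38   # assumption: a finite sentinel below every valid tile value


def run_unconditional(tile_stats, identity, update):
    """Start from the identity and apply the same update on every iteration."""
    running = identity
    for stat in tile_stats:
        running = update(running, stat)   # no first-iteration branch
    return running


tile_stats = [2.5, -1.0, 4.0]
running_max = run_unconditional(tile_stats, NEG_LARGE, max)
running_sum = run_unconditional(tile_stats, 0.0, lambda a, b: a + b)
running_prod = run_unconditional(tile_stats, 1.0, lambda a, b: a * b)
```

Each accumulator ends at the same value it would have had if the first tile had been copied in directly.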

Choosing the right tensor format for the update operation:

The update (vmax, add, etc.) is a binary element-wise operation between two UB tensors. Both must have matching stride layouts, and the operation must cover all intended elements.

On a2 without registers, cmax outputs dense scalars in [M, 1] format. Operate on this format directly:

    # Correct: vmax on [64, 1] covers all 64 rows
    vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # both [HALF_M, 1]

Do NOT broadcast to [M, 8] first and then attempt vmax between two [M, 8] buffers: blk_stride=0 makes that operation cover only 1/8 of the elements. See agent/references/constraints/vec-reduction-a2.md section 5 for the detailed proof.

Lifetime and reset rules:

  • The running buffer must be reset at the beginning of each outer loop iteration (each new M-tile gets fresh running stats)
  • It persists across all inner loop iterations (N-tiles accumulate into it)
  • UB is per-sub-block and persistent, so no special lifetime management is needed

Complete pattern:

    ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
    with auto_sync():
        for gmt in range(mt_begin, mt_end):           # outer: M-tiles
            dup(ub_rmax_s, neg_large)                 # reset per M-tile with the running-max identity
            for nt in range(0, tiles_n):              # inner: N-tiles
                # ... compute tile, get ub_max_s via cmax ...
                vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # accumulate
                brcb(ub_max, ub_rmax_s, ...)          # broadcast AFTER update
                # ... subtract, exp, store ...

On a5 with Reg/RegList, the same pattern uses register-level dup and vmaxs instead of UB-level operations. The identity-element principle is the same.

Files to study:

  • agent/example/kernels/a2/flash_attn_score_iter.py: validated running max pattern on a2
  • agent/example/kernels/a5/test_mla_entire.py: streamed running max/sum in a5 register pipeline

7. a2 vec reduction (no registers)

On a2, Reg/RegList are not available. Reductions use UB-to-UB operations:

  • cmax/cadd reduce 64 elements to 1 scalar per repeat (dense output)
  • The dense output must be broadcast via brcb before use in sub/div
  • For buffers wider than 64 columns, first merge with vmax/add to 64

Complete pattern: vmax → cmax → brcb → sub (sliced for repeat alignment)
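
The data flow of that chain can be modelled on the host in plain Python. This is only a sketch of the arithmetic under stated assumptions: `a2_row_max_subtract` is a hypothetical helper, row widths are assumed to be multiples of 64, `BLOCK = 8` assumes the broadcast fills one 8-float block per scalar, and strides, repeats, and slicing for alignment are not modelled. The constraints file referenced in this section remains the authority on the real instruction behaviour.

```python
LANES = 64   # elements reduced to one scalar per repeat (from the text above)
BLOCK = 8    # assumption: floats per block filled by the broadcast


def a2_row_max_subtract(rows):
    """Model of vmax -> cmax -> brcb -> sub for rows whose width is a multiple of 64."""
    out = []
    for row in rows:
        # vmax: fold wider rows down to 64 lanes with an element-wise max.
        merged = row[:LANES]
        for start in range(LANES, len(row), LANES):
            merged = [max(a, b) for a, b in zip(merged, row[start:start + LANES])]
        scalar = max(merged)             # cmax: 64 elements -> 1 dense scalar
        block = [scalar] * BLOCK         # brcb: scalar expanded to one block
        # sub: every 8-element slice of the row subtracts the same block.
        out.append([x - block[i % BLOCK] for i, x in enumerate(row)])
    return out


shifted = a2_row_max_subtract([[float(i) for i in range(128)]])
```

After the subtraction the largest element of each row is 0.0, which is the precondition for a numerically safe exp().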

Read: agent/references/constraints/vec-reduction-a2.md

8. General rules

  • keep reductions in registers when possible (a5 with @vf)
  • on a2, use the cmax → brcb UB-to-UB pattern instead
  • use dup() to broadcast a scalar reduction result back to full-row width before element-wise operations
  • for multi-pass flows, decide upfront which UB buffers persist across passes and which are reused


Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
