CANN/EasyAsc DSL Reduction Constraints

Reduction Constraints

cannbot-skills: CANNBot is a family of agents for improving development efficiency on CANN; this repository provides reusable Skills modules for it. Project: https://gitcode.com/cann/cannbot-skills

Read this file when a kernel needs row-wise reductions, normalization, softmax, or quantization statistics inside a `@vf` stage.

Goal

Choose the right reduction idiom so that:

the reduction stays in registers when possible to reduce UB traffic
multi-pass flows keep intermediate state persistent across tile passes
streaming stats follow the correct update order for numerical stability

1. Row sum normalization

Single-pass pattern:

`sum = RegList.cadd()`
`dup(sum)`, then divide the row registers
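As a host-side reference for what the two steps above compute, the following is a minimal pure-Python sketch. It is not EasyAsc DSL code: `normalize_rows_by_sum` is a hypothetical helper, and plain lists stand in for row registers.

```python
def normalize_rows_by_sum(rows):
    """Divide every row by its own sum (reference semantics only)."""
    out = []
    for row in rows:
        total = sum(row)                # plays the role of RegList.cadd()
        broadcast = [total] * len(row)  # plays the role of dup(sum)
        out.append([x / s for x, s in zip(row, broadcast)])
    return out


normalized = normalize_rows_by_sum([[1.0, 3.0], [2.0, 2.0, 4.0]])
```

Each output row sums to 1, which makes this a convenient oracle when checking a kernel's result on small inputs.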

Files to study:

agent/example/kernels/a5/matmul_rowwise_norm.py

2. Tiled softmax (three-pass)

Use when the full `S` dimension does not fit in one tile.

pass 1: store float logits and track the row max with `cmax()` + `dup()`
pass 2: reload logits, subtract the duplicated row max, `exp()`, accumulate the row sum, and store exponentials
pass 3: reload exponentials and divide by the duplicated row sum before the final cast
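The three passes can be emulated on the host to check the data flow. This is a sketch in plain Python under stated assumptions: `tiled_softmax_row`, `scratch`, and the finite `NEG_LARGE` sentinel are illustrative names and values, not part of the DSL, and the final cast is omitted.

```python
import math

NEG_LARGE = -3.0e38  # assumed finite sentinel below any valid logit


def tiled_softmax_row(logits, tile):
    """Softmax of one row, visiting `tile` elements at a time in three passes."""
    n = len(logits)
    scratch = [0.0] * n  # stands in for the stored float logits / exponentials

    # Pass 1: store float logits and track the row max.
    row_max = NEG_LARGE
    for start in range(0, n, tile):
        chunk = logits[start:start + tile]
        scratch[start:start + tile] = chunk
        row_max = max(row_max, max(chunk))

    # Pass 2: reload, subtract the row max, exponentiate, accumulate the row sum.
    row_sum = 0.0
    for start in range(0, n, tile):
        chunk = [math.exp(x - row_max) for x in scratch[start:start + tile]]
        scratch[start:start + tile] = chunk
        row_sum += sum(chunk)

    # Pass 3: reload exponentials and divide by the row sum.
    out = [0.0] * n
    for start in range(0, n, tile):
        out[start:start + tile] = [e / row_sum for e in scratch[start:start + tile]]
    return out


probs = tiled_softmax_row([1.0, 2.0, 3.0, 4.0, 5.0], tile=2)
```

The result does not depend on the tile size, because the row max and row sum are complete before they are first used.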

3. Streaming softmax-style MLA stats

Single-pass online update per tile:

curr_max = max(prev_max, qk_tile.amax(-1))
p_tile = exp(qk_tile - curr_max)
row_sum = prev_sum * exp(prev_max - curr_max) + p_tile.sum(-1)
if the tile must be materialized in fp8, update `row_sum` from the float tile first and cast only after the float reduction is complete
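The update order can be checked against a one-shot computation. The sketch below is a pure-Python reference for a single row; `streaming_row_stats` and the `NEG_LARGE` starting value are assumptions made for illustration, and the fp8 cast is not modeled.

```python
import math

NEG_LARGE = -3.0e38  # assumed finite sentinel used as the initial running max


def streaming_row_stats(tiles):
    """Return (row_max, row_sum) after visiting each tile of one row once."""
    prev_max = NEG_LARGE
    prev_sum = 0.0
    for qk_tile in tiles:
        curr_max = max(prev_max, max(qk_tile))
        p_tile = [math.exp(x - curr_max) for x in qk_tile]
        # Rescale the old sum to the new max before adding the new tile.
        row_sum = prev_sum * math.exp(prev_max - curr_max) + sum(p_tile)
        prev_max, prev_sum = curr_max, row_sum
    return prev_max, prev_sum


tiles = [[0.5, 2.0], [3.0, -1.0], [1.0, 2.5]]
row_max, row_sum = streaming_row_stats(tiles)
```

On the first tile `prev_sum` is zero and the rescale factor underflows to zero, so the old state contributes nothing.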

Streaming MLA with final normalized `output:[B,H,Dn]`:

rescale the running numerator by `exp(prev_max - curr_max)` before adding the current block contribution
keep the numerator in float across all blocks
apply `output /= row_sum` only once after the loop
if the value path intentionally uses `p.half().float()`, update `row_sum` from the float `exp(...)` tile before the cast, then cast only for the downstream cube consume path
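A host-side sketch of these rules for one query row follows. It is an illustration under assumptions, not the kernel: `streaming_attention_row` and `to_half` are hypothetical helpers, the half-precision round trip is emulated with the standard `struct` format `"e"`, and the value blocks are plain lists of `Dn`-length vectors.

```python
import math
import struct

NEG_LARGE = -3.0e38  # assumed finite sentinel for the initial running max


def to_half(x):
    """Round a float to IEEE half precision and back (emulates p.half().float())."""
    return struct.unpack("e", struct.pack("e", x))[0]


def streaming_attention_row(score_blocks, value_blocks):
    """score_blocks[b][j] is a score; value_blocks[b][j] is a Dn-length vector."""
    dn = len(value_blocks[0][0])
    prev_max, row_sum = NEG_LARGE, 0.0
    numer = [0.0] * dn  # running numerator, kept in float across all blocks
    for scores, values in zip(score_blocks, value_blocks):
        curr_max = max(prev_max, max(scores))
        scale = math.exp(prev_max - curr_max)
        p = [math.exp(s - curr_max) for s in scores]
        row_sum = row_sum * scale + sum(p)  # from the float tile, before the cast
        p_cast = [to_half(x) for x in p]    # cast only for the value path
        contrib = [sum(pc * v[d] for pc, v in zip(p_cast, values)) for d in range(dn)]
        numer = [n * scale + c for n, c in zip(numer, contrib)]
        prev_max = curr_max
    return [n / row_sum for n in numer]     # normalize once, after the loop


scores = [[0.0, 1.0], [2.0, -1.0]]
values = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
output = streaming_attention_row(scores, values)
```

Because the numerator sees half-rounded probabilities while `row_sum` does not, the result matches a full-precision reference only to about half-precision accuracy.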

Files to study:

agent/example/kernels/a5/test_mla_entire.py
agent/example/kernels/a2/flash_attn_full.py

4. Absmax scaling and quantization

`abs()` then `cmax()` for the row/block scalar
divide row by duplicated scalar
optionally emit scale via duplicated scalar to 64-lane row
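The three steps above can be sketched in plain Python as a reference for one row. `absmax_scale_row` is a hypothetical helper, and the handling of an all-zero row is an assumption added here to avoid a division by zero; the source does not specify it.

```python
def absmax_scale_row(row, lanes=64):
    """Scale one row by its absolute maximum and emit the scale as a `lanes`-wide row."""
    absmax = max(abs(x) for x in row)  # abs() then cmax()
    if absmax == 0.0:
        # Assumption: pass an all-zero row through unchanged with a zero scale.
        return list(row), [0.0] * lanes
    dup_scale = [absmax] * len(row)    # scalar duplicated to row width
    scaled = [x / s for x, s in zip(row, dup_scale)]
    scale_row = [absmax] * lanes       # optional scale output, duplicated to 64 lanes
    return scaled, scale_row


scaled, scale_row = absmax_scale_row([-4.0, 2.0, 1.0])
```

After scaling, every element lies in `[-1, 1]`, which is the range a subsequent quantization cast would consume.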

Files to study:

agent/example/kernels/a5/matmul_chunk_absmax_norm128.py
agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py

5. Two-pass rowwise normalization

Use when the `N` dimension is too large for single-pass normalization.

pass 1: matmul + row-stat accumulation into persistent UB buffer + temporary output write
pass 2: reload temporary output and normalize by accumulated stats

The row-stat buffer must persist in UB across all `N`-tile passes within one `M` tile.
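The two passes for one `M` tile can be emulated on the host as follows. This is a sketch under assumptions: `two_pass_rowwise_norm` is a hypothetical name, the row statistic is taken to be a plain row sum, and nested lists stand in for UB and temporary output buffers.

```python
def two_pass_rowwise_norm(a, b, tile_n):
    """Normalize each row of a @ b by its row sum, visiting N in tiles of tile_n."""
    m, k, n = len(a), len(b), len(b[0])
    row_stat = [0.0] * m                  # persists across all N-tile passes
    temp = [[0.0] * n for _ in range(m)]  # temporary output written in pass 1

    # Pass 1: matmul per N tile, accumulate row stats, write temporary output.
    for n0 in range(0, n, tile_n):
        for i in range(m):
            for j in range(n0, min(n0 + tile_n, n)):
                c = sum(a[i][p] * b[p][j] for p in range(k))
                temp[i][j] = c
                row_stat[i] += c

    # Pass 2: reload the temporary output and normalize by the accumulated stats.
    out = [[0.0] * n for _ in range(m)]
    for n0 in range(0, n, tile_n):
        for i in range(m):
            for j in range(n0, min(n0 + tile_n, n)):
                out[i][j] = temp[i][j] / row_stat[i]
    return out


result = two_pass_rowwise_norm(
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
    tile_n=2,
)
```

Pass 2 cannot start for a row until every `N` tile of pass 1 has contributed to `row_stat`, which is why the buffer has to outlive the individual tile passes.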

Files to study:

agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py
agent/example/kernels/a5/matmul_rowwise_l2_norm.py

6. Running state across inner-loop iterations (a2 pattern)

When a kernel must accumulate a statistic (such as the running max for online softmax) across tiles in the inner loop, the first iteration cannot be special-cased: the DSL has no conditional logic such as `if first_iteration`.

Solution: identity-element initialization + unconditional update.

Initialize the running buffer to the identity element of the accumulation operation before the inner loop using `dup`, then apply the update unconditionally every iteration:

| Accumulation | Identity element | Update operation | Example |
| --- | --- | --- | --- |
| running max | `neg_large` (finite negative sentinel) | `vmax(running, running, tile)` | online softmax max tracking |
| running sum | `0.0` | `add(running, running, tile)` | online softmax sum accumulation |
| running product | `1.0` | `mul(running, running, tile)` | decay chain products |

Why this works: with `neg_large` chosen below every valid tile value, `max(neg_large, x) = x`, `0 + x = x`, and `1 × x = x`, so the first iteration naturally produces the correct initial value without special-casing.
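The identity-element argument is easy to verify on the host. In this minimal Python sketch, `accumulate` is a hypothetical helper and the value of `NEG_LARGE` is an assumed sentinel; the loop body contains no first-iteration branch.

```python
NEG_LARGE = -3.0e38  # assumed finite sentinel below every valid tile value


def accumulate(tiles, identity, update):
    """Fold `update` over `tiles`, starting from the identity element."""
    running = identity
    for tile in tiles:
        running = update(running, tile)  # unconditional, same on every iteration
    return running


tiles = [2.0, -5.0, 7.5]
running_max = accumulate(tiles, NEG_LARGE, max)
running_sum = accumulate(tiles, 0.0, lambda r, t: r + t)
running_prod = accumulate(tiles, 1.0, lambda r, t: r * t)
```

If a valid tile value could fall below the sentinel, the running max would be wrong, so the sentinel has to be chosen from the known value range of the data.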

Choosing the right tensor format for the update operation:

The update (`vmax`, `add`, etc.) is a binary element-wise operation between two UB tensors. Both must have matching stride layouts, and the operation must cover all intended elements.

On a2 without registers, `cmax` outputs dense scalars in `[M, 1]` format. Operate on this format directly:

    # Correct: vmax on [64, 1] covers all 64 rows
    vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # both [HALF_M, 1]

Do NOT broadcast to `[M, 8]` first and then attempt `vmax` between two `[M, 8]` buffers: `blk_stride=0` makes that operation cover only 1/8 of the elements. See agent/references/constraints/vec-reduction-a2.md section 5 for the detailed proof.

Lifetime and reset rules:

The running buffer must be reset at the beginning of each outer loop iteration (each new M-tile gets fresh running stats)
It persists across all inner loop iterations (N-tiles accumulate into it)
UB is per-sub-block and persistent, so no special lifetime management is needed
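The effect of the reset rule can be seen in a small host-side emulation. This is a sketch, not DSL code: `running_max_per_m_tile` is a hypothetical helper, the per-row maxima of each N-tile are given as input, and one reused Python list stands in for the persistent UB buffer.

```python
NEG_LARGE = -3.0e38  # assumed finite sentinel (running-max identity)


def running_max_per_m_tile(m_tiles):
    """m_tiles[g][t] holds the per-row maxima produced by N-tile t of M-tile g."""
    rows = len(m_tiles[0][0])
    running = [NEG_LARGE] * rows  # allocated once, reused for every M-tile
    results = []
    for n_tiles in m_tiles:                   # outer: M-tiles
        running[:] = [NEG_LARGE] * rows       # reset per M-tile
        for tile_max in n_tiles:              # inner: N-tiles
            running[:] = [max(r, t) for r, t in zip(running, tile_max)]
        results.append(list(running))
    return results


per_tile_max = running_max_per_m_tile([
    [[5.0, 1.0], [2.0, 3.0]],
    [[-1.0, 0.0], [-2.0, -4.0]],
])
```

Without the reset line, the second M-tile would inherit `[5.0, 3.0]` from the first and report maxima that never occur in its own data.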

Complete pattern:

    ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

    with auto_sync():
        for gmt in range(mt_begin, mt_end):          # outer: M-tiles
            dup(ub_rmax_s, neg_large)                # reset per M-tile with the running-max identity
            for nt in range(0, tiles_n):             # inner: N-tiles
                # ... compute tile, get ub_max_s via cmax ...
                vmax(ub_rmax_s, ub_rmax_s, ub_max_s) # accumulate
                brcb(ub_max, ub_rmax_s, ...)         # broadcast AFTER update
                # ... subtract, exp, store ...

On a5 with `Reg`/`RegList`, the same pattern uses register-level `dup` and `vmaxs` instead of UB-level operations. The identity-element principle is the same.

Files to study:

agent/example/kernels/a2/flash_attn_score_iter.py: validated running max pattern on a2
agent/example/kernels/a5/test_mla_entire.py: streamed running max/sum in the a5 register pipeline

7. a2 vec reduction (no registers)

On a2, `Reg`/`RegList` are not available. Reductions use UB-to-UB operations:

`cmax`/`cadd` reduce 64 elements to 1 scalar per repeat (dense output)
The dense output must be broadcast via `brcb` before use in `sub`/`div`
For buffers wider than 64 columns, first merge with `vmax`/`add` down to 64 columns

Complete pattern: `vmax → cmax → brcb → sub` (sliced for repeat alignment)

Read: agent/references/constraints/vec-reduction-a2.md
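The data flow of this chain can be sketched for one row in plain Python. `a2_subtract_row_max` is a hypothetical reference helper; it assumes the row width is a multiple of 64 and does not model repeat alignment or stride parameters.

```python
def a2_subtract_row_max(row, lanes=64):
    """Reference data flow for vmax -> cmax -> brcb -> sub on one row."""
    if len(row) == 0 or len(row) % lanes != 0:
        raise ValueError("row width must be a non-zero multiple of `lanes`")

    # vmax: fold a buffer wider than 64 columns down to 64 element-wise maxima.
    merged = list(row[:lanes])
    for start in range(lanes, len(row), lanes):
        chunk = row[start:start + lanes]
        merged = [max(a, b) for a, b in zip(merged, chunk)]

    row_max = max(merged)              # cmax: 64 elements -> 1 dense scalar
    broadcast = [row_max] * len(row)   # brcb: scalar back to full row width
    return [x - m for x, m in zip(row, broadcast)]  # sub


shifted = a2_subtract_row_max([float(i) for i in range(128)])
```

For a sum reduction the same shape applies, with `add` in place of the element-wise max and `cadd` in place of `cmax`.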

8. General rules

keep reductions in registers when possible (a5 with `@vf`)
on a2, use the `cmax → brcb` UB-to-UB pattern instead
use `dup()` to broadcast a scalar reduction result back to full-row width before element-wise operations
for multi-pass flows, decide upfront which UB buffers persist across passes and which are reused

