Vec Stride and Slicing Constraints
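[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development that aims to improve development efficiency; this repository provides its reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills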
Read this file when a vec operation needs to access part of a wider buffer, or when a "narrow" source (e.g. row-max buffer) must align with a "wide" destination row by row.
Goal
Decide correctly when a vec operation can run continuously over a full buffer versus when it requires sliced views or explicit stride configuration.
1. The alignment problem
Vec operations infer `repeat` from the destination tensor and strides from each tensor's `span`/`shape`. When a wide buffer (e.g. `[M, 128]`) is paired with a narrow buffer (e.g. `[M, 8]`), the repeat counts may not align row by row.
For float (C0=8):
- `[M, 128]` → `span1=128` does not match `8*C0=64` or `C0=8` → default strides (`blk=1, rep=8`)
  - Each row takes 2 repeats (128 / 64 = 2)
- `[M, 8]` → `span1=8 == C0` → `blk=0, rep=1`
  - Each row takes 1 repeat from the narrow buffer
If `sub(wide[M,128], wide[M,128], narrow[M,8])` is called directly:
- `repeat = M * 128 / 64 = 2M` (from dst)
- narrow advances 1 per repeat → after repeat 0 (row 0 first half), narrow moves to row 1
- row 0's second half gets row 1's value → misaligned!
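The drift described above can be traced with a few lines of plain Python. This is a simplified model rather than the DSL itself: the function `narrow_row_per_repeat` is a hypothetical helper, and it assumes the narrow source advances exactly one row per repeat while the wide destination needs `width // 64` repeats per row.

```python
def narrow_row_per_repeat(rows, width, elems_per_repeat=64):
    """Return (wide_row, narrow_row) for every repeat in a simplified model.

    Assumption: the wide buffer consumes `elems_per_repeat` elements per
    repeat, and the narrow buffer advances one row per repeat.
    """
    repeats_per_row = width // elems_per_repeat
    pairs = []
    for repeat in range(rows * repeats_per_row):
        wide_row = repeat // repeats_per_row  # row of the wide buffer being written
        narrow_row = repeat                   # row the narrow source has advanced to
        pairs.append((wide_row, narrow_row))
    return pairs


pairs = narrow_row_per_repeat(rows=4, width=128)
for wide_row, narrow_row in pairs[:4]:
    status = "ok" if wide_row == narrow_row else "MISALIGNED"
    print(f"wide row {wide_row} reads narrow row {narrow_row}: {status}")
```

In this model only repeat 0 is aligned; repeat 1 already pairs the second half of row 0 with narrow row 1, and later repeats run past the end of the narrow buffer.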
2. Fix: slice the wide buffer to 64-column views
Slicing to `[M, 64]` creates a view where `span1=64 == 8*C0`:
- `blk=1, rep=shape[1]//C0` (e.g. `128//8=16` for a 128-wide parent)
- Each row takes 1 repeat → aligns with the narrow buffer's `rep=1`

    # Correct: sliced views ensure 1 repeat per row
    sub(ub[0:M, 0:64], ub[0:M, 0:64], max_buf)        # first half
    sub(ub[0:M, 64:128], ub[0:M, 64:128], max_buf)    # second half

The slice syntax creates a Tensor view with updated `span` and `offset` while keeping the original `shape`. The stride auto-inference uses `span` for stride selection and `shape` for `rep_stride` calculation, which correctly skips the full row width between repeats.
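To check the effect of the two sliced calls without the hardware, the following sketch emulates them on nested Python lists. It is an illustration under stated assumptions, not the DSL implementation: `sliced_sub` is a hypothetical stand-in for `sub` on a 64-column view, and the `[M, 8]` narrow buffer is reduced to one scalar per row.

```python
def sliced_sub(wide, row_vals, col_start, col_end):
    """Emulate an in-place sub on a 64-column view: one repeat per row."""
    assert col_end - col_start == 64, "view must be 64 columns wide for float"
    for repeat, row in enumerate(wide):
        # Repeat index equals row index, so the narrow value stays aligned.
        for col in range(col_start, col_end):
            row[col] -= row_vals[repeat]


M = 4
wide = [[float(r * 128 + c) for c in range(128)] for r in range(M)]
row_max = [max(row) for row in wide]
expected = [[v - row_max[r] for v in wide[r]] for r in range(M)]

sliced_sub(wide, row_max, 0, 64)     # first half
sliced_sub(wide, row_max, 64, 128)   # second half

print(wide == expected)
```

Because each call covers exactly one repeat per row, both halves of a row subtract the same row value, which is the row-wise result the unsliced call fails to produce.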
3. When slicing is NOT needed
Purely element-wise operations (no narrow source) can run continuously over the full buffer:
| Operation | Needs slicing? | Reason |
|---|---|---|
| `muls(wide, wide, scalar)` | No | Scalar broadcasts uniformly |
| `exp(wide, wide)` | No | Same-shape in-place, no alignment issue |
| `cast(half_out, float_in)` | No | Same-shape element-wise conversion |
| `sub(wide, wide, narrow)` | Yes | Narrow source advances 1 row/repeat |
| `vmax(dst64, wide_half1, wide_half2)` | Yes | Need column views of a wider buffer |
| `brcb(wide, narrow)` | Explicit strides | See brcb section |
Rule: if all source and destination tensors have the same `span` and are operated on element-wise, no slicing is needed. If any operand is narrower than the others, slice the wider operands to match the narrow operand's per-row repeat cadence.
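The rule can be written down as a small planning helper. Both functions below are hypothetical and not part of the repository; they assume 2-D spans and that a sliced view should be exactly `8 * C0` columns wide, as in the 64-column float case above.

```python
def needs_slicing(spans):
    """True when the operands differ in per-row width (span[1])."""
    return len({span[1] for span in spans}) > 1


def column_views(width, c0):
    """Split a wide row into (col_start, col_end) views of 8 * c0 columns."""
    step = 8 * c0
    if width % step:
        raise ValueError(f"width {width} is not a multiple of {step}")
    return [(start, start + step) for start in range(0, width, step)]


# exp(wide, wide): same span everywhere, runs continuously
print(needs_slicing([(64, 128), (64, 128)]))
# sub(wide, wide, narrow): the narrow operand forces sliced views
print(needs_slicing([(64, 128), (64, 128), (64, 8)]))
# Column ranges to use for a float [M, 128] buffer
print(column_views(128, c0=8))
```

Operations such as `brcb` that take explicit strides fall outside this helper and still need their own configuration.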
4. Stride auto-inference rules
From `vecutils.infer_strides(tensor)` for float (C0=8):
| `span[1]` | Matches | `blk_stride` | `rep_stride` |
|---|---|---|---|
| `64` (= 8×C0) | Yes | 1 | `shape[1] // C0` |
| `8` (= C0) | Yes | 0 | `shape[1] // C0` |
| other | No | 1 (default) | 8 (default) |
For half (C0=16):
| `span[1]` | Matches | `blk_stride` | `rep_stride` |
|---|---|---|---|
| `128` (= 8×C0) | Yes | 1 | `shape[1] // C0` |
| `16` (= C0) | Yes | 0 | `shape[1] // C0` |
| other | No | 1 (default) | 8 (default) |
When `span[0] == 1` and a match occurred, `rep_stride` is overridden to `0`.
`infer_repeat(tensor)` always uses: `span[0] * span[1] / (256 // dtype.size)`
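The tables and the repeat formula can be restated as a short Python sketch. This is a re-implementation for illustration only, not the code in `vecutils.py`: the `View` class is hypothetical, and `C0` is derived as `32 // dtype_size`, an assumption that reproduces the values used here (8 for float, 16 for half).

```python
from dataclasses import dataclass


@dataclass
class View:
    shape: tuple      # full allocation, (rows, cols)
    span: tuple       # region actually covered, (rows, cols)
    dtype_size: int   # bytes per element: 4 for float, 2 for half


def infer_strides(t):
    """Return (blk_stride, rep_stride) following the tables above."""
    c0 = 32 // t.dtype_size              # assumed: one block holds 32 bytes
    if t.span[1] == 8 * c0:
        blk, rep = 1, t.shape[1] // c0
    elif t.span[1] == c0:
        blk, rep = 0, t.shape[1] // c0
    else:
        return 1, 8                      # no match: default strides
    if t.span[0] == 1:
        rep = 0                          # single-row override after a match
    return blk, rep


def infer_repeat(t):
    return t.span[0] * t.span[1] // (256 // t.dtype_size)


full = View(shape=(64, 128), span=(64, 128), dtype_size=4)
half_cols = View(shape=(64, 128), span=(64, 64), dtype_size=4)
narrow = View(shape=(64, 8), span=(64, 8), dtype_size=4)

print(infer_strides(full), infer_repeat(full))
print(infer_strides(half_cols), infer_repeat(half_cols))
print(infer_strides(narrow))
```

For the unsliced float buffer this yields the default strides and 128 repeats (two per row), while the 64-column view yields `rep_stride = 16` and 64 repeats, matching the one-repeat-per-row cadence of the narrow buffer.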
5. Column slicing via Tensor views
DSL tensor slicing (`tensor[row_start:row_end, col_start:col_end]`) creates a view with:
- `offset` adjusted to the slice start
- `span` set to the slice extent
- `shape` inherited from the parent (full allocation width)
This means `rep_stride = shape[1] // C0` correctly accounts for the full row width, while `repeat = span[0] * span[1] // (256 // dtype_size)` only covers the sliced region.
Example for `ub_data[0:64, 64:128]` where `ub_data` is `Tensor(float, [64, 128])`:
- `span = [64, 64]`, `shape = [64, 128]`, `offset = [0, 64]`
- `blk=1, rep=128//8=16` (skips full 128-wide row)
- `repeat = 64*64/64 = 64` (one repeat per row)
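The numbers in this example can be reproduced with a minimal view model. `TensorView` and its `slice` method are hypothetical names used only for this sketch; the real slicing logic lives in `easyasc/utils/Tensor.py` and may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorView:
    shape: tuple            # full allocation, inherited by every view
    span: tuple             # extent covered by this view
    offset: tuple = (0, 0)  # start of this view inside the allocation

    def slice(self, rows, cols):
        """Return a sub-view; shape is kept, span and offset are updated."""
        (r0, r1), (col0, col1) = rows, cols
        return TensorView(
            shape=self.shape,
            span=(r1 - r0, col1 - col0),
            offset=(self.offset[0] + r0, self.offset[1] + col0),
        )


C0 = 8            # float
DTYPE_SIZE = 4    # bytes per float

ub_data = TensorView(shape=(64, 128), span=(64, 128))
view = ub_data.slice((0, 64), (64, 128))

rep_stride = view.shape[1] // C0                             # in blocks of C0 elements
repeat = view.span[0] * view.span[1] // (256 // DTYPE_SIZE)
# Element index where each of the first three repeats starts
starts = [view.offset[1] + r * rep_stride * C0 for r in range(3)]

print(view.span, view.shape, view.offset)
print(rep_stride, repeat, starts)
```

The start positions step by 128 elements, so each repeat begins at column 64 of the next row, which is how the view skips the unsliced half of every row.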
Files to study
- `easyasc/stub_functions/vec/vecutils.py`: stride inference logic
- `easyasc/utils/Tensor.py`: slice/view creation
- `agent/example/kernels/a2/flash_attn_score.py`: practical use of sliced sub + continuous exp/cast
Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.