CANN: PID Batch Rollout Scoring Algorithm

PidFopdtBatchRolloutScore Algorithm

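Project: mat-chem-sim-pred targets industrial applications and focuses on two core scenarios, computational simulation and prediction. It builds a domain computing layer for the process industry driven by both mechanistic models and data, and promotes the in-depth application of AI for Science in materials chemistry. Project address: https://gitcode.com/cann/mat-chem-sim-pred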

Purpose

This operator evaluates many PID candidates for many FOPDT loops during the tuning stage and returns the best candidate for each loop.

The target workload is:

batch loops x candidate set x rollout time steps

Model

The plant model is discretized FOPDT:

y[k+1] = a * y[k] + b * u[k-delay]
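
The text does not state how a, b, and delay are obtained from the continuous plant. A minimal sketch, assuming the usual zero-order-hold discretization of a first-order-plus-dead-time plant (the names gain, tau, dead_time, and both function names are hypothetical, and inputs before k = 0 are assumed to be zero):

```python
import math
from collections import deque


def discretize_fopdt(gain, tau, dead_time, dt):
    """Assumed ZOH discretization: a = exp(-dt/tau), b = gain * (1 - a)."""
    a = math.exp(-dt / tau)
    b = gain * (1.0 - a)
    delay = int(round(dead_time / dt))  # dead time expressed in whole samples
    return a, b, delay


def simulate_open_loop(a, b, delay, u_seq, y0=0.0):
    """Apply y[k+1] = a * y[k] + b * u[k-delay] to a sequence of inputs."""
    history = deque([0.0] * delay)  # holds u[k-delay] .. u[k-1], zero before k = 0
    y = y0
    ys = [y]
    for u in u_seq:
        history.append(u)
        u_delayed = history.popleft()  # this is u[k-delay]
        y = a * y + b * u_delayed
        ys.append(y)
    return ys
```

With a unit step input, the output stays at zero for delay + 1 samples and then approaches gain, which is a quick sanity check on the chosen a and b.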

The PID law is:

e[k] = sp - y[k]
integral += e[k] * dt
derivative = (e[k] - e[k-1]) / dt
u[k] = clamp(Kp * e[k] + Ki * integral + Kd * derivative, -10, 10)
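
The PID law above can be sketched as a single update function. This is an illustration, not the operator's code: the function name, the state dictionary, and the initial values (integral and previous error both zero) are assumptions. As in the law as written, the integral keeps accumulating while the output is clamped.

```python
def pid_step(sp, y, state, kp, ki, kd, dt, u_min=-10.0, u_max=10.0):
    """One PID update; state carries 'integral' and 'prev_error' between steps."""
    e = sp - y
    integral = state["integral"] + e * dt
    derivative = (e - state["prev_error"]) / dt
    u_raw = kp * e + ki * integral + kd * derivative
    u = min(max(u_raw, u_min), u_max)  # clamp to the [-10, 10] actuator range
    state["integral"] = integral
    state["prev_error"] = e
    return u


# Usage: one step from rest toward a setpoint of 1.0.
state = {"integral": 0.0, "prev_error": 0.0}
u0 = pid_step(sp=1.0, y=0.0, state=state, kp=2.0, ki=0.5, kd=0.0, dt=0.1)
```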

Score

For each candidate, the rollout accumulates:

IAE
ISE
overshoot
settling_time
control_energy

The optimization target is:

score = IAE + overshoot_weight * overshoot + settling_weight * settling_time + control_weight * control_energy

The operator returns the candidate with the minimum score.
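
A minimal scalar sketch of the scoring and selection for one loop follows. The text does not define the individual metrics, so the definitions used here are assumptions: IAE and ISE are rectangle-rule sums, overshoot is the peak excursion above the setpoint, settling_time is the time of the last sample outside a 2% band, and control_energy is the sum of u² · dt. The function names, default weights, and the zero initial state are also hypothetical.

```python
from collections import deque


def rollout_score(a, b, delay, kp, ki, kd, sp=1.0, dt=0.1, steps=400,
                  overshoot_weight=1.0, settling_weight=0.1,
                  control_weight=0.01, band=0.02):
    """Closed-loop rollout of one PID candidate; returns (score, metrics)."""
    y, integral, prev_e = 0.0, 0.0, 0.0
    u_hist = deque([0.0] * delay)        # delayed inputs, zero before k = 0
    iae = ise = energy = 0.0
    peak, last_outside = y, 0
    for k in range(steps):
        e = sp - y
        integral += e * dt
        derivative = (e - prev_e) / dt
        u = min(max(kp * e + ki * integral + kd * derivative, -10.0), 10.0)
        prev_e = e
        iae += abs(e) * dt
        ise += e * e * dt
        energy += u * u * dt
        if abs(e) > band * abs(sp):      # assumed 2% settling band
            last_outside = k + 1
        peak = max(peak, y)
        u_hist.append(u)
        y = a * y + b * u_hist.popleft() # y[k+1] = a*y[k] + b*u[k-delay]
    metrics = {"IAE": iae, "ISE": ise, "overshoot": max(0.0, peak - sp),
               "settling_time": last_outside * dt, "control_energy": energy}
    score = (iae + overshoot_weight * metrics["overshoot"]
             + settling_weight * metrics["settling_time"]
             + control_weight * energy)
    return score, metrics


def best_candidate(a, b, delay, candidates, **kwargs):
    """Return (index, score) of the (Kp, Ki, Kd) candidate with minimum score."""
    scores = [rollout_score(a, b, delay, *c, **kwargs)[0] for c in candidates]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]
```

ISE is accumulated but, as in the formula above, does not enter the score.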

NPU Execution Strategy

The current implementation uses a two-stage tiled structure:

host splits the candidate axis into tiles
local kernel evaluates one tile for all assigned loops and writes partial best results
final kernel reduces all tile-local best results into one best result per loop
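
The two stages can be sketched on the host side as follows. This is a plain-Python illustration of the reduction structure only, assuming the per-candidate scores are already available as a loops × candidates table; in the operator the local kernel computes them during the rollout. The function names and the tie-breaking rule (lowest candidate index wins) are assumptions.

```python
def local_best(scores, loop, start, stop):
    """Stage 1: best (score, candidate index) for one loop inside one tile."""
    best_idx = start
    for c in range(start + 1, stop):
        if scores[loop][c] < scores[loop][best_idx]:
            best_idx = c
    return scores[loop][best_idx], best_idx


def tiled_best(scores, tile_size):
    """Split the candidate axis into tiles, then reduce the tile-local bests."""
    n_loops, n_cand = len(scores), len(scores[0])
    tiles = [(s, min(s + tile_size, n_cand)) for s in range(0, n_cand, tile_size)]
    # Stage 1: one "launch" per tile, each writing a partial best per loop.
    partial = [[local_best(scores, loop, s, e) for loop in range(n_loops)]
               for (s, e) in tiles]
    # Stage 2: reduce all tile-local bests into one best per loop.
    # Tuples compare by score first, then by candidate index.
    return [min(partial[t][loop] for t in range(len(tiles)))
            for loop in range(n_loops)]


# Usage: 2 loops, 8 candidates, tiles of 3 candidates.
table = [[3, 1, 4, 1, 5, 9, 2, 6],
         [5, 3, 5, 8, 9, 7, 9, 0.5]]
result = tiled_best(table, tile_size=3)
```

Because each tile only emits one (score, index) pair per loop, the final reduction touches loops × tiles values rather than loops × candidates.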

This structure was chosen because the earlier single-launch (loop, tile) task mapping showed unstable coverage on node202. The current host-per-tile launch plus conservative loop-range partitioning restores correctness.

Vectorization

The rollout time dimension is a serial recurrence (y[k+1] depends on y[k]) and cannot be turned into GEMM-style dense math without dropping the per-step nonlinearities (control clamp) and the nonlinear score functionals (IAE/ISE/overshoot/settling), so the kernel keeps the exact step-by-step recurrence.

The parallelism instead lives on the candidate axis: every timestep applies the same chain of vector ops to all candidates at once. Because the recurrence is serial, that chain of dependent vector ops cannot be pipelined across timesteps, so with a narrow lane the inner loop is bound by per-instruction issue/latency rather than by compute throughput. The kernel therefore evaluates the candidate axis with a wide SIMD lane (kLane=768): more candidates per vector instruction means fewer instructions for the same work, which amortises the fixed instruction latency and makes the loop throughput-bound. kLane=768 is the largest lane that keeps the 8 state vectors + scratch + the 32-slot delay ring (delay spec 0..31) + I/O queues within the 192 KB UB budget. Widening the lane is a pure layout change and leaves the output bit-identical.
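
The idea of a serial time loop with vector ops across candidates can be sketched in NumPy, where each array stands in for one lane-wide vector. This is not the NPU kernel: the function name, the zero initial state, float64 arithmetic, and tracking only IAE are simplifying assumptions. The 32-slot ring mirrors the delay range 0..31 described above.

```python
import numpy as np


def rollout_iae_vectorized(a, b, delay, kp, ki, kd,
                           sp=1.0, dt=0.1, steps=200, ring_slots=32):
    """Serial loop over time, vector ops over the candidate axis; returns IAE."""
    kp, ki, kd = (np.asarray(x, dtype=np.float64) for x in (kp, ki, kd))
    n = kp.shape[0]
    assert 0 <= delay < ring_slots
    y = np.zeros(n)
    integral = np.zeros(n)
    prev_e = np.zeros(n)
    iae = np.zeros(n)
    ring = np.zeros((ring_slots, n))      # delay ring: one control vector per slot
    for k in range(steps):                # serial: step k+1 needs y from step k
        e = sp - y
        integral += e * dt
        derivative = (e - prev_e) / dt
        u = np.clip(kp * e + ki * integral + kd * derivative, -10.0, 10.0)
        prev_e = e
        iae += np.abs(e) * dt
        ring[k % ring_slots] = u          # store u[k]
        u_delayed = ring[(k - delay) % ring_slots]  # u[k-delay], zero for k < delay
        y = a * y + b * u_delayed
    return iae


# Usage: three (Kp, Ki, Kd) candidates evaluated together for one loop.
iae = rollout_iae_vectorized(0.9, 0.1, 3,
                             kp=[0.0, 1.0, 2.0],
                             ki=[0.0, 0.5, 1.0],
                             kd=[0.0, 0.0, 0.05])
```

Each candidate occupies one array element and never interacts with its neighbours, which is why changing the lane width only changes layout and not the per-candidate result.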

Engineering Conclusion

This operator is valuable as:

an independent PID tuning operator sample
a correctness-verified NPU exploration artifact (NPU output bit-identical to the CPU reference)
a single-card rollout that, after the wide-lane optimization plus the fused inner loop, runs roughly 4-7x faster than the 64-thread CPU baseline at the candidate counts that dominate the tuning sweep

The inner loop was also reduced from ~37 to ~32 vector ops per timestep by reusing the response error as the next step's error and by folding the non-feedback metric accumulators (IAE/ISE/control energy) into fused multiply-accumulates; this is bit-identical to the original. The remaining single-card headroom is a cheaper settling reduction; multi-card data parallelism scales the absolute time further but is a hardware lever, not a single-card algorithmic speedup.

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.
