CANN/catlass Optimized Matrix Multiplication Example
2026/6/24 6:17:36

OptimizedMatmul Example Readme

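catlass is the operator template library of CANN. It provides template examples of high-performance matrix multiplication and related fused operators on the NPU. Project address: https://gitcode.com/cann/catlass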

Code Organization

├── 06_optimized_matmul
│   ├── CMakeLists.txt        # CMake build file
│   ├── README.md
│   └── optimized_matmul.cpp  # Main file

Function

This example demonstrates optimized matrix multiplication. Compared to the 00_basic_matmul example, this implementation replaces the dispatch policy with MmadAtlasA2Preload and introduces padding preprocessing for the input matrices to improve data transfer performance.
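To make the idea of padding preprocessing concrete, here is a minimal host-side sketch. It is not catlass code: `RoundUp`, `PadRowMajor`, and the alignment value are hypothetical names and parameters chosen for illustration, and the real example performs its padding on the device. The sketch only shows the general technique of copying a row-major matrix into a buffer whose row stride is rounded up to an aligned length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `value` up to the next multiple of `align` (align must be > 0).
constexpr std::size_t RoundUp(std::size_t value, std::size_t align)
{
    return (value + align - 1) / align * align;
}

// Hypothetical helper, not part of catlass: copy a row-major rows x cols
// matrix into a buffer whose row stride is padded to a multiple of `align`
// elements. The padding elements are value-initialized (zero).
template <typename T>
std::vector<T> PadRowMajor(const std::vector<T> &src, std::size_t rows,
                           std::size_t cols, std::size_t align)
{
    assert(align > 0 && src.size() == rows * cols);
    const std::size_t paddedCols = RoundUp(cols, align);
    std::vector<T> dst(rows * paddedCols, T{});
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[r * paddedCols + c] = src[r * cols + c];
        }
    }
    return dst;
}
```

With an aligned stride, every row starts at an aligned offset, which is the property that usually makes bulk transfers more regular. The alignment that actually pays off on a given NPU is hardware-specific and is not stated here.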

Example

  • After obtaining the code, compile the operator executable file. For details, see Template Library Quick Start.
  • Execute the operator.
    # Compile a specified test case.
    bash scripts/build.sh 06_optimized_matmul
    cd output/bin
    # Executable file name | Matrix M-axis | N-axis | K-axis | Device ID
    # The device ID is optional. The default value is 0.
    ./06_optimized_matmul 256 512 1024 0

If the following result is displayed, precision verification is successful.

Compare success.
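The precision check behind this message can be pictured as a comparison of the device output against a host-side reference. The sketch below is an assumption about the general approach, not the example's actual verification code: `ReferenceMatmul`, `CountMismatches`, and the tolerance values are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference row-major matmul: C[m x n] = A[m x k] * B[k x n], accumulated in float.
std::vector<float> ReferenceMatmul(const std::vector<float> &a, const std::vector<float> &b,
                                   std::size_t m, std::size_t n, std::size_t k)
{
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float av = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += av * b[p * n + j];
            }
        }
    }
    return c;
}

// Count elements where |actual - expected| exceeds atol + rtol * |expected|.
// A size mismatch is reported as every element mismatching.
std::size_t CountMismatches(const std::vector<float> &actual, const std::vector<float> &expected,
                            float rtol = 1e-3f, float atol = 1e-3f)
{
    if (actual.size() != expected.size()) {
        return std::max(actual.size(), expected.size());
    }
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (std::fabs(actual[i] - expected[i]) > atol + rtol * std::fabs(expected[i])) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In this picture, a verification step would report success when `CountMismatches` returns 0 for the device result and the reference result.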

Remarks

In this example, the default padding action uses PADDING_NZ. You can switch this to PADDING_BLOCK_ND to evaluate alternative performance profiles.

  • PADDING_NZ: The code configuration is as follows:

    constexpr PaddingTag paddingTagA = (std::is_same_v<LayoutA, layout::zN> || std::is_same_v<LayoutA, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : PaddingTag::PADDING_NZ;
    constexpr PaddingTag paddingTagB = (std::is_same_v<LayoutB, layout::zN> || std::is_same_v<LayoutB, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : PaddingTag::PADDING_NZ;

The COMPUTE_LENGTH allocated in the UB under the PADDING_NZ policy is 48 KB:

    static const uint32_t COMPUTE_LENGTH_A = 48 * 1024 / sizeof(ElementA);
    static const uint32_t COMPUTE_LENGTH_B = 48 * 1024 / sizeof(ElementB);
  • PADDING_BLOCK_ND: The modifications required to enable PADDING_BLOCK_ND are shown below. When the input matrix is not in NZ format, this policy aligns and pads the matrix according to L1TileShape:

     constexpr PaddingTag paddingTagA = (std::is_same_v<LayoutA, layout::zN> || std::is_same_v<LayoutA, layout::nZ>)
         ? PaddingTag::NO_PADDING
    -    : PaddingTag::PADDING_NZ;
    +    : PaddingTag::PADDING_BLOCK_ND;
     constexpr PaddingTag paddingTagB = (std::is_same_v<LayoutB, layout::zN> || std::is_same_v<LayoutB, layout::nZ>)
         ? PaddingTag::NO_PADDING
    -    : PaddingTag::PADDING_NZ;
    +    : PaddingTag::PADDING_BLOCK_ND;

Under the PADDING_BLOCK_ND policy, the COMPUTE_LENGTH allocated in the UB doubles to 96 KB:

    -static const uint32_t COMPUTE_LENGTH_A = 48 * 1024 / sizeof(ElementA);
    -static const uint32_t COMPUTE_LENGTH_B = 48 * 1024 / sizeof(ElementB);
    +static const uint32_t COMPUTE_LENGTH_A = 96 * 1024 / sizeof(ElementA);
    +static const uint32_t COMPUTE_LENGTH_B = 96 * 1024 / sizeof(ElementB);
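The two settings above can be checked in isolation with a small compile-time sketch (C++17). Everything in it is a stand-in: the `layout` tag structs and the `PaddingTag` enum only mimic the names used in the snippets, `SelectPaddingTag` and `ComputeLength` are hypothetical helpers, and `uint16_t` is used as an assumed 2-byte substitute for a half-precision element type.

```cpp
#include <cstdint>
#include <type_traits>

// Stand-ins for the library's types (assumptions for illustration only).
namespace layout {
struct RowMajor {};
struct zN {};
struct nZ {};
}  // namespace layout

enum class PaddingTag { NO_PADDING, PADDING_NZ, PADDING_BLOCK_ND };

// Same shape as the selection in the snippets: inputs already in an NZ-style
// layout skip padding, all other layouts use the `fallback` policy.
template <typename Layout>
constexpr PaddingTag SelectPaddingTag(PaddingTag fallback)
{
    return (std::is_same_v<Layout, layout::zN> || std::is_same_v<Layout, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : fallback;
}

// Number of elements that fit into a UB budget given in kilobytes.
template <typename Element>
constexpr uint32_t ComputeLength(uint32_t kilobytes)
{
    return static_cast<uint32_t>(kilobytes * 1024 / sizeof(Element));
}

// 2-byte elements: 48 KB holds 24576 elements, 96 KB holds 49152.
static_assert(ComputeLength<uint16_t>(48) == 24576);
static_assert(ComputeLength<uint16_t>(96) == 49152);
// 4-byte elements: the same budgets hold half as many elements.
static_assert(ComputeLength<float>(48) == 12288);
static_assert(ComputeLength<float>(96) == 24576);
```

Because both the tag and the length are compile-time constants, switching policies is a source edit followed by a rebuild rather than a runtime option.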


