Peng’s Tech Blog

The Shared Memory layout of Blackwell MMAv5 operands

2026-06-07T00:00:00+00:00

The 5th-generation Tensor Core on Blackwell GPUs requires MMA operand B to live in SMEM, and operand A to live in either SMEM or Tensor Memory (TMEM). MMAv3 (Hopper) also supports “SS_GEMM”, where A and B are both in SMEM, and its layout is almost the same as in v5.

In this note we analyse the SMEM layouts and examine how CUTLASS and Triton represent them. The non-swizzling cases need special handling and are omitted for simplicity. We also only cover swizzling with 16B atomicity.

Core Matrix vs Swizzle Atom

Core Matrix

“Core matrix” is a deprecated term that no longer appears in official documents. It was used to help define Leading Dimension Byte Offset (LBO) and Strided Dimension Byte Offset (SBO), two important parameters that describe the SMEM layout to the hardware so that the Tensor Core knows where to find its operands.

For example, in this Colfax tutorial,

Each core matrix has a strided direction and a contiguous direction, such that its length is 8 in the strided direction and 16 bytes in the contiguous direction. LBO (leading dimension byte offset): the distance, in bytes, between two adjacent core matrices in the K dimension. SBO (stride dimension byte offset): the distance, in bytes, between two adjacent core matrices in the M or N dimension.

This may be correct for the specific MMA instance in that blog, but it doesn’t cover MN-major cases.

As shown later in this note, the concept of core matrix is indeed no longer needed: the Swizzle Atom alone is enough to define SBO and LBO.

Swizzle Atom

As of Jun 2026, the official PTX documentation defines the SMEM layout with the concept of “Swizzle Atom”. A Swizzle Atom in s-byte swizzling mode is a matrix of 8 × s bytes, where 8 is the extent along the strided dimension (e.g. the M/N dim for K-major) and s bytes is the extent along the leading dim.

All the elements in a Swizzle Atom are stored compactly in one contiguous segment of physical SMEM. SMEM swizzling therefore never “exchanges” elements across two Swizzle Atoms; it only permutes elements inside one Swizzle Atom. Also note that the basic unit of swizzling is 128 bits, i.e. 16 bytes (in 16B atomicity mode). No “exchange” happens inside a single unit.

SBO is then defined as the byte offset between two adjacent Swizzle Atoms on strided dim.

LBO is defined as the byte offset between two adjacent Swizzle Atoms on leading dim. Note that for K-major, LBO is ignored: the PTX doc describes it as “not used, assumed to be 1”.
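As a rough sketch of these definitions, the shape and size of one Swizzle Atom follow directly from the swizzling byte width. The helper names below are illustrative only and are not taken from PTX, CUTLASS or Triton. The SBO value assumes Atoms are packed back to back along the strided dim, which is one possible placement rather than a requirement.

```python
def swizzle_atom_shape(swizzle_bytes: int, element_bits: int) -> tuple:
    """(strided extent, leading extent in elements) of one Swizzle Atom."""
    assert swizzle_bytes in (32, 64, 128)
    return 8, swizzle_bytes * 8 // element_bits


def swizzle_atom_bytes(swizzle_bytes: int) -> int:
    """Total bytes of one Swizzle Atom: 8 rows/columns of swizzle_bytes each."""
    return 8 * swizzle_bytes


def packed_sbo_bytes(swizzle_bytes: int) -> int:
    # ASSUMPTION: Atoms adjacent on the strided dim are stored back to back,
    # so the byte offset between them is exactly one Atom.
    return swizzle_atom_bytes(swizzle_bytes)


print(swizzle_atom_shape(128, 16))  # 128B swizzling, 16-bit elements
print(packed_sbo_bytes(128))
```

For 128B swizzling with 16-bit elements this gives an 8 x 64 element Atom of 1024 bytes.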

Triton

The Triton compiler has a function called getCoreMatrixLinearLayout. Despite its name, it actually returns the Linear Layout tile of a Swizzle Atom. Since Linear Layout incorporates swizzling, the output of this function already encodes the full swizzled layout of such an Atom, e.g.

Full tensor shape: 128, 256
Layout encoding: #ttg.nvmma_shared<{swizzlingByteWidth = 128, transposed = false, elementBitWidth = 16}>
getCoreMatrixLinearLayout output: 
 - offset=1 -> (0, 1)
   offset=2 -> (0, 2)
   offset=4 -> (0, 4)
   offset=8 -> (0, 8)
   offset=16 -> (0, 16)
   offset=32 -> (0, 32)
   offset=64 -> (1, 8)
   offset=128 -> (2, 16)
   offset=256 -> (4, 32)
where out dims are: [dim0 (size 8), dim1 (size 64)]
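The basis vectors above can be reproduced with a small sketch. It assumes the 128B mode corresponds to a CuTe-style Swizzle<3, 4, 3> applied to byte offsets (bits 7..9 XORed into bits 4..6) and 2-byte elements. The function names are hypothetical and this is not how Triton computes the layout internally.

```python
def swizzle(byte_offset: int, b: int = 3, m: int = 4, s: int = 3) -> int:
    """CuTe-style Swizzle<b, m, s>: XOR bits [m+s, m+s+b) into bits [m, m+b)."""
    mask = ((1 << b) - 1) << (m + s)
    return byte_offset ^ ((byte_offset & mask) >> s)


def physical_to_logical(elem_offset: int, elem_bytes: int = 2, row_bytes: int = 128):
    """Map a physical element offset inside one Atom to logical (dim0, dim1)."""
    # The XOR leaves the source bits untouched, so applying it twice is the
    # identity; the same function therefore maps physical -> logical.
    logical = swizzle(elem_offset * elem_bytes)
    return logical // row_bytes, (logical % row_bytes) // elem_bytes


for k in range(9):
    offset = 1 << k
    print(f"offset={offset} -> {physical_to_logical(offset)}")
```

Offsets below 64 stay in row 0 unchanged, while offsets 64, 128 and 256 land in rows 1, 2 and 4 with their 16B chunk index XORed by the row number.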

SMEM Layouts for an MMA instruction

The PTX documentation gives canonical CuTe layouts for the SMEM tensor of a single MMA instruction:

Majorness	Swizzling mode	Canonical Layout without swizzling	Swizzling on the previous column
MN-major	No-swizzling or Interleaved	((T,1,m),(8,k)):((1,T,SBO),(1T,LBO))	Swizzle<0, 4, 3>
	32B Swizzling	((T,2,m),(8,k)):((1,T,LBO),(2T,SBO))	Swizzle<1, 4, 3>
	64B Swizzling	((T,4,m),(8,k)):((1,T,LBO),(4T,SBO))	Swizzle<2, 4, 3>
	128B Swizzling	((T,8,m),(8,k)):((1,T,LBO),(8T,SBO))	Swizzle<3, 4, 3>
K-major*	No-swizzling or Interleaved	((8,m),(T,2k)):((1T,SBO),(1,LBO))	Swizzle<0, 4, 3>
	32B Swizzling	((8,m),(T,2k)):((2T,SBO),(1,T))	Swizzle<1, 4, 3>
	64B Swizzling	((8,m),(T,2k)):((4T,SBO),(1,T))	Swizzle<2, 4, 3>
	128B Swizzling	((8,m),(T,2k)):((8T,SBO),(1,T))	Swizzle<3, 4, 3>

T = 128 / sizeof-elements-in-bits
T represents the scale factor which normalizes matrix element types to 128 bits.
m represents the number of repeating patterns across rows.
k represents the number of repeating patterns across columns.

* As shown later in this note, the factor k in K-major layout is in fact not needed and should be dropped.

MN major

(The figure is drawn as M/N x K following CUTLASS convention)

The Canonical CuTe layout in PTX documentation is consistent with CUTLASS, and is shown in the figure.

Inside a Swizzle Atom, there are 8 columns adjacent to each other in physical memory. Each column is a contiguous segment of physical memory whose size equals the swizzling byte width.

It’s up to the user how to distribute the Swizzle Atoms (along the MN or K dim first). As long as LBO and SBO are provided, the hardware knows where in physical memory to look for the desired Atoms. Both LBO and SBO are needed because the hardware has to locate Swizzle Atoms y, z and others in physical memory.

From the getCoreMatrixLinearLayout function above, Triton always distributes the Atoms along strided dim first up to a TMA block.

K major

The Canonical CuTe layout in PTX documentation differs from CUTLASS in that CUTLASS drops the factor k. We adopt CUTLASS’s layouts as the source of truth, with confirmation from Nvidia.

Inside a Swizzle Atom, there are 8 rows adjacent to each other in physical memory. Each row is a contiguous segment of physical memory whose size equals the swizzling byte width.

It’s up to the user how to distribute the Swizzle Atoms (along the MN or K dim first). For 64B/128B swizzling, the SMEM “unit” of one MMA instruction is smaller than a Swizzle Atom, so each row in a Swizzle Atom contains elements from 2/4 different MMA instructions. This is OK because swizzling deterministically tells the hardware the exact location of each element. For example, for 128B swizzling this table shows where in physical SMEM to find the 8 × 2 T units (16B each) of the first MMA instruction:

Physical Offset	16B	16B	16B	16B	16B	16B	16B	16B
0 ~ 127B	0	1
128 ~ 255B	3	2
256 ~ 383B			4	5
…			7	6
					8	9
					11	10
							12	13
							15	14
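The table can be regenerated with a short sketch. It assumes that in 128B mode the 16B chunk index inside each 128B row is XORed with the row index, and that the T units of the first MMA instruction are numbered row by row (unit 2r and 2r+1 in row r), as the table suggests. The function names are hypothetical.

```python
def physical_chunk(row: int, logical_chunk: int) -> int:
    """128B swizzling: XOR the 16B chunk index with the row index (mod 8)."""
    return logical_chunk ^ (row % 8)


def first_mma_unit_table(rows: int = 8, chunks_per_row: int = 8, units_per_row: int = 2):
    """table[r][c] is the T unit stored in physical 16B chunk c of row r, or None."""
    table = [[None] * chunks_per_row for _ in range(rows)]
    for r in range(rows):
        for c in range(units_per_row):
            # ASSUMPTION: units are numbered row-major within the first MMA tile.
            table[r][physical_chunk(r, c)] = r * units_per_row + c
    return table


for r, row in enumerate(first_mma_unit_table()):
    cells = ["." if unit is None else str(unit) for unit in row]
    print(f"{r * 128:4d}B  " + " ".join(f"{cell:>2}" for cell in cells))
```

Each printed row is one 128B segment of physical SMEM; a dot marks a 16B chunk that belongs to a different MMA instruction.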

Each MMA instruction’s SMEM operand spans multiple Swizzle Atoms along the MN dim, but only one Atom (or part of one) along the K dim. So given the location of Atom x, the hardware needs SBO to know where to load Atom z and others along the MN dim. However, LBO is not needed (except in non-swizzling cases, not shown in the figure) because, for example, Atom y is not needed by the MMA instruction in the figure.

From the getCoreMatrixLinearLayout function above, Triton always distributes the Atoms along strided dim first up to a TMA block.

References

PTX ISA Documentation v9.3
Colfax CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA® Hopper™ GPUs
CUTLASS source code for SM100 UMMA descriptors
Triton source code for NVMMAShared encoding to Linear Layout conversion

Special thanks to Bingyi Zhang from Nvidia! Many of the details and nuances in this note came from extensive discussions and collaborations with Bingyi.

Deriving formula for Backward Gradient of Matrix Multiplication

2026-04-12T00:00:00+00:00

Reference: https://cs231n.stanford.edu/handouts/derivatives.pdf

In this short note, we derive the backward gradient of a matrix multiplication with minimal math background.

Forward Pass

\[Y = WX\]

Matrix shapes:

\[Y \in \mathbb{R}^{M \times K}\] \[W \in \mathbb{R}^{M \times N}\] \[X \in \mathbb{R}^{N \times K}\]

Meaning of dY

In backward prop, we are given an input with known values and the same shape as $Y$:

\[dY = \frac{\partial L}{\partial Y}\]

where $L$ is a scalar loss.

Thus:

\[dY[i][j] = \frac{\partial L}{\partial Y[i][j]}\]

Intuitively, $dY[i][j]$ measures how much the loss $L$ changes if $Y[i][j]$ increases slightly.

\[dY[i][j] \approx \frac{L(Y[i][j] + h) - L(Y[i][j])}{h}\]

for a very small $h$.

Goal

We want to compute

\[dX = \frac{\partial L}{\partial X}\]

Each element

\[dX[i][j] = \frac{\partial L}{\partial X[i][j]}\]

represents how much the loss changes if $X[i][j]$ increases.

Chain Rule

To compute a single element $dX[i][j]$ in $dX$, we apply the chain rule through $Y$:

\[dX[i][j]= \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial X[i][j]} = \sum_{a=0}^{M-1} \sum_{b=0}^{K-1} \frac{\partial L}{\partial Y[a][b]} \cdot \frac{\partial Y[a][b]}{\partial X[i][j]}\]

Note both $\frac{\partial L}{\partial Y}$ and $\frac{\partial Y}{\partial X[i][j]}$ have the same shape as $Y$, and $dX[i][j]$ is the sum of their pointwise product.

Also note that the element at index $[a][b]$ in $\frac{\partial Y}{\partial X[i][j]}$ is just $\frac{\partial Y[a][b]}{\partial X[i][j]}$.

Using the shorthand $dY[a][b] = \frac{\partial L}{\partial Y[a][b]}$:

\[dX[i][j]= \sum_{a,b} dY[a][b] \cdot \frac{\partial Y[a][b]}{\partial X[i][j]}\]

Intuitively:

$\frac{\partial Y[a][b]}{\partial X[i][j]}$ measures how $X[i][j]$ affects $Y[a][b]$
$dY[a][b]$ measures how $Y[a][b]$ affects the loss $L$

Multiplying them gives the influence path

\[X[i][j] \rightarrow Y[a][b] \rightarrow L\]

Summing over all $a,b$ aggregates all such paths, giving the total influence of $X[i][j]$ on $L$.

Expand the Forward Definition

From matrix multiplication:

\[Y[a][b] = \sum_{c=0}^{N-1} W[a][c] \cdot X[c][b]\]

That is:

\[Y[a][b]= W[a][0]X[0][b] + W[a][1]X[1][b] + \dots + W[a][N-1]X[N-1][b]\]

Dependency Observation

If $b \ne j$, then $Y[a][b]$ does not depend on $X[i][j]$. It only depends on row $a$ of $W$ and column $b$ of $X$.

Therefore:

\[\frac{\partial Y[a][b]}{\partial X[i][j]} = 0\]

Thus only terms where $b = j$ contribute.

The formula simplifies from

\[dX[i][j]= \sum_{a,b} dY[a][b] \cdot \frac{\partial Y[a][b]}{\partial X[i][j]}\]

to

\[dX[i][j]= \sum_{a=0}^{M-1} dY[a][j] \cdot \frac{\partial Y[a][j]}{\partial X[i][j]}\]

Compute the Partial Derivative

Consider

\[Y[a][j]= W[a][0]X[0][j] + W[a][1]X[1][j] + \dots + W[a][i]X[i][j] + \dots\]

The only term involving $X[i][j]$ is

\[W[a][i]X[i][j]\]

Therefore

\[\frac{\partial Y[a][j]}{\partial X[i][j]} = W[a][i]\]

Then

\[dX[i][j]= \sum_{a=0}^{M-1} dY[a][j] \cdot \frac{\partial Y[a][j]}{\partial X[i][j]}\]

becomes

\[dX[i][j]= \sum_{a=0}^{M-1} dY[a][j] \cdot W[a][i]\]

Then swap the two factors:

\[dX[i][j]= \sum_{a=0}^{M-1} W[a][i] \cdot dY[a][j]\]

Recognizing the Matrix Form

Notice

\[W^T[i][a] = W[a][i]\]

Thus

\[dX[i][j]= \sum_{a=0}^{M-1} W^T[i][a] \cdot dY[a][j]\]

This is exactly the definition of matrix multiplication

\[dX = W^T dY\]
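To make the step from the elementwise sum to the matrix form concrete, here is a minimal NumPy sketch. The function name `dx_elementwise` and the random test shapes are illustrative choices, not part of the derivation. It evaluates the sum over $a$ with explicit loops so it can be compared with `W.T @ dY`.

```python
import numpy as np


def dx_elementwise(W: np.ndarray, dY: np.ndarray) -> np.ndarray:
    """dX[i][j] = sum_a W[a][i] * dY[a][j], written out with explicit loops."""
    M, N = W.shape
    K = dY.shape[1]
    dX = np.zeros((N, K))
    for i in range(N):
        for j in range(K):
            for a in range(M):
                dX[i, j] += W[a, i] * dY[a, j]
    return dX


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # M x N
dY = rng.standard_normal((3, 5))  # M x K
dX_loops = dx_elementwise(W, dY)
dX_matmul = W.T @ dY
print(dX_loops.shape, np.allclose(dX_loops, dX_matmul))
```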

Final Result

For the forward operation

\[Y = WX\]

the backward gradients are

\[\frac{\partial L}{\partial X} = W^T dY\]

A very similar process also proves

\[\frac{\partial L}{\partial W} = dY X^T\]
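Both results can be checked numerically. The sketch below assumes a hypothetical scalar loss $L = \sum_{a,b} Y[a][b] \cdot G[a][b]$ with a fixed matrix $G$, chosen only because it makes $dY = G$ exactly. The helper `numerical_grad` is an illustrative central-difference routine.

```python
import numpy as np


def loss(W, X, G):
    # Hypothetical loss chosen so that dL/dY equals G exactly.
    return np.sum((W @ X) * G)


def numerical_grad(f, A, h=1e-6):
    """Central-difference gradient of the scalar f() with respect to array A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + h
        plus = f()
        A[idx] = old - h
        minus = f()
        A[idx] = old  # restore the original entry
        grad[idx] = (plus - minus) / (2 * h)
    return grad


rng = np.random.default_rng(0)
M, N, K = 3, 4, 5
W = rng.standard_normal((M, N))
X = rng.standard_normal((N, K))
dY = rng.standard_normal((M, K))

dX = W.T @ dY   # formula derived above
dW = dY @ X.T   # companion formula
dX_num = numerical_grad(lambda: loss(W, X, dY), X)
dW_num = numerical_grad(lambda: loss(W, X, dY), W)
print(np.allclose(dX, dX_num, atol=1e-6), np.allclose(dW, dW_num, atol=1e-6))
```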

How to remember this: for $Y = WX$ or $Y = XW$, to compute $dX$ given $dY$, swap the positions of $X$ (now $dX$) and $Y$ (now $dY$), then transpose $W$ without changing its location in the equation. For example, $Y = XW$ gives $dX = dY\,W^T$.

Deriving formula for Backward Gradient of Softmax

2026-04-12T00:00:00+00:00

In this short note, we derive the backward gradient of a softmax calculation (in Flash Attention) with minimal math background.

Forward Pass

Given a vector of attention scores $S \in \mathbb{R}^{N}$, softmax produces:

\[P[i] = \frac{e^{S[i]}}{\sum_{k=0}^{N-1} e^{S[k]}}\]

Note $P[i] > 0$ and $\sum_i P[i] = 1$.

In flash attention, softmax is applied independently to each row of the attention score matrix $S = QK^T$. Everything below applies per row.

Meaning of dP

In backward prop, we are given an input with known values and the same shape as $P$:

\[dP = \frac{\partial L}{\partial P}\]

where $L$ is a scalar loss. Each element:

\[dP[i] = \frac{\partial L}{\partial P[i]}\]

measures how much the loss $L$ changes if $P[i]$ increases slightly.

Goal

We want to compute

\[dS = \frac{\partial L}{\partial S}\]

Each element

\[dS[i] = \frac{\partial L}{\partial S[i]}\]

represents how much the loss changes if $S[i]$ increases.

Chain Rule

To compute a single element $dS[i]$, we apply the chain rule through $P$:

\[dS[i] = \frac{\partial L}{\partial P} \cdot \frac{\partial P}{\partial S[i]} = \sum_{j=0}^{N-1} \frac{\partial L}{\partial P[j]} \cdot \frac{\partial P[j]}{\partial S[i]} = \sum_{j=0}^{N-1} dP[j] \cdot \frac{\partial P[j]}{\partial S[i]}\]

Note both $\frac{\partial L}{\partial P}$ and $\frac{\partial P}{\partial S[i]}$ have the same shape as $P$, and $dS[i]$ is the sum of their pointwise product.

Also note that the element at index $[j]$ in $\frac{\partial P}{\partial S[i]}$ is just $\frac{\partial P[j]}{\partial S[i]}$.

Note: unlike matmul, every $P[j]$ depends on $S[i]$ because $S[i]$ appears in the denominator $\sum_k e^{S[k]}$. So no terms drop out.

Compute the Partial Derivatives

We need $\frac{\partial P[j]}{\partial S[i]}$ for two cases.

Case 1: $j = i$

\[P[i] = \frac{e^{S[i]}}{\sum_k e^{S[k]}}\]

Using the quotient rule where the numerator is $e^{S[i]}$ and the denominator is $\sum_k e^{S[k]}$:

\[\frac{\partial P[i]}{\partial S[i]} = \frac{e^{S[i]} \cdot \sum_k e^{S[k]} - e^{S[i]} \cdot e^{S[i]}}{\left(\sum_k e^{S[k]}\right)^2}\] \[= \frac{e^{S[i]}}{\sum_k e^{S[k]}} - \frac{e^{S[i]}}{\sum_k e^{S[k]}} \cdot \frac{e^{S[i]}}{\sum_k e^{S[k]}}\] \[= P[i] - P[i]^2 = P[i](1 - P[i])\]

Case 2: $j \ne i$

\[P[j] = \frac{e^{S[j]}}{\sum_k e^{S[k]}}\]

Here the numerator $e^{S[j]}$ does not depend on $S[i]$, so using the quotient rule:

\[\frac{\partial P[j]}{\partial S[i]} = \frac{0 - e^{S[j]} \cdot e^{S[i]}}{\left(\sum_k e^{S[k]}\right)^2} = -P[j] \cdot P[i]\]
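The two cases combine into a single matrix with entries $P[j](\delta_{ij} - P[i])$. The sketch below, with illustrative function names, builds that matrix and compares it with central finite differences. Subtracting the maximum inside `softmax` is only a numerical-stability choice and does not change the result.

```python
import numpy as np


def softmax(S: np.ndarray) -> np.ndarray:
    e = np.exp(S - np.max(S))  # shift for numerical stability only
    return e / e.sum()


def softmax_jacobian(P: np.ndarray) -> np.ndarray:
    """J[j, i] = dP[j]/dS[i]: P[i](1 - P[i]) on the diagonal, -P[j]P[i] elsewhere."""
    return np.diag(P) - np.outer(P, P)


rng = np.random.default_rng(0)
n = 6
S = rng.standard_normal(n)
P = softmax(S)
J = softmax_jacobian(P)

# Central finite differences, one input coordinate S[i] at a time.
h = 1e-6
J_num = np.zeros((n, n))
for i in range(n):
    step = np.zeros(n)
    step[i] = h
    J_num[:, i] = (softmax(S + step) - softmax(S - step)) / (2 * h)

print(np.max(np.abs(J - J_num)))
```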

Substitute Back

Split the chain rule sum into the $j = i$ term and the $j \ne i$ terms:

\[dS[i] =dP[i] \cdot P[i](1 - P[i])+\sum_{j \ne i}dP[j] \cdot (-P[j] \cdot P[i])\]

Factor out $P[i]$:

\[dS[i] = P[i] \left( dP[i] \cdot (1 - P[i]) - \sum_{j \ne i} dP[j] \cdot P[j] \right)\]

Expand the first term:

\[dS[i] = P[i] \left( dP[i] - dP[i] \cdot P[i]- \sum_{j \ne i} dP[j] \cdot P[j] \right)\]

Notice that $dP[i] \cdot P[i] + \sum_{j \ne i} dP[j] \cdot P[j] = \sum_{j} dP[j] \cdot P[j]$, so:

\[dS[i] = P[i] \left( dP[i] - \sum_{j} dP[j] \cdot P[j] \right)\]

Define the dot product as a single scalar:

\[D = \sum_{j} dP[j] \cdot P[j] = dP \cdot P\]

So:

\[dS[i] = P[i] \cdot (dP[i] - D)\]

Note that $D$ is the same scalar for every $i$.

Final Result

For the forward operation

\[P = \text{softmax}(S)\]

the backward gradient is

\[dS[i] = P[i] \cdot (dP[i] - D)\]

where

\[D = \sum_{j} dP[j] \cdot P[j]\]

Or in vector form:

\[dS = P \odot (dP - D)\]

where $\odot$ is elementwise multiplication and $D$ is a scalar (per row).
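As a quick check of the vector form, the sketch below compares $P \odot (dP - D)$ against the full chain-rule sum $\sum_j dP[j] \cdot \frac{\partial P[j]}{\partial S[i]}$, evaluated through the explicit Jacobian. The function names and random inputs are illustrative.

```python
import numpy as np


def softmax(S: np.ndarray) -> np.ndarray:
    e = np.exp(S - np.max(S))  # shift for numerical stability only
    return e / e.sum()


def softmax_backward(P: np.ndarray, dP: np.ndarray) -> np.ndarray:
    """dS = P * (dP - D), where D = sum_j dP[j] * P[j] is a single scalar."""
    D = np.dot(dP, P)
    return P * (dP - D)


rng = np.random.default_rng(0)
S = rng.standard_normal(6)
dP = rng.standard_normal(6)
P = softmax(S)

# Reference: dS[i] = sum_j dP[j] * J[j, i] with J[j, i] = P[j](delta_ij - P[i]).
J = np.diag(P) - np.outer(P, P)
dS_ref = J.T @ dP
dS = softmax_backward(P, dP)
print(np.allclose(dS, dS_ref))
```

The closed form needs only one dot product and one elementwise multiply, whereas the reference builds an $N \times N$ Jacobian.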

Why This Matters for Flash Attention

In flash attention, softmax is applied row-wise to the attention score matrix $S = QK^T$, and the backward pass must be computed without materializing the full attention matrix in HBM.

The formula $dS = P \odot (dP - D)$ is perfectly suited for this because:

$D$ is just a scalar per row. By definition:

\[D[i] = \sum_j dP[i][j] \cdot P[i][j]\]

This looks like it needs both $dP$ and $P$, which are full-sized attention matrices we want to avoid materializing. But recall the forward output of attention is $O = PV$, and its gradient is $dO$. We have $dP = dO \cdot V^T$, so:

\[D[i] = \sum_j dP[i][j] \cdot P[i][j] = \sum_j \left(\sum_l dO[i][l] \cdot V[j][l]\right) \cdot P[i][j]\]

Swapping the order of summation:

\[D[i] = \sum_l dO[i][l] \sum_j P[i][j] \cdot V[j][l] = \sum_l dO[i][l] \cdot O[i][l]\]

since $\sum_j P[i][j] \cdot V[j][l] = O[i][l]$ by the forward definition $O = PV$. So:

\[D[i] = \sum_l dO[i][l] \cdot O[i][l]\]

This is just a row-wise dot product of $dO$ and $O$ — both of which are already available in HBM from the forward pass and the incoming gradient. No need to recompute $P$ or $dP$ for this step.
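The identity can be verified on small random matrices. In this sketch the shapes and names (`n_q`, `n_k`, `d`, `row_softmax`) are illustrative, and the full $P$ and $dP$ are materialized only to provide a reference value.

```python
import numpy as np


def row_softmax(S: np.ndarray) -> np.ndarray:
    e = np.exp(S - S.max(axis=1, keepdims=True))  # shift for stability only
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_q, n_k, d = 4, 6, 8
Q = rng.standard_normal((n_q, d))
K = rng.standard_normal((n_k, d))
V = rng.standard_normal((n_k, d))
dO = rng.standard_normal((n_q, d))

P = row_softmax(Q @ K.T)  # full attention matrix, for reference only
O = P @ V
dP = dO @ V.T

D_from_P = np.sum(dP * P, axis=1)  # definition: needs full dP and P
D_from_O = np.sum(dO * O, axis=1)  # shortcut: row-wise dot product of dO and O
print(np.allclose(D_from_P, D_from_O))
```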

Recomputing $P$ per tile using the saved row sum. During the forward pass, we never have a full row of true $P$ values at once, since each tile only sees a partial denominator. But the forward pass saves the sum of exponentials per row (written $L[i]$ here, not to be confused with the scalar loss $L$):

\[L[i] = \sum_{k} e^{S[i][k]}\]

In the backward pass, for a tile covering column block $j$, we recompute the local attention scores $S[i][j] = Q[i] \cdot K[j]^T$ and recover the true softmax values for that tile:

\[P[i][j] = \frac{e^{S[i][j]}}{L[i]}\]

This is exactly the softmax definition. The key insight is that $L[i]$ encodes the full-row denominator, so any tile can produce its correct $P$ values independently.

Each tile is independent. With $D$ precomputed and $P$ recoverable per tile as described above, we form $dP$ from $dO$ and $V^T$ for that tile, and apply $P \odot (dP - D)$. The subtraction of $D$ is the only term that couples different columns within a row, and since $D$ is already a known scalar, each tile can be processed independently.
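Putting the three points together, a minimal NumPy sketch of the tiled computation might look like the following. It is not a flash attention kernel: the tile size, shapes and variable names are illustrative, the exponentials are left unshifted to mirror the formulas above, and the full matrices are built once only to provide a reference for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_k, d, tile = 4, 8, 5, 2
Q = rng.standard_normal((n_q, d))
K = rng.standard_normal((n_k, d))
V = rng.standard_normal((n_k, d))
dO = rng.standard_normal((n_q, d))

# Forward pass (reference): keep only O and the per-row sum of exponentials.
S = Q @ K.T
row_sum = np.exp(S).sum(axis=1)  # the saved row sum, L[i] in the text
P = np.exp(S) / row_sum[:, None]
O = P @ V

# D as a row-wise dot product of dO and O.
D = np.sum(dO * O, axis=1)

# Reference dS using the full matrices.
dS_ref = P * (dO @ V.T - D[:, None])

# Tiled backward: each column block only needs Q, its own K/V rows, dO,
# the saved row sums and D.
dS_tiled = np.empty_like(S)
for start in range(0, n_k, tile):
    cols = slice(start, start + tile)
    S_tile = Q @ K[cols].T                       # recompute local scores
    P_tile = np.exp(S_tile) / row_sum[:, None]   # true softmax values
    dP_tile = dO @ V[cols].T                     # dP for this tile only
    dS_tiled[:, cols] = P_tile * (dP_tile - D[:, None])

print(np.allclose(dS_tiled, dS_ref))
```

No tile reads another tile's $P$ or $dP$; the only shared per-row quantities are the saved row sums and $D$.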