vllm.model_executor.layers.quantization.utils.nvfp4_utils ¶
pad_nvfp4_activation_for_cutlass ¶
Pad packed FP4 activations to match the K-dimension padding applied to weights. Because two FP4 values are packed into each byte, the padding is expressed in bytes (the tensor's storage dimension), not in FP4 elements.
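The byte-versus-element distinction can be sketched with plain round-up arithmetic. This is an illustrative sketch, not the vLLM implementation: the 32-element alignment is taken from the weight-padding note below, and the helper name is hypothetical.

```python
def round_up(x: int, multiple: int) -> int:
    # Smallest multiple of `multiple` that is >= x.
    return ((x + multiple - 1) // multiple) * multiple

# Two FP4 elements are packed per byte, so a 32-element alignment
# along K corresponds to a 16-byte alignment of the packed tensor.
K_ALIGN_ELEMS = 32                      # assumed, mirrors the weight padding
K_ALIGN_BYTES = K_ALIGN_ELEMS // 2      # = 16

k_bytes = 100                           # packed K extent in bytes (200 FP4 elems)
pad_bytes = round_up(k_bytes, K_ALIGN_BYTES) - k_bytes
print(pad_bytes)                        # -> 12 bytes of zero padding along K
```

Padding in bytes rather than elements keeps the packed layout intact: appending 12 zero bytes appends 24 zero FP4 elements without splitting any byte.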
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_utils.py
pad_nvfp4_weight_for_cutlass ¶
Pad packed NVFP4 weights so that both N (rows) and K (columns) satisfy the alignment constraints required by CUTLASS / FlashInfer FP4 kernels.
The CUTLASS FP4 kernel requires both the K and N matrix dimensions to be divisible by 32 for aligned memory access and efficient tensor core operations.
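The resulting padded shape can be computed as follows. A minimal sketch of the shape arithmetic only; the function name is hypothetical and the real helper operates on the packed weight tensor itself.

```python
def padded_weight_shape(n: int, k: int, align: int = 32) -> tuple[int, int]:
    # Round both N (rows) and K (columns) up to the kernel's
    # 32-element alignment; already-aligned dims are unchanged.
    pad_n = (-n) % align
    pad_k = (-k) % align
    return n + pad_n, k + pad_k

print(padded_weight_shape(4096, 14336))   # aligned -> (4096, 14336)
print(padded_weight_shape(4097, 14336))   # N rounded up -> (4128, 14336)
```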
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_utils.py
slice_nvfp4_output ¶
Slice the output tensor to remove the padding in the N dimension if the weight was padded.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_utils.py
swizzle_blockscale ¶
Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.
Parameters¶
scale: torch.Tensor
Returns¶
torch.Tensor
The swizzled tensor with the same logical shape as scale.
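The pad-and-interleave idea can be shown on a toy scale matrix. This sketch uses illustrative 2x2 tiles, not the kernel's actual tile geometry, and flattens the result to make the interleaved traversal order visible; the real helper preserves the logical shape as noted above.

```python
def pad_and_block_interleave(scale, tile_m: int = 2, tile_k: int = 2):
    # Zero-pad the 2D scale matrix to tile multiples, then emit values
    # one tile at a time (row-major within each tile) -- a toy block
    # interleave; real kernels use a hardware-specific tile layout.
    m, k = len(scale), len(scale[0])
    pad_m, pad_k = (-m) % tile_m, (-k) % tile_k
    padded = [row + [0.0] * pad_k for row in scale]
    padded += [[0.0] * (k + pad_k) for _ in range(pad_m)]
    out = []
    for bi in range(0, m + pad_m, tile_m):
        for bj in range(0, k + pad_k, tile_k):
            for i in range(tile_m):
                for j in range(tile_k):
                    out.append(padded[bi + i][bj + j])
    return out

# K padded from 3 to 4; values are emitted tile-by-tile.
print(pad_and_block_interleave([[1, 2, 3], [4, 5, 6]]))
# -> [1, 2, 4, 5, 3, 0, 6, 0]
```

The interleaving groups each tile's scales contiguously in memory, which is what lets the kernel load the scales for one tile of the GEMM with a single aligned access.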