vllm.model_executor.layers.quantization.utils.mxfp8_utils ¶
_mxfp8_e4m3_quantize_torch ¶
_mxfp8_e4m3_quantize_torch(
x: Tensor, is_sf_swizzled_layout: bool = False
) -> tuple[Tensor, Tensor]
Naive MXFP8 quantization. For each block of 32 elements along the last dimension, compute a shared e8m0 scale (the biased exponent of the block-wise amax) and quantize each element to float8_e4m3fn.
Returns (quantized_values, scales): the quantized values have the same shape as x in float8_e4m3fn, and the scales are uint8 e8m0 exponents. The scale shape depends on is_sf_swizzled_layout:

- False -> [..., K//32] (row-major 2D)
- True -> flat swizzled 1D
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
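A minimal NumPy sketch of the block-wise scale computation described above. This is illustrative, not the vLLM kernel: the helper name is hypothetical, `E4M3_MAX_EXP = 8` reflects the largest e4m3 value (448 ≈ 1.75 · 2^8), and a clip stands in for the real float8_e4m3fn cast.

```python
import numpy as np

E4M3_MAX_EXP = 8   # exponent of the largest e4m3 value (448)
E8M0_BIAS = 127    # bias used when storing the scale as an e8m0 byte

def mxfp8_quantize_block(block: np.ndarray):
    """Hypothetical helper: quantize one 32-element block.

    Computes the shared power-of-two scale from the block amax, stores its
    biased exponent as uint8 (e8m0), and rescales the elements so the amax
    lands in the e4m3 range.
    """
    amax = float(np.max(np.abs(block)))
    # Biased exponent of the block amax; degenerate all-zero block -> min scale.
    exp = int(np.floor(np.log2(amax))) if amax > 0 else -E8M0_BIAS
    scale_exp = exp - E4M3_MAX_EXP
    scale_e8m0 = np.uint8(np.clip(scale_exp + E8M0_BIAS, 0, 255))
    scaled = block / (2.0 ** scale_exp)
    # The real code casts `scaled` to float8_e4m3fn; we only clip to its range.
    quant = np.clip(scaled, -448.0, 448.0)
    return quant, scale_e8m0
```

For example, a block whose amax is exactly 448 gets scale exponent 0 (stored as 127) and passes through unscaled, while a block of small values is scaled up toward the e4m3 range.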
dequant_mxfp8_to_bf16 ¶
Dequantize an MXFP8 tensor (float8_e4m3fn values plus uint8 e8m0 scales) to BF16.
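Dequantization inverts the scaling: each fp8 value in a block is multiplied by 2^(e8m0 − 127). An illustrative NumPy sketch (the function name is hypothetical; vLLM casts the result to bfloat16, which NumPy lacks, so float32 stands in here):

```python
import numpy as np

E8M0_BIAS = 127  # e8m0 exponent bias

def mxfp8_dequant_block(quant: np.ndarray, scale_e8m0) -> np.ndarray:
    """Hypothetical helper: dequantize one 32-element block.

    Recovers the shared power-of-two scale from the stored e8m0 byte and
    multiplies it back in. The real code would cast to bfloat16.
    """
    scale = 2.0 ** (int(scale_e8m0) - E8M0_BIAS)
    return quant.astype(np.float32) * scale
```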
mxfp8_e4m3_quantize_fake ¶
Fake (meta) implementation used during torch.compile tracing: produces outputs with the correct shapes and dtypes without computing values.
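A fake implementation only needs to report output metadata so the compiler can trace shapes. A NumPy stand-in sketching that shape logic (the real op would allocate torch tensors with dtypes float8_e4m3fn and uint8; uint8 stands in for fp8 here, and the non-swizzled [..., K//32] scale shape is assumed):

```python
import numpy as np

def mxfp8_quantize_fake(x: np.ndarray):
    """Shape-only stand-in for a torch.compile fake kernel.

    Allocates uninitialized outputs matching the real op's metadata:
    quantized values with x's shape, and one scale per 32-element block
    along the last dimension.
    """
    q = np.empty_like(x, dtype=np.uint8)  # stands in for float8_e4m3fn
    scale_shape = x.shape[:-1] + (x.shape[-1] // 32,)
    scales = np.empty(scale_shape, dtype=np.uint8)
    return q, scales
```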
swizzle_mxfp8_scale ¶
Swizzle MXFP8 scales from row-major 2D to F8_128x4 layout.
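The F8_128x4 layout tiles the scale matrix into 128x4 blocks. A hedged sketch of the padding and tile-major reordering, assuming row-major element order within each tile (the true intra-tile ordering of F8_128x4 may differ; the helper name is hypothetical):

```python
import numpy as np

def swizzle_scales_128x4(scales: np.ndarray) -> np.ndarray:
    """Illustrative swizzle of a row-major [M, N] scale matrix.

    Pads M up to a multiple of 128 and N up to a multiple of 4, then emits
    the padded matrix tile by tile as a flat 1D array. Intra-tile ordering
    is an assumption, not the documented F8_128x4 ordering.
    """
    M, N = scales.shape
    Mp = (M + 127) // 128 * 128
    Np = (N + 3) // 4 * 4
    padded = np.zeros((Mp, Np), dtype=scales.dtype)
    padded[:M, :N] = scales
    # Group into (row-tile, col-tile) blocks of 128x4, then order tile-major.
    tiles = padded.reshape(Mp // 128, 128, Np // 4, 4).transpose(0, 2, 1, 3)
    return tiles.reshape(-1)  # flat swizzled 1D, matching the True branch above
```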