vllm.model_executor.kernels.linear.nvfp4 ¶
Modules:
| Name | Description |
|---|---|
base | |
cutlass | |
emulation | |
fbgemm | |
flashinfer | |
marlin | |
NvFp4LinearKernel ¶
Bases: ABC
Base class for NVFP4 quantized linear kernels.
Each subclass implements a specific GEMM backend (CUTLASS, Marlin, etc.). The kernel selection mechanism iterates over registered subclasses in priority order, calling is_supported and can_implement to find the best match for the current hardware.
Source code in vllm/model_executor/kernels/linear/nvfp4/base.py
apply_weights abstractmethod ¶
Run the quantized GEMM.
can_implement abstractmethod classmethod ¶
can_implement(
config: NvFp4LinearLayerConfig,
) -> tuple[bool, str | None]
Return whether this kernel can handle config.
is_supported abstractmethod classmethod ¶
Return whether this kernel can run on the current platform.
process_weights_after_loading abstractmethod ¶
process_weights_after_loading(layer: Module) -> None
Transform weights into the format required by this kernel.
Called once after checkpoint weights have been loaded onto the device. Implementations should repack, swizzle, or pad weights and scales in place on layer.
Source code in vllm/model_executor/kernels/linear/nvfp4/base.py
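As a rough illustration of the in-place repacking contract, the toy implementation below packs two 4-bit codes per byte on a dummy layer object. `DummyLayer` and the packing layout are assumptions for the sketch; real kernels operate on torch tensors and may swizzle or pad as well.

```python
class DummyLayer:
    """Stand-in for a torch Module holding unpacked 4-bit weight codes."""

    def __init__(self, fp4_codes: list[int]):
        self.weight = fp4_codes  # one 4-bit code per element


def process_weights_after_loading(layer: DummyLayer) -> None:
    """Pack two FP4 codes per uint8, mutating `layer` in place.

    Mirrors the documented contract: transform weights into the kernel's
    required format once, after checkpoint loading.
    """
    codes = layer.weight
    assert len(codes) % 2 == 0, "FP4 packing needs an even element count"
    # Low nibble = even-indexed code, high nibble = odd-indexed code
    # (an assumed layout for illustration).
    layer.weight = [
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    ]
```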
NvFp4LinearLayerConfig dataclass ¶
Configuration for an NVFP4 linear layer.
All NVFP4 layers share the same structure: packed uint8 weights (2 FP4 values per byte), FP8-E4M3 per-block weight scales (group size 16), and scalar global scales for both weights and activations.
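The shared structure above implies fixed shape relationships between the layer dimensions and its quantized tensors. The sketch below is a hypothetical config with assumed field names (`in_features`, `out_features`); only the packing arithmetic (2 FP4 values per uint8, scale group size 16) comes from the description.

```python
from dataclasses import dataclass

GROUP_SIZE = 16  # FP8-E4M3 weight-scale block size, per the description


@dataclass
class NvFp4LinearLayerConfig:
    """Illustrative config; field names are assumptions for this sketch."""
    in_features: int
    out_features: int


def expected_shapes(cfg: NvFp4LinearLayerConfig):
    """Return (packed_weight_shape, weight_scale_shape) for the layer."""
    # Packed uint8 weights: two FP4 values per byte halves the inner dim.
    packed_weight = (cfg.out_features, cfg.in_features // 2)
    # One FP8-E4M3 scale per group of 16 input elements.
    weight_scale = (cfg.out_features, cfg.in_features // GROUP_SIZE)
    return packed_weight, weight_scale
```

For example, a 4096-in, 8192-out layer stores an (8192, 2048) uint8 weight and an (8192, 256) scale tensor, plus the two scalar global scales.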