vllm.model_executor.kernels.linear.mxfp8.Mxfp8LinearKernel ¶

Mxfp8LinearKernel ¶

Bases: ABC

Base class for MXFP8 quantized linear kernels.

Each subclass implements a specific GEMM backend (FlashInfer CUTLASS, Marlin, emulation).

Source code in vllm/model_executor/kernels/linear/mxfp8/Mxfp8LinearKernel.py

class Mxfp8LinearKernel(ABC):
    """Base class for MXFP8 quantized linear kernels.

    Each subclass implements a specific GEMM backend (FlashInfer CUTLASS,
    Marlin, emulation).
    """

    def __init__(self, c: Mxfp8LinearLayerConfig) -> None:
        assert self.can_implement(c)[0]
        assert self.is_supported()[0]
        self.config = c

    @classmethod
    @abstractmethod
    def is_supported(
        cls, compute_capability: int | None = None
    ) -> tuple[bool, str | None]:
        raise NotImplementedError

    @classmethod
    @abstractmethod
    def can_implement(cls, c: Mxfp8LinearLayerConfig) -> tuple[bool, str | None]:
        raise NotImplementedError

    @abstractmethod
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError

    @abstractmethod
    def apply_weights(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        raise NotImplementedError

Mxfp8LinearLayerConfig `dataclass` ¶

Configuration for an MXFP8 linear layer.

All MXFP8 layers share the same structure: FP8-E4M3 weights with uint8 (E8M0) per-block scales at block size 32.

Source code in vllm/model_executor/kernels/linear/mxfp8/Mxfp8LinearKernel.py

@dataclass
class Mxfp8LinearLayerConfig:
    """Configuration for an MXFP8 linear layer.

    All MXFP8 layers share the same structure: FP8-E4M3 weights with
    uint8 (E8M0) per-block scales at block size 32.
    """

    pass

vllm.model_executor.kernels.linear.mxfp8.Mxfp8LinearKernel ¶

Mxfp8LinearKernel ¶

Mxfp8LinearLayerConfig dataclass ¶

Mxfp8LinearLayerConfig `dataclass` ¶