vllm.model_executor.layers.mamba.lamport_workspace ¶
IpcBuffer ¶
Allocates CUDA device memory and exchanges IPC handles with all ranks so that every rank holds a valid device pointer to every other rank's buffer.
Source code in vllm/model_executor/layers/mamba/lamport_workspace.py
serialize ¶
Return peer pointers as a list of int64 values (one per rank).
LamportWorkspace ¶
Self-contained workspace for Lamport-based cross-GPU AllReduce.
Parameters¶
rank : int
    Local rank (0-based).
world_size : int
    Total number of ranks in the TP group.
comm_size : int
    Size in bytes of one Lamport buffer slot. The total IPC allocation per rank is 3 * comm_size (triple-buffering). Must be large enough to hold the per-slot data written by the kernel. Use compute_comm_size_for_minimax() for a safe default.
process_group : optional
    torch.distributed process group for IPC handle exchange. None uses the default group.
Source code in vllm/model_executor/layers/mamba/lamport_workspace.py
workspace property ¶
workspace: Tensor
Device tensor (int64) that can be passed to the kernel as void** workspace.
compute_comm_size_for_minimax staticmethod ¶
Return a safe comm_size (in bytes) for MiniMaxReduceRMSKernel.
The kernel stores per-token variance scalars in the Lamport buffer:

- single-matrix path: world_size × max_tokens × 4 bytes per slot
- fused Q+K path: world_size × 2 × ceil(max_tokens / 4) × 16 bytes per slot
The returned value is rounded up to 2 MiB alignment.
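The sizing rule above can be sketched in plain Python. This is an illustrative reconstruction from the documented formulas, not the module's actual implementation; the function name mirrors the documented static method, but the body is an assumption.

```python
import math

MIB = 1 << 20  # 1 MiB in bytes


def compute_comm_size_for_minimax(world_size: int, max_tokens: int) -> int:
    """Sketch of the documented comm_size computation (assumed body).

    Takes the larger of the two per-slot footprints, then rounds the
    result up to 2 MiB alignment.
    """
    # Single-matrix path: one fp32 variance scalar per (rank, token).
    single = world_size * max_tokens * 4
    # Fused Q+K path: two matrices, tokens packed four per 16-byte group.
    fused = world_size * 2 * math.ceil(max_tokens / 4) * 16
    size = max(single, fused)
    # Round up to a multiple of 2 MiB.
    return (size + 2 * MIB - 1) // (2 * MIB) * (2 * MIB)
```

For example, with world_size=8 and max_tokens=16384 the fused path dominates (8 × 2 × 4096 × 16 bytes = 1 MiB), which rounds up to a 2 MiB slot.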
Source code in vllm/model_executor/layers/mamba/lamport_workspace.py
_check ¶
Raise on CUDA runtime error.
_lamport_fill_neg_zero ¶
Fill device memory with IEEE-754 negative zero (-0.0f = 0x80000000). This is the "slot empty" sentinel for the Lamport protocol: the kernel spin-waits until a value is not negative zero.
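A short sketch of why -0.0 works as the sentinel: it has a bit pattern (0x80000000) distinct from every value the kernel actually writes, yet it compares equal to 0.0, so the emptiness check must look at bits (e.g. the sign bit), not the floating-point value. This is a standalone illustration, not code from the module.

```python
import struct

NEG_ZERO_BITS = 0x80000000  # IEEE-754 bit pattern of -0.0f

# Reinterpret the float -0.0 as its 32-bit pattern.
bits = struct.unpack("<I", struct.pack("<f", -0.0))[0]
assert bits == NEG_ZERO_BITS

# Value comparison cannot distinguish the sentinel from zero,
# which is why the kernel's spin-wait inspects the bit pattern.
assert -0.0 == 0.0
```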
Source code in vllm/model_executor/layers/mamba/lamport_workspace.py
get_allreduce_workspace ¶
get_allreduce_workspace(
rank: int,
world_size: int,
comm_size: int | None = None,
max_tokens: int = 16384,
process_group=None,
) -> Tensor
Return a cached workspace tensor for the given (rank, world_size) pair.
On first call the workspace is allocated and IPC handles are exchanged; subsequent calls with the same arguments return the cached tensor.
Parameters¶
rank, world_size : int
    TP rank and TP size.
comm_size : int, optional
    Explicit slot size in bytes. If None, computed automatically from max_tokens and world_size (fused Q+K path).
max_tokens : int
    Maximum number of tokens per batch (used when comm_size is None).
process_group : optional
    torch.distributed process group.