vllm.model_executor.model_loader.weight_utils ¶
Utilities for downloading and initializing model weights.
_get_checkpoints_size_bytes ¶
Return the total size of the checkpoint files in bytes.
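A minimal sketch of how such a total could be computed, assuming the input is a list of local file paths. The name `checkpoints_size_bytes` is hypothetical and this is not necessarily how the private helper is implemented.

```python
import os


def checkpoints_size_bytes(files: list[str]) -> int:
    """Sum the on-disk sizes of the given files, in bytes (illustrative only)."""
    return sum(os.path.getsize(path) for path in files)
```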
_get_fs_type ¶
Get the filesystem type of the first file in files (Linux only).
Source code in vllm/model_executor/model_loader/weight_utils.py
_natural_sort_key ¶
Natural sort key for filenames with numeric components. For example, model-00001-of-00005.safetensors maps to ['model-', 1, '-of-', 5, '.safetensors'], so shards sort by number rather than lexicographically.
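The idea can be sketched with a regular-expression split. This is an illustration under the assumption that the key is a list of alternating text and integer pieces, as in the example above; the function name here is hypothetical and the real helper may differ in detail.

```python
import re


def natural_sort_key(filename: str) -> list:
    """Split a filename into text and integer pieces for natural ordering."""
    # re.split with a capturing group alternates: text, digits, text, digits, ...
    parts = re.split(r"(\d+)", filename)
    # Odd positions are always the digit runs, so only those become integers.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


shards = ["shard-10.bin", "shard-2.bin", "shard-1.bin"]
ordered = sorted(shards, key=natural_sort_key)
```

Because text and integer pieces always sit at the same positions, two keys can be compared without mixing `str` and `int` at any index.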
_prefetch_all_checkpoints ¶
Start prefetching checkpoint files into page cache in a background thread.
_prefetch_checkpoint ¶
_prefetch_checkpoint(file_path: str) -> None
Prefetch a checkpoint file into the OS page cache.
Reads the file in 16 MB blocks and discards the data, so the kernel caches the file's pages before workers load the same file.
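A rough sketch of the read-and-discard technique, including a background thread as described for `_prefetch_all_checkpoints`. The names `prefetch_file` and `prefetch_in_background` are hypothetical, and unlike the documented helper (which returns `None`) this version returns the byte count so the behavior is easy to check.

```python
import threading

BLOCK_SIZE = 16 * 1024 * 1024  # 16 MiB per read, matching the description above


def prefetch_file(file_path: str, block_size: int = BLOCK_SIZE) -> int:
    """Read a file sequentially and discard the data to warm the page cache."""
    total = 0
    with open(file_path, "rb") as f:
        # The bytes are thrown away; the side effect on the OS cache is the point.
        while chunk := f.read(block_size):
            total += len(chunk)
    return total


def prefetch_in_background(file_paths: list[str]) -> threading.Thread:
    """Warm the cache for several files without blocking the caller."""
    def _run() -> None:
        for path in file_paths:
            prefetch_file(path)

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

Whether the pages stay cached depends on available memory and the operating system, so this is best treated as a hint rather than a guarantee.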
atomic_writer ¶
atomic_writer(
filepath: str | Path,
mode: str = "w",
encoding: str | None = None,
) -> Generator[IO]
Context manager that provides an atomic file writing routine.
The context manager writes to a temporary file and, if successful, atomically replaces the original file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str or Path | The path to the file to write. | required |
| mode | str | The file mode for the temporary file (e.g., 'w', 'wb'). | 'w' |
| encoding | str | The encoding for text mode. | None |
Yields:
| Type | Description |
|---|---|
| Generator[IO] | File object: a handle to the temporary file. |
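The write-then-replace pattern can be sketched with the standard library. This is an illustration under the assumption that the temporary file lives in the same directory as the target (so the final rename stays on one filesystem); the name `atomic_writer_sketch` is hypothetical and the actual implementation may differ.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_writer_sketch(filepath, mode="w", encoding=None):
    """Yield a temporary file and replace the target only if writing succeeds."""
    target = Path(filepath)
    # Create the temporary file next to the target so os.replace is a rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=f".{target.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, mode, encoding=encoding) as tmp_file:
            yield tmp_file
        # Reached only when the caller's block finished without raising.
        os.replace(tmp_name, target)
    except BaseException:
        # On failure, drop the partial file and leave the original untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

A caller would use it as `with atomic_writer_sketch("config.json") as f: f.write(...)`; readers of the target then see either the old content or the complete new content.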
composed_weight_loader ¶
Create a weight loader that post-processes the weights after loading.
convert_pyslice_to_tensor ¶
Convert a PySafeSlice object from safetensors to a torch.Tensor.
A PySafeSlice object supports indexing, which is applied before the actual tensor is loaded and can reduce the amount of data read into memory. However, it does not support more advanced operations such as .view() or .t(). Therefore, if the loaded tensor needs to be modified with these operations, it must first be converted to a tensor.
default_weight_loader ¶
Default weight loader.
download_safetensors_index_file_from_hf ¶
download_safetensors_index_file_from_hf(
model_name_or_path: str,
index_file: str,
cache_dir: str | None,
subfolder: str | None = None,
revision: str | None = None,
) -> None
Download the safetensors index file from the Hugging Face Hub.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name_or_path | str | The model name or path. | required |
| index_file | str | The safetensors index file name. | required |
| cache_dir | Optional[str] | The cache directory to store the model weights. If None, will use HF defaults. | required |
| subfolder | Optional[str] | The subfolder within the model repository to download weights from. | None |
| revision | Optional[str] | The revision of the model. | None |
download_weights_from_hf ¶
download_weights_from_hf(
model_name_or_path: str,
cache_dir: str | None,
allow_patterns: list[str],
revision: str | None = None,
subfolder: str | None = None,
ignore_patterns: str | list[str] | None = None,
) -> str
Download model weights from Hugging Face Hub.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name_or_path | str | The model name or path. | required |
| cache_dir | Optional[str] | The cache directory to store the model weights. If None, will use HF defaults. | required |
| allow_patterns | list[str] | The allowed patterns for the weight files. Files matched by any of the patterns will be downloaded. | required |
| revision | Optional[str] | The revision of the model. | None |
| subfolder | Optional[str] | The subfolder within the model repository to download weights from. | None |
| ignore_patterns | Optional[Union[str, list[str]]] | The patterns to filter out the weight files. Files matched by any of the patterns will be ignored. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The path to the downloaded model weights. |
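To illustrate how `allow_patterns` and `ignore_patterns` interact, here is a small sketch that assumes glob-style matching. The helper `select_files` and the file names are hypothetical; the function above delegates the actual filtering and download to the Hugging Face tooling.

```python
from fnmatch import fnmatchcase


def select_files(files, allow_patterns, ignore_patterns=None):
    """Keep files matching any allow pattern and no ignore pattern."""
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]
    ignore_patterns = ignore_patterns or []
    return [
        name
        for name in files
        if any(fnmatchcase(name, pat) for pat in allow_patterns)
        and not any(fnmatchcase(name, pat) for pat in ignore_patterns)
    ]


repo_files = [
    "model-00001-of-00002.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
    "original/consolidated.safetensors",
]
selected = select_files(repo_files, ["*.safetensors"], "original/*")
```

Here only the top-level safetensors shard is selected: the index file and the `.bin` file match no allow pattern, and the file under `original/` is excluded by the ignore pattern.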
enable_hf_transfer ¶
Automatically activates hf_transfer.
enable_xet_high_performance ¶
Automatically activates xet high performance mode.
fastsafetensors_weights_iterator ¶
fastsafetensors_weights_iterator(
hf_weights_files: list[str], use_tqdm_on_load: bool
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model safetensors files using the fastsafetensors library.
filter_files_not_needed_for_inference ¶
Exclude files that are not needed for inference.
See https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/trainer.py#L227-L233
get_gguf_weight_type_map ¶
get_gguf_weight_type_map(
gguf_file: str | Path,
gguf_to_hf_name_map: dict[str, str],
) -> dict[str, str]
Return a mapping from each mapped GGUF weight name to its quantization type.
gguf_quant_weights_iterator ¶
gguf_quant_weights_iterator(
gguf_file: str | Path,
gguf_to_hf_name_map: dict[str, str],
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the quantized weights in the model GGUF file and convert them to torch tensors.
The order of yielding matters: all weight types must be yielded before any weight data. Otherwise, loading can fail for packed layers whose sub-weights use different quant types.
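The ordering constraint amounts to a two-pass generator. The sketch below uses plain Python values instead of GGUF tensors; the entry layout and the `.qweight_type` / `.qweight` suffixes are assumptions made for illustration only.

```python
from collections.abc import Iterator


def types_then_data(
    entries: list[tuple[str, str, bytes]],
) -> Iterator[tuple[str, object]]:
    """Yield every (name, quant type) pair before any (name, data) pair."""
    # Pass 1: announce the quant type of every weight.
    for name, quant_type, _ in entries:
        yield f"{name}.qweight_type", quant_type
    # Pass 2: only now yield the actual weight payloads.
    for name, _, data in entries:
        yield f"{name}.qweight", data


entries = [
    ("layers.0.q_proj", "Q4_K", b"\x01"),
    ("layers.0.k_proj", "Q6_K", b"\x02"),
]
yielded_names = [name for name, _ in types_then_data(entries)]
```

A consumer that packs `q_proj` and `k_proj` into one layer therefore knows both quant types before it receives the first payload.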
gguf_quant_weights_iterator_multi ¶
gguf_quant_weights_iterator_multi(
gguf_files: list[str],
gguf_to_hf_name_map: dict[str, str],
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the quant weights across multiple GGUF shard files and convert them to torch tensors.
Like gguf_quant_weights_iterator, we yield all weight types first before yielding any weights data to avoid issues with packed layers that have different quant types.
initialize_dummy_weights ¶
initialize_dummy_weights(
model: Module,
model_config: ModelConfig,
low: float = -0.001,
high: float = 0.001,
seed: int = 1234,
) -> None
Initialize model weights with random values.
The model weights must be randomly initialized for accurate performance measurements. Additionally, the model weights should not cause NaNs in the forward pass. We empirically found that initializing the weights with values between -1e-3 and 1e-3 works well for most models.
We use a per-parameter random seed so that dummy weights are consistent even if the model is partitioned across multiple devices. When the seed is fixed, the random values generated by this function depend only on the parameter's number of elements and its data type.
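The per-parameter seeding idea can be sketched without any tensor library. This illustration uses Python's `random` module in place of a torch generator, and `dummy_values` is a hypothetical name; it only shows why a fresh generator per parameter makes the values independent of initialization order and device placement.

```python
import random


def dummy_values(
    numel: int, low: float = -1e-3, high: float = 1e-3, seed: int = 1234
) -> list[float]:
    """Draw numel uniform values from a generator created just for this parameter."""
    # A fresh generator per parameter: earlier parameters cannot shift the stream.
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(numel)]


# Two parameters of the same size get identical values, regardless of the
# order in which they are initialized or the device they live on.
first = dummy_values(8)
second = dummy_values(8)
```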
instanttensor_weights_iterator ¶
instanttensor_weights_iterator(
hf_weights_files: list[str], use_tqdm_on_load: bool
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model safetensors files using the instanttensor library.
maybe_download_from_modelscope ¶
maybe_download_from_modelscope(
model: str,
revision: str | None = None,
download_dir: str | None = None,
ignore_patterns: str | list[str] | None = None,
allow_patterns: list[str] | str | None = None,
) -> str | None
Download model from ModelScope hub if VLLM_USE_MODELSCOPE is True.
Returns the path to the downloaded model, or None if the model is not downloaded from ModelScope.
maybe_remap_kv_scale_name ¶
Remap the name of FP8 k/v_scale parameters.
This function handles the remapping of FP8 k/v_scale parameter names. It detects if the given name ends with a suffix and attempts to remap it to the expected name format in the model. If the remapped name is not found in the params_dict, a warning is printed and None is returned.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The original loaded checkpoint parameter name. | required |
| params_dict | dict | Dictionary containing the model's named parameters. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str \| None | The remapped parameter name if successful, or the original name if no remapping is needed. |
| None | str \| None | If the remapped name is not found in params_dict. |
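The three documented outcomes (remapped name, unchanged name, or None) can be sketched as follows. The suffix mapping passed in here is purely hypothetical and chosen for illustration; the real function has its own set of suffixes and checkpoint-format rules.

```python
import logging

logger = logging.getLogger(__name__)


def remap_scale_name(name: str, params_dict: dict, suffix_map: dict[str, str]):
    """Return the remapped name, the original name, or None (illustrative)."""
    if name in params_dict:
        # Already in the expected format: nothing to remap.
        return name
    for suffix, new_suffix in suffix_map.items():
        if name.endswith(suffix):
            remapped = name[: -len(suffix)] + new_suffix
            if remapped not in params_dict:
                logger.warning("Scale %s has no target %s; skipping", name, remapped)
                return None
            return remapped
    # No known suffix: the name needs no remapping.
    return name


# Hypothetical mapping and parameter names, for illustration only.
suffix_map = {".k_scale": ".attn.k_scale", ".v_scale": ".attn.v_scale"}
params = {"layers.0.attn.k_scale": object()}
```

Callers typically treat a `None` result as "skip this checkpoint entry" and continue with the next one.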
multi_thread_pt_weights_iterator ¶
multi_thread_pt_weights_iterator(
hf_weights_files: list[str],
use_tqdm_on_load: bool,
pt_load_map_location: str | dict[str, str] = "cpu",
max_workers: int = 4,
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model bin/pt files using multiple threads.
multi_thread_safetensors_weights_iterator ¶
multi_thread_safetensors_weights_iterator(
hf_weights_files: list[str],
use_tqdm_on_load: bool,
max_workers: int = 4,
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model safetensors files using multiple threads.
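A generic sketch of the multi-threaded pattern, assuming each file can be loaded independently into a name-to-tensor mapping. The `load_file` callable is a stand-in supplied by the caller (hypothetical here), and files are yielded in completion order, which may not match the real iterator.

```python
from collections.abc import Callable, Iterator
from concurrent.futures import ThreadPoolExecutor, as_completed


def multi_thread_iterator(
    files: list[str],
    load_file: Callable[[str], dict],
    max_workers: int = 4,
) -> Iterator[tuple[str, object]]:
    """Load files concurrently and yield (name, weight) pairs as each finishes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(load_file, path) for path in files]
        for future in as_completed(futures):
            # result() re-raises any exception from the worker thread.
            yield from future.result().items()


def fake_load(path: str) -> dict:
    """Stand-in loader used only for this example."""
    return {f"{path}.weight": len(path)}


pairs = sorted(multi_thread_iterator(["a.bin", "bb.bin"], fake_load, max_workers=2))
```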
np_cache_weights_iterator ¶
np_cache_weights_iterator(
model_name_or_path: str,
cache_dir: str | None,
hf_folder: str,
hf_weights_files: list[str],
use_tqdm_on_load: bool,
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model np files.
Will dump the model weights to numpy files if they are not already dumped.
pt_weights_iterator ¶
pt_weights_iterator(
hf_weights_files: list[str],
use_tqdm_on_load: bool,
pt_load_map_location: str | dict[str, str] = "cpu",
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model bin/pt files.
row_parallel_weight_loader ¶
Load weights that are row-parallelized.
runai_safetensors_weights_iterator ¶
runai_safetensors_weights_iterator(
hf_weights_files: list[str],
use_tqdm_on_load: bool,
is_distributed: bool = False,
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model safetensors files using the Run:ai streaming loader.
safetensors_weights_iterator ¶
safetensors_weights_iterator(
hf_weights_files: list[str],
use_tqdm_on_load: bool,
safetensors_load_strategy: str | None = None,
local_expert_ids: set[int] | None = None,
) -> Generator[tuple[str, Tensor], None, None]
Iterate over the weights in the model safetensors files.
When local_expert_ids is provided, expert weights that do not belong to this rank are skipped before they are read from disk, which drastically reduces storage I/O for MoE models under expert parallelism (EP).
sharded_weight_loader ¶
sharded_weight_loader(shard_axis: int) -> LoaderFunction
Create a weight loader that shards the weights along the given axis.