Batch Inference

Multi-GPU Deployment

Using Hugging Face

Reference: 【AI大模型】Transformers大模型库(七):单机多卡推理之device_map_transformers多卡推理 (CSDN blog)

First, restrict which GPUs are visible, either on the command line:

CUDA_VISIBLE_DEVICES=1,2,3 python

or from inside the script:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
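Note that the mask only takes effect if it is set before CUDA is initialized, so in a script it should come before importing torch. A minimal sketch:

# set the GPU mask first, before anything touches CUDA
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

import torch  # imported after the mask is in place
print(torch.cuda.device_count())  # 3; the visible cards are renumbered 0, 1, 2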

Install the transformers and accelerate libraries:
pip install transformers -i https://mirrors.cloud.tencent.com/pypi/simple
pip install accelerate -i https://mirrors.cloud.tencent.com/pypi/simple

Then load the model and let accelerate spread its layers over the visible GPUs:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
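Continuing from that model, a minimal generation round trip might look like this (the prompt and max_new_tokens are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# with device_map="auto", model.device is where the first layers live;
# accelerate's hooks move activations between cards during the forward pass
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))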
 

You can also split the model yourself, assigning layers to specific devices, as described in the article below.

Reference: Huggingface Transformers+Accelerate多卡推理实践(指定GPU和最大显存) (Zhihu)
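One way to steer the split without listing every layer is to cap per-device memory and let accelerate derive a placement that fits; the sizes below are illustrative, not measured:

model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "48GiB"},  # illustrative caps
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
print(model.hf_device_map)  # inspect which module landed on which device

A fully manual split passes a dict as device_map instead, mapping module names (which depend on the architecture) to device indices.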

Using PyTorch's Built-in DDP and DP

Avoid DP: it is a single-process design that replicates the model and scatters every batch across GPUs, so it is much slower than DDP, which runs one process per GPU. A data-parallel inference sketch follows.
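For pure inference there are no gradients to synchronize, so the DDP wrapper itself is not even needed; launching one process per GPU and giving each process a shard of the data achieves the same effect. A minimal sketch (model_dir, the prompts, and the script name are placeholders), launched with torchrun --nproc_per_node=3 infer_ddp.py:

import os
import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

dist.init_process_group("nccl")  # torchrun supplies the rendezvous env vars
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model_dir = "/path/to/model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, trust_remote_code=True).cuda().eval()

prompts = ["..."] * 2000  # placeholder data
shard = prompts[dist.get_rank()::dist.get_world_size()]  # interleaved shard per process

results = []
with torch.no_grad():
    for p in shard:
        inputs = tokenizer(p, return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=128)
        results.append(tokenizer.decode(out[0], skip_special_tokens=True))

dist.destroy_process_group()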

Hands-On

Letting transformers allocate the model across GPUs with device_map="auto"

Throughput was shockingly bad: an estimated 13 hours for just these 2,000 samples, even though a single card previously got through well over a hundred thousand samples in about 44 hours.

On a single card this job takes around 4 hours.

I suspect the slowdown is because device_map="auto" is pipeline-style model parallelism, not data parallelism: the layers are split across cards, only one GPU computes at a time, and every forward pass ships activations between GPUs, so a slow inter-GPU link makes things worse rather than faster.
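Whatever the placement, generating one sample at a time also leaves the GPUs mostly idle; batching the prompts is usually the bigger lever. A sketch, assuming a decoder-only model and reusing the tokenizer, model, and prompts from above:

# decoder-only models should be padded on the left so every sequence
# ends its prompt at the same position before generation starts
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch_size = 16  # illustrative; tune to fit memory
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128,
                             pad_token_id=tokenizer.pad_token_id)
    texts = tokenizer.batch_decode(out, skip_special_tokens=True)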

First, this warning appeared:

We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map. Please make sure to update your driver to the latest version which resolves this.

On top of that, I was running on GPU0 and GPU4, which are not under the same PCIe host bridge; they sit on different NUMA nodes and talk over the SYS path:

(TinyRAG) jsh@user-ESC8000A-E11:/data/jsh/code/TinyRAG$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    64-127,192-255  1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    64-127,192-255  1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    64-127,192-255  1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      64-127,192-255  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Next attempt: GPU4 and GPU7, which are connected via NODE, i.e. within the same NUMA node.
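To test that pairing, pin the process to those two cards; inside the process they are renumbered cuda:0 and cuda:1 (the script name is a placeholder):

CUDA_VISIBLE_DEVICES=4,7 python infer.py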
