Problem Introduction
Hello, I am a member of a project whose partial goal is to distribute the BLOOM model across several computers. The communication between the computers is not a problem for us, but we don't know how to perform inference manually. If we knew how, we could create a sort of "chain" of computers, which would hopefully accelerate the whole computation.
It is important for us to perform inference manually because, if we succeed, we will start optimizing the selection of chain links based on the graphics card used in each host PC.
Assumptions for simplicity (these will be changed in future versions):
- Prompts arriving at the distributed architecture are processed serially.
- The number of computers and their parameters is not an issue.
- All computers use the same GPU model.
- Each computer has one GPU.
- There will be one BLOOM block per GPU.
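To make the idea concrete, here is a rough pseudocode sketch of the chain we have in mind under these assumptions (send_to_next_host / receive_from_previous_host are made-up placeholders for our already working communication layer, and run_my_block is exactly the part we don't know how to implement yet):

# pseudocode of the intended chain: every host owns exactly one BLOOM block
# and forwards the intermediate result to the next host in the chain
data = receive_from_previous_host()   # whatever a single block needs as input
result = run_my_block(data)           # the manual inference step this question is about
send_to_next_host(result)             # the next host repeats this with its own block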
Technology-related part
Computer specification:
- OS: Windows 10 Home
- Processor: Intel i7-11700K
- PC RAM: 32GB
- Graphics card: NVIDIA RTX 3060, 12 GB VRAM
Installed packages (pip freeze):
accelerate==0.16.0
asttokens==2.2.1
backcall==0.2.0
certifi==2022.12.7
charset-normalizer==3.0.1
colorama==0.4.6
comm==0.1.2
debugpy==1.6.6
decorator==5.1.1
executing==1.2.0
filelock==3.9.0
huggingface-hub==0.12.0
idna==3.4
ipykernel==6.21.1
ipython==8.9.0
jedi==0.18.2
jupyter_client==8.0.2
jupyter_core==5.2.0
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numpy==1.24.1
packaging==23.0
pandas==1.5.3
parso==0.8.3
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.6.2
prompt-toolkit==3.0.36
psutil==5.9.4
pure-eval==0.2.2
Pygments==2.14.0
python-dateutil==2.8.2
pytz==2022.7.1
pywin32==305
PyYAML==6.0
pyzmq==25.0.0
regex==2022.10.31
requests==2.28.2
safetensors==0.2.8
six==1.16.0
stack-data==0.6.2
tokenizers==0.13.2
torch==1.13.1+cu117
torchaudio==0.13.1+cu117
torchvision==0.14.1+cu117
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
transformers==4.26.0
typing_extensions==4.4.0
urllib3==1.26.14
wcwidth==0.2.6
Repository arrangement (as simple as it can be):
bloom
↳ bigscience/bloom repository files (https://huggingface.co/bigscience/bloom/tree/main)
bloom-7b1
↳ bigscience/bloom-7b1 repository files (https://huggingface.co/bigscience/bloom-7b1/tree/main)
venv
↳ ... (venv stuff)
main.py
Code (main.py):
from pprint import pprint
import torch
from transformers import AutoTokenizer, BloomConfig
from transformers.models.bloom.modeling_bloom import BloomBlock

# tokenizer loaded from the local copy of the bigscience/bloom repository
tokenizer = AutoTokenizer.from_pretrained(r'E:\programming\python\bloom-playground\models\bloom')
prompt = "Hi, I am TapMadl and I want to "
# tokenize the prompt and move the tensors to GPU 0 (this is a BatchEncoding, not a plain tensor)
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
# intended to load the local config file; note that BloomConfig's first positional
# argument is vocab_size, so this actually keeps the default values (e.g. n_layer=2)
configuration = BloomConfig('bloom\config.json')
pprint(configuration.num_hidden_layers)
# one randomly initialised BloomBlock per hidden layer
modules = torch.nn.ModuleList([BloomBlock(configuration) for _ in range(configuration.num_hidden_layers)])
pprint(type(modules[0]))
pprint(len(modules))
# try to run the tokenizer output through the first block -> fails, see output below
module1 = modules[0]
module1.forward(input_ids)
Code output:
2
<class 'transformers.models.bloom.modeling_bloom.BloomBlock'>
2
Traceback (most recent call last):
File "c:\programming\python\bloom-prototyping\main.py", line 23, in <module>
module1.forward(input_ids)
TypeError: BloomBlock.forward() missing 2 required positional arguments: 'alibi' and 'attention_mask'
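From the traceback and from reading modeling_bloom.py in my installed transformers 4.26.0, BloomBlock.forward seems to expect hidden states plus two extra tensors, so passing the tokenizer output to it directly cannot work. Abbreviated, and as far as I understand it, the signature looks roughly like this:

# transformers/models/bloom/modeling_bloom.py (v4.26.0), abbreviated
class BloomBlock(nn.Module):
    def forward(
        self,
        hidden_states,    # (batch_size, seq_len, hidden_size) hidden states, not token ids
        alibi,            # ALiBi positional bias, built with build_alibi_tensor()
        attention_mask,   # boolean causal/padding mask
        # ... plus optional keyword arguments (layer_past, head_mask, use_cache, ...)
    ):
        ...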
Code description:
Initially I was working with the AutoTokenizer and AutoModelForCausalLM pipeline from the code example in the bigscience/bloom repository. I hoped there would be some easy way to perform inference on only one block at a time, but I didn't find one.
Then I used the debugger to trace the chain of function calls and work out a minimal version of manual inference, and I realised how complicated the AutoModelForCausalLM class is in practice. By tinkering with the code in Jupyter notebooks and the debugger I managed to put together the code above.
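Based on how BloomModel.forward prepares the inputs for its blocks, my current guess at a manual forward pass through a single block looks like the sketch below. This is an untested sketch: the embedding and LayerNorm are randomly initialised stand-ins for the real checkpoint weights, the paths assume the script is run from the repository root, and I am not sure the mask shape is correct.

import torch
from transformers import AutoTokenizer, BloomConfig
from transformers.models.bloom.modeling_bloom import BloomBlock, build_alibi_tensor

# load tokenizer and the real config from the local bloom-7b1 folder
tokenizer = AutoTokenizer.from_pretrained('bloom-7b1')
config = BloomConfig.from_pretrained('bloom-7b1')

inputs = tokenizer("Hi, I am TapMadl and I want to ", return_tensors="pt")
input_ids = inputs["input_ids"]          # (batch, seq_len) token ids
padding_mask = inputs["attention_mask"]  # (batch, seq_len), 1 = real token

# 1) token ids -> hidden states (word embeddings + embedding LayerNorm; randomly
#    initialised here, the real model loads these weights from the checkpoint)
word_embeddings = torch.nn.Embedding(config.vocab_size, config.hidden_size)
embedding_layernorm = torch.nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
hidden_states = embedding_layernorm(word_embeddings(input_ids))

# 2) ALiBi positional bias, shared by every block
alibi = build_alibi_tensor(padding_mask, config.n_head, dtype=hidden_states.dtype)

# 3) boolean causal mask (True = masked), broadcastable to (batch, n_head, seq, seq)
seq_len = input_ids.shape[1]
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)[None, None, :, :]

# 4) run one (randomly initialised) block; chaining blocks would repeat this step
block = BloomBlock(config)
outputs = block(hidden_states, alibi=alibi, attention_mask=causal_mask)
hidden_states = outputs[0]               # BloomBlock returns a tuple; [0] is the new hidden states
print(hidden_states.shape)

If this is roughly right, then each computer in the chain would only need to receive hidden_states (plus alibi and the mask), run its own block, and pass the result on.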
If somebody could help with defining manual inference, I would be glad. Thank you in advance for your help.