Problem Introduction
Hello, I am a member of a project whose partial goal is to distribute the BLOOM model across several computers. The communication between the computers is not a problem for us, but we don't know how to perform inference manually. If we knew how, we could create a sort of "chain" of computers, which would hopefully accelerate the whole computation.
It is important for us to perform inference manually because, if we succeed, we will start optimizing the selection of chain links based on the graphics card used in each host PC.
Assumptions for simplicity (these will be changed in future versions):
- Prompts arriving at the distributed architecture are processed serially.
- The number of computers and their parameters is not an issue.
- All computers use the same GPU model.
- Each computer has one GPU.
- There will be one BLOOM block per GPU.
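To make the idea concrete, here is a rough pseudocode sketch of the chain we have in mind under these assumptions (send_to_next_host / receive_from_previous_host are made-up placeholders for our already working communication layer, and run_my_block is exactly the part we don't know how to implement yet):

# pseudocode of the intended chain: every host owns exactly one BLOOM block
# and forwards the intermediate result to the next host in the chain
data = receive_from_previous_host()   # whatever a single block needs as input
result = run_my_block(data)           # the manual inference step this question is about
send_to_next_host(result)             # the next host repeats this with its own block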
Technology-related part
Computer specification:
- OS: Windows 10 Home
- Processor: Intel i7-11700K
- PC RAM: 32GB
- Graphics card: NVIDIA RTX 3060, 12 GB VRAM
Installed packages (pip freeze):
accelerate==0.16.0
asttokens==2.2.1
backcall==0.2.0
certifi==2022.12.7
charset-normalizer==3.0.1
colorama==0.4.6
comm==0.1.2
debugpy==1.6.6
decorator==5.1.1
executing==1.2.0
filelock==3.9.0
huggingface-hub==0.12.0
idna==3.4
ipykernel==6.21.1
ipython==8.9.0
jedi==0.18.2
jupyter_client==8.0.2
jupyter_core==5.2.0
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numpy==1.24.1
packaging==23.0
pandas==1.5.3
parso==0.8.3
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.6.2
prompt-toolkit==3.0.36
psutil==5.9.4
pure-eval==0.2.2
Pygments==2.14.0
python-dateutil==2.8.2
pytz==2022.7.1
pywin32==305
PyYAML==6.0
pyzmq==25.0.0
regex==2022.10.31
requests==2.28.2
safetensors==0.2.8
six==1.16.0
stack-data==0.6.2
tokenizers==0.13.2
torch==1.13.1+cu117
torchaudio==0.13.1+cu117
torchvision==0.14.1+cu117
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
transformers==4.26.0
typing_extensions==4.4.0
urllib3==1.26.14
wcwidth==0.2.6
Repository arrangement (as simple as it can be):
bloom
↳ bigscience/bloom repository files (https://huggingface.co/bigscience/bloom/tree/main)
bloom-7b1
↳ bigscience/bloom-7b1 repository files (https://huggingface.co/bigscience/bloom-7b1/tree/main)
venv
↳ ... (venv stuff)
main.py
Code (main.py):
from pprint import pprint
import torch
from transformers import AutoTokenizer, BloomConfig
from transformers.models.bloom.modeling_bloom import BloomBlock

# tokenizer loaded from the local copy of the bigscience/bloom repository
tokenizer = AutoTokenizer.from_pretrained(r'E:\programming\python\bloom-playground\models\bloom')
prompt = "Hi, I am TapMadl and I want to "
# tokenize the prompt and move the tensors to GPU 0 (this is a BatchEncoding, not a plain tensor)
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
# intended to load the local config file; note that BloomConfig's first positional
# argument is vocab_size, so this actually keeps the default values (e.g. n_layer=2)
configuration = BloomConfig('bloom\config.json')
pprint(configuration.num_hidden_layers)
# one randomly initialised BloomBlock per hidden layer
modules = torch.nn.ModuleList([BloomBlock(configuration) for _ in range(configuration.num_hidden_layers)])
pprint(type(modules[0]))
pprint(len(modules))
# try to run the tokenizer output through the first block -> fails, see output below
module1 = modules[0]
module1.forward(input_ids)
Code output:
2
<class 'transformers.models.bloom.modeling_bloom.BloomBlock'>
2
Traceback (most recent call last):
File "c:\programming\python\bloom-prototyping\main.py", line 23, in <module>
module1.forward(input_ids)
TypeError: BloomBlock.forward() missing 2 required positional arguments: 'alibi' and 'attention_mask'
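From the traceback and from reading modeling_bloom.py in my installed transformers 4.26.0, BloomBlock.forward seems to expect hidden states plus two extra tensors, so passing the tokenizer output to it directly cannot work. Abbreviated, and as far as I understand it, the signature looks roughly like this:

# transformers/models/bloom/modeling_bloom.py (v4.26.0), abbreviated
class BloomBlock(nn.Module):
    def forward(
        self,
        hidden_states,    # (batch_size, seq_len, hidden_size) hidden states, not token ids
        alibi,            # ALiBi positional bias, built with build_alibi_tensor()
        attention_mask,   # boolean causal/padding mask
        # ... plus optional keyword arguments (layer_past, head_mask, use_cache, ...)
    ):
        ...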
Code description:
Initially I was working with the AutoTokenizer and AutoModelForCausalLM pipeline from the code example in the bigscience/bloom repository. I hoped there would be some easy way to perform inference on only one block at a time, but I didn't find one.
Then I used the debugger to trace the chain of function calls and work out a minimal version of manual inference, and I realised how complicated the AutoModelForCausalLM class is in practice. By tinkering with the code in Jupyter notebooks and the debugger I managed to put together the code above.
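Based on how BloomModel.forward prepares the inputs for its blocks, my current guess at a manual forward pass through a single block looks like the sketch below. This is an untested sketch: the embedding and LayerNorm are randomly initialised stand-ins for the real checkpoint weights, the paths assume the script is run from the repository root, and I am not sure the mask shape is correct.

import torch
from transformers import AutoTokenizer, BloomConfig
from transformers.models.bloom.modeling_bloom import BloomBlock, build_alibi_tensor

# load tokenizer and the real config from the local bloom-7b1 folder
tokenizer = AutoTokenizer.from_pretrained('bloom-7b1')
config = BloomConfig.from_pretrained('bloom-7b1')

inputs = tokenizer("Hi, I am TapMadl and I want to ", return_tensors="pt")
input_ids = inputs["input_ids"]          # (batch, seq_len) token ids
padding_mask = inputs["attention_mask"]  # (batch, seq_len), 1 = real token

# 1) token ids -> hidden states (word embeddings + embedding LayerNorm; randomly
#    initialised here, the real model loads these weights from the checkpoint)
word_embeddings = torch.nn.Embedding(config.vocab_size, config.hidden_size)
embedding_layernorm = torch.nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
hidden_states = embedding_layernorm(word_embeddings(input_ids))

# 2) ALiBi positional bias, shared by every block
alibi = build_alibi_tensor(padding_mask, config.n_head, dtype=hidden_states.dtype)

# 3) boolean causal mask (True = masked), broadcastable to (batch, n_head, seq, seq)
seq_len = input_ids.shape[1]
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)[None, None, :, :]

# 4) run one (randomly initialised) block; chaining blocks would repeat this step
block = BloomBlock(config)
outputs = block(hidden_states, alibi=alibi, attention_mask=causal_mask)
hidden_states = outputs[0]               # BloomBlock returns a tuple; [0] is the new hidden states
print(hidden_states.shape)

If this is roughly right, then each computer in the chain would only need to receive hidden_states (plus alibi and the mask), run its own block, and pass the result on.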
If somebody could help with defining manual inference, I would be glad. Thank you in advance for your help.