Is there any way to load a Hugging Face model across multiple GPUs and use those GPUs for inference as well?

For example, this model can be loaded on a single GPU (default cuda:0) and run for inference as below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# load the weights in fp16 and move the whole model to the first GPU
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.float16).to("cuda:0")

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

How should I load and run this model for inference on two or more GPUs using Accelerate or DeepSpeed?

Please keep in mind, this is not for training or fine-tuning a model, just inference.
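For context, here is a minimal sketch of what I understand the Accelerate-based approach to look like. I am assuming that passing device_map="auto" to from_pretrained (with the accelerate package installed) shards the model's layers across all visible GPUs, but I am not sure this is the recommended way:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

# device_map="auto" (requires the accelerate package) should split the
# model's layers across all visible GPUs instead of one device
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto",
)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=256, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

I don't know whether this kind of layer-wise sharding is the best option for throughput, or whether a DeepSpeed inference setup (e.g. tensor parallelism) would perform better for generation.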

Any guidance/help would be highly appreciated, thanks in advance!
