
I am using a fairly large GPU with around 80 GB of memory. The training epochs run fine, but for some reason I am running out of memory during evaluation (the training and validation sets are roughly the same size) and getting this error:

File "/home.../transformers/trainer_pt_utils.py", line 75, in torch_pad_and_concatenate
return torch.cat((tensor1, tensor2), dim=0)
RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total 
capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by 
PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to 
avoid fragmentation.  See documentation for Memory Management and 
 PYTORCH_CUDA_ALLOC_CONF

The training and validation data was created like this:

train_texts, train_labels = read_dataset('basic_train.tsv')
val_texts, val_labels = read_dataset('basic_val.tsv')

train_encodings = tokenizer(train_texts, truncation=False, padding=True)
val_encodings = tokenizer(val_texts, truncation=False, padding=True)

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        ...
        return item

train_dataset = Dataset(train_encodings, train_labels)
val_dataset = Dataset(val_encodings, val_labels)

My training code looks like this:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=5e-5,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=2e-5,
    eval_steps=100,
    save_steps=30000,
    evaluation_strategy='steps'
)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


metric = load_metric('accuracy')

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return metric.compute(predictions=predictions, references=labels)

def collate_fn_t5(batch):
    input_ids = torch.stack([example['input_ids'] for example in batch])
    attention_mask = torch.stack([example['attention_mask'] for example in batch])
    labels = torch.stack([example['input_ids'] for example in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,        # evaluation dataset
    compute_metrics=compute_metrics,
    data_collator=collate_fn_t5,
)

trainer.train()

eval_results = trainer.evaluate()

1 Answer


From

RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by PyTorch)

Most probably, that's because you have

  • 79.35 GiB of total GPU capacity

Then on the GPU

  • 36.51 GiB already allocated, most probably the model loaded onto GPU RAM
  • 44.82 GiB reserved, which should be the 36.51 GiB allocated plus PyTorch's allocator overhead

And you need

  • 33.84 GiB for the evaluation batch
  • but only 32.48 GiB is free, so the allocation fails

So I guess there are a few options. You can try reducing per_device_eval_batch_size, from 7 all the way down to 1, to see what works, e.g.

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    ...)

If that doesn't work, perhaps it's the default accumulation behaviour: unless eval_accumulation_steps is set, the Trainer accumulates all prediction tensors on the GPU and only moves them to the CPU at the end of the evaluation loop. See https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_accumulation_steps

You can try:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    ...)

Sometimes it's also because predict is not generating by default. I'm not sure why that would happen, but I think when it's just predicting with model.eval() or torch.no_grad() while predict_with_generate is set to False, it adds some overhead. But that's just my speculation, see https://discuss.huggingface.co/t/cuda-out-of-memory-only-during-validation-not-training/18378. Note that predict_with_generate is an argument of Seq2SeqTrainingArguments (paired with Seq2SeqTrainer), not of the plain TrainingArguments.

If so, you can try:

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
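
Since predict_with_generate only exists on Seq2SeqTrainingArguments, the trainer has to be swapped to the seq2seq variant as well. A minimal sketch of that wiring, reusing the model, datasets and collate_fn_t5 from the question:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,   # only valid on Seq2SeqTrainingArguments
)

# Seq2SeqTrainer calls model.generate() during evaluation when
# predict_with_generate=True, instead of accumulating raw logits.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=collate_fn_t5,
)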

Or you could try auto_find_batch_size, i.e.

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    predict_with_generate=True,
    auto_find_batch_size=True,
    ...)

A few more memory tricks:

# At the imports part of your code.
# See https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html
import torch
torch.cuda.set_per_process_memory_fraction(0.9)
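
The error message itself also suggests setting max_split_size_mb if reserved memory is much larger than allocated memory, to avoid fragmentation. A minimal sketch (the 128 MB value is only a starting guess; the variable has to be set before the first CUDA allocation):

# At the very top of your script, before torch touches the GPU.
# See https://pytorch.org/docs/stable/notes/cuda.html#memory-management
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch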

Then if it's still not working, try the algorithmic tricks.

From https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,

    fp16=True,
    optim="adafactor",
    gradient_checkpointing=True,

    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
  • I actually tried with a batch size of 1 in evaluation and still got the error! – Chan Wing Mar 20 '23 at 17:14
  • Let us know which of the options works! Hope one of them does. – alvas Mar 20 '23 at 17:22
  • Unfortunately, none of the above options worked. However, I wasn't able to set predict_with_generate to True as it was throwing `TypeError: __init__() got an unexpected keyword argument 'predict_with_generate'`. auto_find_batch_size was throwing `RuntimeError: No executable batch size found, reached zero.` – Chan Wing Mar 21 '23 at 09:41
  • I even tried with `per_device_train_batch_size=1, per_device_eval_batch_size=1, eval_accumulation_steps=1`, and still got the OoM error, which makes no sense – Chan Wing Mar 21 '23 at 09:43
  • Ah, `pip install -U transformers>=4.27.1` – alvas Mar 21 '23 at 11:06
  • yeah I actually installed the latest version but still no luck – Chan Wing Mar 21 '23 at 13:34
  • Does `AutoModelForSeq2SeqLM.from_pretrained("google/t5-efficient-tiny")` work for you? – alvas Mar 21 '23 at 13:40
  • Which version of huggingface are you running? – alvas Mar 21 '23 at 13:49
  • I was working on transformers '4.19.4'; after the error with "predict_with_generate" I upgraded to the newest version. I will try 'google/t5-efficient-tiny' and let you know. Thanks for your help – Chan Wing Mar 21 '23 at 13:56
  • Is the newest version 4.26 or 4.27? Cos < 4.26.1, you might encounter some bugs. – alvas Mar 21 '23 at 14:11
  • Quick question, did you load your evaluation dataset into GPU RAM? How is your `val_dataset` created? – alvas Mar 21 '23 at 14:12
  • `train_texts, train_labels = read_dataset('basic_train.tsv') val_texts, val_labels = read_dataset('basic_val.tsv') train_encodings = tokenizer(train_texts, truncation=False, padding=True) val_encodings = tokenizer(val_texts, truncation=False, padding=True) class Dataset(torch.utils.data.Dataset): def __init__(self, encodings, labels): self.encodings = encodings self.labels = labels ... return item train_dataset = Dataset(train_encodings, train_labels) val_dataset = Dataset(val_encodings, val_labels) ` – Chan Wing Mar 21 '23 at 14:23
  • google/t5-efficient-tiny gave the same error. Not sure what is going on; there has to be a mistake in my code. – Chan Wing Mar 21 '23 at 14:24
  • `truncation=False` might be the issue; if you have truncation false, it might overload the batch. Try `truncation=True`. – alvas Mar 21 '23 at 14:29
  • Could you add 1-2 sample lines from `basic_train.tsv`? That might give an idea of whether it's the text length giving you the memory issues. – alvas Mar 21 '23 at 14:30
  • I have limited my sequence length to no more than 400. I have tried different code that uses PyTorch Lightning with T5 on an 8 GB GPU with the same data and it worked fine. Very weird! – Chan Wing Mar 21 '23 at 14:33
  • I'm stuck on a similar problem, would appreciate your look [here](https://stackoverflow.com/questions/76099140/hugging-face-transformers-cuda-error-cublas-status-not-initialize) – Dolev Mitz May 01 '23 at 08:56
  • Did you find a solution @ChanWing? I have the same issue with the Trainer at the moment. – BoomBoxBoy Sep 01 '23 at 23:07