2

I want to further improve the inference time of my BERT model. Here is my current code:

for sentence in list(data_dict.values()):
    tokens = {'input_ids': [], 'attention_mask': []}
    new_tokens = tokenizer.encode_plus(sentence, max_length=512,
                                        truncation=True, padding='max_length',
                                        return_tensors='pt',
                                        return_attention_mask=True)
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

    # reformat list of tensors into single tensor
    tokens['input_ids'] = torch.stack(tokens['input_ids'])
    tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

    outputs = model(**tokens)
    embeddings = outputs[0]

Is there a way to provide batches (like in training) instead of the whole dataset?

otatopeht
  • How are you preparing the batches in your training iteration? Also, do you store activations in your example (context for that is missing), or use the `with torch.no_grad()` mode? Batching works the same way for inference as it does for training. – dennlinger Sep 15 '21 at 12:28
  • Thanks for responding! For the training, I'm using TrainingArguments and Trainer from Huggingface. As for the activations, I don't store them, so torch.no_grad(). – otatopeht Sep 15 '21 at 14:15
  • Have you considered quantising your model to use weights with low-precision data types? You can use low-precision data types with minimal impact on accuracy (see the sketch below). – ArunJose Sep 24 '21 at 05:54
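
A minimal sketch of that quantization idea, using PyTorch's built-in dynamic quantization and assuming `model` and `tokens` are the objects from the question code above (the actual speedup and accuracy impact depend on your model and hardware):

import torch

# Replace the linear layers with dynamically quantized int8 versions
# to speed up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is called exactly like the original one.
with torch.no_grad():
    outputs = quantized_model(**tokens)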

3 Answers

0

There are several optimizations that we can do here, which are (mostly) natively supported by the Huggingface tokenizer.

TL;DR: an optimized version is shown below; the ideas behind each change are explained afterwards.

def chunker(seq, batch_size=16):
    # Yield successive batch_size-sized slices of the input sequence.
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))

for sentence_batch in chunker(list(data_dict.values())):
    tokenized_sentences = tokenizer(sentence_batch, max_length=512,
                                    truncation=True, padding=True,
                                    return_tensors="pt", return_attention_mask=True)
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**tokenized_sentences)

The first optimization is to batch several samples together. For this, it is helpful to have a closer look at the tokenizer's actual __call__ function, see here (emphasis mine, on the batch option):

text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded [...].

This means it is enough to simply pass several samples at the same time, and we get the readily processed batch back. I want to personally note that it would in theory be possible to pass the entire list of samples at once, but there are also some drawbacks that we go into later.
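
For illustration, a minimal sketch of such a batched call with two made-up example sentences (assuming `tokenizer` is the Huggingface tokenizer from the question):

batch = ["A short sentence.", "A noticeably longer second sentence for this batch."]
encoded = tokenizer(batch, padding=True, return_tensors="pt")

print(encoded["input_ids"].shape)       # (2, longest_sequence_in_batch)
print(encoded["attention_mask"].shape)  # same shape as input_ids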

To actually pass a decently sized number of samples to the tokenizer, we need a function that can aggregate several samples from the dictionary (our batch-to-be) in a single iteration. I've used another Stack Overflow answer for this; see this post for several valid approaches. I've chosen the highest-voted answer, but do note that it creates an explicit copy, and might therefore not be the most memory-efficient solution. Then you can simply iterate over the batches, like so:

def chunker(seq, batch_size=16):
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))

for sentence_batch in chunker(list(data_dict.values())):
    ...

The next optimization is in the way you call your tokenizer. Your code does this in several separate steps, which can be aggregated into a single call. For the sake of clarity, I also point out which of these arguments are not required in your call (this often improves code readability).

tokenized_sentences = tokenizer(sentence_batch, max_length=512,
                                truncation=True, padding=True,
                                return_tensors="pt", return_attention_mask=True)
with torch.no_grad():  # Just to be sure
    outputs = model(**tokenized_sentences)

I want to comment on the use of some of the arguments as well:

  • max_length=512: This is only required if your value differs from the model's default max_length. For most models, this will otherwise default to 512.
  • return_attention_mask: Will also default to the model-specific values, and in most cases does not need to be set explicitly.
  • padding=True: If you noticed, this is different from your version, and arguably what gives you the biggest "out-of-the-box" speedup. With padding='max_length', every sequence is padded to 512 tokens, so the model computes quite a lot of unnecessary positions. For most real-world data I have seen, inputs tend to be much shorter, so you only need to pad up to the longest sequence in your batch, which is exactly what padding=True does (see the comparison sketch below). For actual (CPU inference) speedups, I have played around with some different sequence lengths myself; see my repository on GitHub. Notably, for the same CPU and different batch sizes, a 10x speedup is possible.

Edit: I've added the torch.no_grad() to the snippet above, too, just in case somebody else wants to use it. I generally recommend using it right before the piece of code that is actually affected by it, just so that nothing gets overlooked by accident.
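
To make the effect of the padding strategy concrete, here is a small comparison sketch on the same batch (again assuming `tokenizer` and `sentence_batch` from above):

dynamic = tokenizer(sentence_batch, truncation=True, padding=True, return_tensors="pt")
fixed = tokenizer(sentence_batch, truncation=True, padding="max_length",
                  max_length=512, return_tensors="pt")

print(dynamic["input_ids"].shape)  # (batch_size, longest_sequence_in_batch)
print(fixed["input_ids"].shape)    # (batch_size, 512), mostly padding for short inputs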

Also, there are some more possible optimizations that require a bit more insight into your data samples: if the variance of sample lengths is quite drastic, you can get an even higher speedup by sorting your samples by length (ideally tokenized length, but character length or word count will also give you an approximate idea). That way, when batching several samples together, you minimize the amount of padding that is required.
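
A minimal sketch of that idea, reusing the chunker from above and using word count as a cheap proxy for the tokenized length:

# Sort so that sentences of similar length end up in the same batch.
sentences = sorted(data_dict.values(), key=lambda s: len(s.split()))

all_embeddings = []
for sentence_batch in chunker(sentences):
    tokenized_sentences = tokenizer(sentence_batch, truncation=True, padding=True,
                                    return_tensors="pt")
    with torch.no_grad():
        outputs = model(**tokenized_sentences)
    # Sequence lengths differ between batches due to the dynamic padding.
    all_embeddings.append(outputs[0])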

dennlinger
0

You might be interested in the Intel OpenVINO backend for inference execution on CPU? It's currently a work in progress in this branch: https://github.com/huggingface/transformers/pull/14203

Dmitry Kurtaev
0

I had the same issue with BERT inference time on the CPU. I started using HuggingFace Pipelines for inference, and the Trainer for training. It's well documented on HuggingFace.

The pipeline makes it simple to perform inference on batches: you can run inference in a single pass over a list of texts instead of looping over them one by one, as sketched below.
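
As a rough sketch of that approach (the checkpoint name is only an example; use the model you actually trained, and feature-extraction is assumed here since the question computes embeddings):

from transformers import pipeline

# Example checkpoint; replace it with the model you are actually using.
extractor = pipeline("feature-extraction", model="bert-base-uncased", device=-1)  # -1 = CPU

sentences = list(data_dict.values())
# The pipeline handles tokenization, batching, and gradient-free execution internally.
embeddings = extractor(sentences, batch_size=16)
# One nested list of token embeddings is returned per input sentence.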

Bill