Tokenizing & encoding dataset uses too much RAM

Question

Trying to tokenize and encode data to feed to a neural network.

I only have 25GB RAM and everytime I try to run the code below my google colab crashes. Any idea how to prevent his from happening? “Your session crashed after using all available RAM”

I thought tokenize/encoding chunks of 50000 sentences would work but unfortunately not. The code works on a dataset with length 1.3 million. The current dataset has a length of 5 million.

max_q_len = 128
max_a_len = 64    
trainq_list = train_q.tolist()    
batch_size = 50000
    
def batch_encode(text, max_seq_len):
      for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length = max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
      return encoded_sent

    # tokenize and encode sequences in the training set
    tokensq_train = batch_encode(trainq_list, max_q_len)

The tokenizer comes from HuggingFace:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')

score 3 · Accepted Answer · answered Mar 22 '21 at 15:02

You should use generators and pass data to tokenizer.batch_encode_plus, no matter the size.

Conceptually, something like this:

Training list

This one probably holds list of sentences, which is read from some file(s). If this is a single large file, you could follow this answer to lazily read parts of the input (preferably of batch_size lines at once):

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

Otherwise open a single file (much smaller than memory, because it will be way larger after encoding using BERT), something like this:

import pathlib


def read_in_chunks(directory: pathlib.Path):
    # Use "*.txt" or any other extension your file might have
    for file in directory.glob("*"):
        with open(file, "r") as f:
            yield f.readlines()

Encoding

Encoder should take this generator and yield back encoded parts, something like this:

# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
    for text in generator:
        yield tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False,
        )

Saving encoded files

As the files will be too large to fit in RAM memory, you should save them to disk (or use somehow as they are generated).

Something along those lines:

import numpy as np


# I assume np.arrays are created, adjust to PyTorch Tensors or anything if needed
def save(encoding_generator):
    for i, encoded in enumerate(encoding_generator):
        np.save(str(i), encoded)

Great, thanks! In my case, file_object, is already a pandas df loaded into the notebook, this doesn't seem to use more than 1gb of RAM. So, I would need to write a generator that takes this df instead of the file_object? — Exa, Mar 22 '21 at 15:42
@Exa Yeah, takes this `df` and yields slices of it (say `64` examples, the more the better, but keep in mind RAM constraints), probably as a list. — Szymon Maszke, Mar 22 '21 at 17:15
Okay, thanks! Do you think it would make sense to move the tokenizing and encoding to the training loop? So, instead of having a separate function like above it would be included in something like run_training(). — Exa, Mar 23 '21 at 08:41
Use many small functions in general, makes it easier to follow, so I'd say no. — Szymon Maszke, Mar 23 '21 at 09:52
@SzymonMaszke Can you please join all steps together for input as a list of sentences. I am also having OOM issue for 55 million data point when doing the encoding. I am not able to understand how to exactly code the steps. Please help — MAC, May 05 '22 at 04:18

Tokenizing & encoding dataset uses too much RAM

1 Answers1

Training list

Encoding

Saving encoded files