I am looking to speed up using huggingface's tokenizer to tokenize millions of examples.
Currently I have a pandas column of strings, and I tokenize it by defining a function that wraps the tokenizer call and applying it with pandas `map` to transform the column of texts, roughly like the sketch below.
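Here is a minimal sketch of what I'm doing; the model name (`bert-base-uncased`) and the column name (`text`) are just placeholders for illustration:

```python
import pandas as pd
from transformers import AutoTokenizer

# Placeholder model; in practice this is whatever checkpoint I'm training with
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Placeholder data; the real DataFrame has millions of rows
df = pd.DataFrame({"text": ["example sentence one", "example sentence two"]})

def tokenize(text):
    # Tokenize a single string; returns input_ids, attention_mask, etc.
    return tokenizer(text, truncation=True)

# One tokenizer call per row -- this is the slow part
df["tokens"] = df["text"].map(tokenize)
```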
This is slow with millions of rows of text, and I am wondering if there's a faster way to tokenize all of my training examples.
I am not limited to pandas in particular.