I am looking to speed up using Hugging Face's tokenizer to tokenize millions of examples.

Currently I have a pandas column of strings, and I tokenize it by defining a function that performs the tokenization and passing that function to pandas map to transform the column of texts.
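
Roughly, my current approach looks like this (a simplified sketch; the model name, column name, and toy data are placeholders, not my actual setup):

import pandas as pd
from transformers import AutoTokenizer

# toy frame standing in for the real data
df = pd.DataFrame({"text": ["first example sentence", "second example sentence"]})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(text):
    # one Python-level tokenizer call per row -- the slow part at scale
    return tokenizer(text, truncation=True)["input_ids"]

df["input_ids"] = df["text"].map(tokenize)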

It's a slow process when I have millions of rows of text, and I am wondering if there's a faster way to tokenize all my training examples.

I am not limited to pandas in particular.

SantoshGupta7

1 Answer

Maybe you can try swifter to run the pandas apply across multiple processes.

EDIT

Here is my sample code.

import swifter  # registers the .swifter accessor on pandas objects

num_processors = 5

def do_something(text):
    pass  # replace with your tokenization call

df['text'].swifter.set_npartitions(num_processors).apply(do_something)
  • Please don't just post some tool or library as an answer. At least demonstrate [how it solves the problem](http://meta.stackoverflow.com/a/251605) in the answer itself. –  Jun 06 '22 at 04:06
  • This doesn't seem to give a speed increase in my case. It might have something to do with the tokenizer coming from a different library, and that library being a wrapper around code written in the Rust programming language. – SantoshGupta7 Jun 06 '22 at 05:22
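
As the last comment notes, Hugging Face's "fast" tokenizers are implemented in Rust and already parallelize over a batch internally, which is why a Python-level multiprocessing wrapper adds little. A minimal sketch of leaning on that instead, assuming a fast tokenizer and an illustrative model name, is to tokenize the whole column in one batched call:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# one batched call over the whole column; the Rust backend
# tokenizes the list of texts in parallel internally
encodings = tokenizer(df["text"].tolist(), truncation=True)
df["input_ids"] = encodings["input_ids"]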