I am looking to speed up using Hugging Face's tokenizer to tokenize millions of examples.

Currently I have a pandas column of strings, and I tokenize it by defining a function that performs the tokenization and passing that function to pandas map to transform the column of texts.
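
Roughly, my current approach looks like this (a simplified sketch; the model name, column name, and toy data are placeholders, not my actual setup):

import pandas as pd
from transformers import AutoTokenizer

# toy frame standing in for the real data
df = pd.DataFrame({"text": ["first example sentence", "second example sentence"]})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(text):
    # one Python-level tokenizer call per row -- the slow part at scale
    return tokenizer(text, truncation=True)["input_ids"]

df["input_ids"] = df["text"].map(tokenize)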

It's a slow process when I have millions of rows of text, and I am wondering if there's a faster way to tokenize all my training examples.

I am not limited to pandas in particular.

SantoshGupta7

1 Answer

Maybe you can try swifter to run the pandas apply across multiple processes.

EDIT

Here is my sample code.

import swifter  # registers the .swifter accessor on pandas objects

num_processors = 5

def do_something(text):
    pass  # replace with your tokenization call

df['text'].swifter.set_npartitions(num_processors).apply(do_something)
  • Please don't just post some tool or library as an answer. At least demonstrate [how it solves the problem](http://meta.stackoverflow.com/a/251605) in the answer itself. –  Jun 06 '22 at 04:06
  • This doesn't seem to give a speed increase in my case. It might have something to do with the tokenizer coming from a different library, and that library being a wrapper around code written in the Rust programming language. – SantoshGupta7 Jun 06 '22 at 05:22
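
As the last comment notes, Hugging Face's "fast" tokenizers are implemented in Rust and already parallelize over a batch internally, which is why a Python-level multiprocessing wrapper adds little. A minimal sketch of leaning on that instead, assuming a fast tokenizer and an illustrative model name, is to tokenize the whole column in one batched call:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# one batched call over the whole column; the Rust backend
# tokenizes the list of texts in parallel internally
encodings = tokenizer(df["text"].tolist(), truncation=True)
df["input_ids"] = encodings["input_ids"]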