Faster alternative to Pandas apply, text data

Question

I have a Pandas DataFrame with two columns, "text" and "status":

text,status
Great!!,pos
I dunno.,neut
Bad.,neg

There are about 6000 rows.

The text field consists of short sentences. I ran:

dataset["text"] = dataset["text"].apply(strip_punctuation)

where strip_punctuation performs some string operations and returns a string. The function is fast when I call it on individual strings, but when I pass it to apply it is extremely slow, and I don't know why.

Any help is appreciated!

@Merlin, I'm on cellphone, not a proper computer struggling to format ;) — dogacanb, Jun 10 '16 at 00:27
Possible Dup - http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python — Merlin, Jun 10 '16 at 00:36
@Merlin , I don't have only punctuation stripping in the function, there are other string operations as well. — dogacanb, Jun 10 '16 at 00:43
@cricket_007 , it processed 2400 rows in 2 hours?!? I fed sentences to the function by hand on the shell, every sentence takes less than 1 sec. I have zero idea what is going on. — dogacanb, Jun 10 '16 at 00:47
"Function works on strings fast" clearly this can't be true. — Andy Hayden, Jun 10 '16 at 00:50
@Andy Hayden sentences are short, at most 5-6 words. Function is a string processing function at the end but strings are short :) — dogacanb, Jun 10 '16 at 00:54
Try dataset["text"] = map(strip_punctuation, dataset["text"]) — Kevin K., Jun 10 '16 at 00:57
@Kevin K, what is difference here between apply and map by performance? — dogacanb, Jun 10 '16 at 01:01
@dogacanb The built in map function should be able to perform the operation in parallel. Pandas is notorious for lacking parallelism. — Kevin K., Jun 10 '16 at 01:08

score 4 · Answer 1 · edited Sep 03 '19 at 04:03

apply essentially loops over the data sequentially in Python, calling your function once per row (or, for a Series as in the question, once per element). That is slow if your DataFrame is big.

Using vectorized string methods like the following can improve performance, at the cost of more complexity or less functionality.

df['text'] = df['text'].str.replace('someregextoremovepunctuation', '', regex=True)
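
As a rough sketch of what that could look like on the sample data from the question: the snippet below lowercases the text and strips ASCII punctuation using only the .str accessor. The pattern built from string.punctuation is an assumption about what "punctuation" means here, not the asker's actual strip_punctuation logic.

```python
import re
import string

import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "text": ["Great!!", "I dunno.", "Bad."],
    "status": ["pos", "neut", "neg"],
})

# Assumption: "punctuation" means the ASCII punctuation characters.
# re.escape keeps characters such as ']' and '\' literal inside the class.
punct_pattern = "[" + re.escape(string.punctuation) + "]"

df["text"] = (
    df["text"]
    .str.lower()                                  # lowercase every sentence
    .str.replace(punct_pattern, "", regex=True)   # drop punctuation characters
    .str.strip()                                  # trim leftover whitespace
)

print(df["text"].tolist())
```

Passing regex=True explicitly avoids depending on the default, which has changed between pandas versions.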

answered Jun 10 '16 at 00:35 by gnicholas · edited Sep 03 '19 at 04:03 by NelsonGon

Well, problem is that function is not only one regex operation, it has a sequence of regex replaces, splits... – dogacanb Jun 10 '16 at 00:42
@dogacanb But you should be able to rewrite it as a single regex. Of course, without the code it's hard to say... – Andy Hayden Jun 10 '16 at 00:51
It calls other functions as well. I first lowercase the sentence, strip off some punctuations then do stemming. I can't stem with a regex :) – dogacanb Jun 10 '16 at 00:53
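
For a step that cannot be written as a regex, such as the stemming mentioned in the comments, one option that might help is to cache results per word so that each distinct word is processed only once. The sketch below is illustrative only: toy_stem is a hypothetical stand-in for a real stemmer, and whether caching helps depends on where the time is actually being spent.

```python
from functools import lru_cache

import pandas as pd


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: drops a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


@lru_cache(maxsize=None)
def cached_stem(word: str) -> str:
    # Each distinct word is stemmed once; repeated words hit the cache.
    return toy_stem(word)


def stem_sentence(sentence: str) -> str:
    # Split on whitespace, stem each word, and join back into a sentence.
    return " ".join(cached_stem(word) for word in sentence.split())


texts = pd.Series(["good movies", "bad movies", "good actors"])

# A plain list comprehension over the values; still a Python-level loop,
# but the expensive per-word work is done only once per distinct word.
stemmed = pd.Series([stem_sentence(s) for s in texts], index=texts.index)

print(stemmed.tolist())
print(cached_stem.cache_info())
```

If the real function creates a stemmer object or loads a resource on every call, moving that setup outside the function is worth checking first.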
