I have a pandas DataFrame with two columns, "text" and "status":

text,status
Great!!, pos
I dunno., neut
Bad.,neg

There are about 6000 rows.

The text field consists of short sentences. I ran:

dataset["text"] = dataset["text"].apply(strip_punctuation)

where strip_punctuation performs some string operations and returns a string. The function is fast on individual strings, but when I pass it to apply the performance is a disaster, and I don't know why.

Any help is appreciated!

Merlin
dogacanb

1 Answer

apply essentially does a sequential scan and calls your Python function once per value (once per row in the case of DataFrame.apply). That is slow if your DataFrame is big.

Using vectorized methods like the following can improve performance, at the cost of more complexity or less functionality:

df['text'] = df['text'].str.replace('someregextoremovepunctuation', '', regex=True)
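
For illustration, here is a minimal sketch of how the lowercasing and punctuation steps might be chained with the vectorized `.str` methods. The real strip_punctuation isn't shown in the question, so this assumes "punctuation" means the characters in Python's `string.punctuation`; the toy rows mirror the sample in the question.

```python
import re
import string

import pandas as pd

# Toy data in the shape described in the question.
df = pd.DataFrame({
    "text": ["Great!!", "I dunno.", "Bad."],
    "status": ["pos", "neut", "neg"],
})

# Assumption: the characters to strip are exactly string.punctuation.
# re.escape makes them safe to place inside a regex character class.
punct_pattern = "[" + re.escape(string.punctuation) + "]"

# Chain the vectorized string methods instead of calling a custom
# Python function through apply.
df["text"] = (
    df["text"]
    .str.lower()
    .str.replace(punct_pattern, "", regex=True)
    .str.strip()
)

print(df["text"].tolist())
```

Whether this actually beats apply depends on the data and on what the original function does, so it is worth timing both versions on the real 6000 rows.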
NelsonGon
gnicholas
  • Well, the problem is that the function is not just one regex operation; it has a sequence of regex replaces, splits... – dogacanb Jun 10 '16 at 00:42
  • @dogacanb But you should be able to rewrite it as a single regex. Of course, without the code it's hard to say... – Andy Hayden Jun 10 '16 at 00:51
  • It calls other functions as well. I first lowercase the sentence, strip some punctuation, then do stemming. I can't stem with a regex :) – dogacanb Jun 10 '16 at 00:53
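
When a step such as stemming really can't be expressed as a regex, one possible compromise is to keep the cheap steps vectorized and leave only the stemming in apply, caching results so each distinct word is stemmed once. The sketch below is only an illustration: `stem` is a hypothetical stand-in (it just trims a trailing "s"), not a real stemmer, and `cached_stem` and `stem_sentence` are names made up for this example.

```python
from functools import lru_cache

import pandas as pd


def stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer; it only trims a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


@lru_cache(maxsize=None)
def cached_stem(word: str) -> str:
    # Each distinct word is stemmed once; repeats are served from the cache.
    return stem(word)


def stem_sentence(sentence: str) -> str:
    # Split on whitespace, stem each word, and join the words back together.
    return " ".join(cached_stem(w) for w in sentence.split())


df = pd.DataFrame({"text": ["Great movies!!", "I dunno.", "Bad movies."]})

# Vectorized cleanup first, then a per-row Python call for stemming only.
cleaned = df["text"].str.lower().str.replace(r"[^\w\s]", "", regex=True)
df["text"] = cleaned.apply(stem_sentence)

print(df["text"].tolist())
```

With short sentences and a limited vocabulary, most words repeat across rows, so the cache may absorb much of the stemming cost. Profiling the original function would show whether stemming is the bottleneck in the first place.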