How to speed up my Python apply function across a DataFrame

Question

I have a rather large data set and I am trying to calculate the sentiment of each document. I am using VADER to calculate the sentiment with the following code, but the process takes over 6 hours to run. I am looking for any way to speed it up.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

%time full_trans['bsent'] = full_trans['body_text'].apply(lambda row: analyzer.polarity_scores(row))

Any thoughts would be great because looping through the rows like this is terribly inefficient.

As an example, I have run my code on a mini sample of 100 observations. The results from the alternative forms of code are below. My original code is first, the suggested change to a list comprehension is second. It seems strange that there is no increase in performance between the two methods.

transtest = full_transx.copy(deep=True)

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

%time transtest['bsent'] = [analyzer.polarity_scores(row) for row in transtest['body_text']]

%time full_transx['bsent'] = full_transx['body_text'].apply(lambda row: analyzer.polarity_scores(row))

Wall time: 4min 11s

Wall time: 3min 59s

I think you can find the answer [here](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code/54432584#54432584). If you want to go deeper, there are [cython](https://docs.cython.org/en/latest/) and [numba](https://numba.pydata.org/) libraries. — Cobra, Jul 04 '19 at 13:08

score 0 · Answer 1 · answered Jul 04 '19 at 13:25

I assume that full_transx['body_text'] is a Series of strings. In that case it is often much more efficient to iterate over the underlying NumPy array in a list comprehension:

full_trans['bsent'] = [analyzer.polarity_scores(row) for row in full_trans['body_text'].values]
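The timings in the question (about 4 minutes for 100 rows, so roughly 2.4 to 2.5 seconds per document) suggest that most of the time is spent inside the scoring call rather than in the iteration itself. The sketch below is one way to check that, under stated assumptions: `slow_score` is a hypothetical stand-in for `analyzer.polarity_scores`, and the 200-row DataFrame is made-up sample data, not the asker's.

```python
import time

import pandas as pd


def slow_score(text):
    """Hypothetical stand-in for analyzer.polarity_scores (not VADER itself)."""
    time.sleep(0.002)  # simulate an expensive per-document computation
    return {"length": len(text)}


# Made-up sample data standing in for the phone-call transcripts.
df = pd.DataFrame({"body_text": ["sample transcript text"] * 200})

# Variant 1: Series.apply, as in the question.
start = time.perf_counter()
via_apply = df["body_text"].apply(slow_score).tolist()
apply_seconds = time.perf_counter() - start

# Variant 2: list comprehension over the underlying NumPy array.
start = time.perf_counter()
via_list = [slow_score(text) for text in df["body_text"].values]
list_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.3f}s, list comprehension: {list_seconds:.3f}s")
```

If the two timings come out close, as they did in the question, the per-row function is the bottleneck, and changing how the rows are iterated is unlikely to help much on its own.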

answered Jul 04 '19 at 13:25 by Serge Ballesta

Sorry for leaving that out, yes, 'body_text' is a DataFrame column where each row is the transcript from a phone call. – krats Jul 04 '19 at 13:32

score 0 · Answer 2 · answered Jul 04 '19 at 14:25

It is not efficient to loop through NumPy arrays. I suggest finding a way to apply the function to the array itself. I am not able to test it, but perhaps you can try analyzer.polarity_scores(full_trans['body_text'].values)
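Whether a function can be applied to a whole array depends on whether it was written to accept one. As a hedged illustration, the sketch below uses a hypothetical stand-in, `toy_score`, that takes a single string (it is not VADER's API) and applies it element-wise with `np.vectorize`. NumPy's documentation describes `np.vectorize` as provided primarily for convenience, not performance, and as essentially a for loop, so this shows the shape of the call rather than a speedup.

```python
import numpy as np


def toy_score(text):
    """Hypothetical stand-in for a per-document scorer; takes ONE string."""
    words = text.split()
    return {"n_words": len(words), "n_chars": len(text)}


# Made-up sample documents stored as an object array of strings.
texts = np.array(["good call", "very bad call today"], dtype=object)

# Element-wise application over the array; otypes=[object] keeps each dict intact.
score_all = np.vectorize(toy_score, otypes=[object])
vectorized = score_all(texts).tolist()

# The same result from a plain list comprehension.
looped = [toy_score(text) for text in texts]

print(vectorized == looped)
```

Both variants call the Python function once per document, so the total cost is still the number of documents times the cost of one call.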

answered Jul 04 '19 at 14:25 by Axois
