How do I efficiently loop over this dataframe and perform a function using inbuilt numpy or pandas?

Question

I read this article earlier and noticed that the pandas apply function, iterrows, and plain for loops are slow and inefficient ways of working with pandas dataframes.

I am doing sentiment analysis on some text data, but using apply causes high memory usage and low speed, similar to what is shown in this answer.

%%time
data.merge(data.essay.apply(lambda s: pd.Series({'neg':sid.polarity_scores(s)['neg'],
                                                 'neu':sid.polarity_scores(s)['neu'],
                                                 'pos':sid.polarity_scores(s)['pos'],
                                                 'compound':sid.polarity_scores(s)['compound']})),
                       left_index=True, right_index=True)

How can I implement this using built-in numpy or pandas functions?

Edit: the column contains essay text data.

You could try [swifter](https://github.com/jmcarpenter2/swifter) — luigigi, Jan 07 '20 at 08:27
Checked it. It seems to perform even worse than pandas apply: in my case swifter falls back to pandas apply but also runs sample applies first, which adds overhead. — dracarys3, Jan 07 '20 at 09:08

score 0 · Accepted Answer · answered Jan 07 '20 at 14:04

I found a way to run this faster: pandarallel.

With the default pandas apply function the operation took 9 min 24 s, but with pandarallel it completed in just 1 min 7 s (using 16 workers).
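For reference, here is a minimal sketch of what the pandarallel version might look like. It is not the exact code I ran. The `polarity_scores` function below is a hypothetical stand-in for `sid.polarity_scores` so the snippet runs without NLTK, the three-row `data` frame is made up, and `nb_workers=2` is only for illustration (I used 16). It also scores each essay once instead of four times, and falls back to plain apply if pandarallel is not installed.

```python
import pandas as pd

try:
    from pandarallel import pandarallel

    # Adds parallel_apply to pandas objects; the timing above used 16 workers.
    pandarallel.initialize(nb_workers=2, progress_bar=False)
    HAVE_PANDARALLEL = True
except ImportError:
    # Assumption for this sketch: fall back to the ordinary (serial) apply.
    HAVE_PANDARALLEL = False


def polarity_scores(text):
    """Hypothetical stand-in for sid.polarity_scores: returns the same four keys."""
    words = text.lower().split()
    n = max(len(words), 1)
    pos = sum(w in {"good", "great"} for w in words) / n
    neg = sum(w in {"bad", "poor"} for w in words) / n
    return {"neg": neg, "neu": 1.0 - pos - neg, "pos": pos, "compound": pos - neg}


def add_scores(data):
    """Score each essay once and join the four score columns onto the frame."""
    apply = data.essay.parallel_apply if HAVE_PANDARALLEL else data.essay.apply
    scores = apply(polarity_scores)  # Series holding one dict per essay
    return data.join(pd.DataFrame(scores.tolist(), index=data.index))


data = pd.DataFrame({"essay": ["a good essay", "a bad poor essay", "plain text"]})
result = add_scores(data)
print(result)
```

With the real analyzer, `polarity_scores` would be replaced by `sid.polarity_scores`; the rest should stay the same, though I have only timed the apply-based variant described above.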

dracarys3
