My text is derived from a social network, so you can imagine its nature. I think the text is about as clean and minimal as I can make it, after performing the following sanitization (a rough sketch of these steps follows the list):
- no urls, no usernames
- no punctuation, no accents
- no numbers
- no stopwords (I think VADER does this anyway)
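A minimal sketch of what those sanitization steps could look like, assuming plain Python with `re`, `unicodedata`, and NLTK's English stopword list; the function name, regexes, and stopword source are my assumptions, not the code actually used:

import re
import unicodedata
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def sanitize(text):
    # rough equivalent of the cleaning steps listed above (an assumption, not the original code)
    text = re.sub(r'https?://\S+', ' ', text)   # drop URLs
    text = re.sub(r'@\w+', ' ', text)           # drop usernames/mentions
    text = unicodedata.normalize('NFKD', text)  # split accented characters
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop the accents
    text = re.sub(r'[^A-Za-z\s]', ' ', text)    # drop punctuation and numbers
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return ' '.join(tokens)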
I think the run time is linear, and I don't intend to do any parallelization because of the effort needed to change the available code. For example, for around 1,000 texts ranging from ~50 KB to ~150 KB each, the running time is around 10 minutes on my machine.
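For context, this is a minimal sketch of how one could check that per-text scoring time grows roughly linearly with text length; the `texts` variable and the timing loop are assumptions for illustration, not part of the original script, and I'm assuming the NLTK packaging of VADER here:

import time
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()
# time each text separately and compare against its length
for s in texts:  # `texts` assumed to hold the cleaned post bodies
    start = time.perf_counter()
    sid.polarity_scores(s)
    print(len(s), time.perf_counter() - start)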
Is there a better way of feeding the algorithm to speed up processing time? The code is as simple as SentimentIntensityAnalyzer is intended to be used; here is the main part:
import gc
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # or the standalone vaderSentiment package

sid = SentimentIntensityAnalyzer()

# `conn` and `c` are an existing psycopg2 connection and cursor; `s` is the search pattern
c.execute("select body, creation_date, group_id from posts "
          "where substring(lower(body) from %s) = %s and language = 'en' "
          "order by creation_date desc", (s, s))
conn.commit()  # not strictly needed for a plain select
if c.rowcount > 0:
    dump_fetched = c.fetchall()
    textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
    del dump_fetched  # free the raw rows before scoring
    gc.collect()
    texts = textsSql['body'].values
    # here, some data manipulation: steps listed above
    polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
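For completeness, the scores could then be attached back to the frame so they stay aligned with `created_at` and `group_id` (a sketch; the column name is my assumption):

    textsSql['compound'] = polarity_  # hypothetical column name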