How to speed up the counting of matching strings in a large DataFrame?

Question

I have a list of keywords and I would like to count how many times each keyword appears in each article. I have more than half a million articles (in a DataFrame), and I already have code that produces the desired result. However, it takes around 40-50 seconds to count the instances of all keywords in each article of the DataFrame. I am looking for something more efficient.

I have been using str.count() inside a for loop:

count_matrix = pd.DataFrame()
for word in keywords:
    count_matrix[str(word)] = df['article'].str.count(word)

The output is exactly what I want; the only problem is that it takes around 40-50 seconds to compute, since df['article'] holds more than half a million articles. Any suggestions to make it more efficient would be highly appreciated.

As much as this trashes the pandas ethos, I suspect `collections.Counter` will be faster — roganjosh, Aug 16 '19 at 15:44
Thanks for the reply, rogan. Since I am new to Python, could you please elaborate on how to achieve the same result with collections.Counter? — Muzammil123, Aug 16 '19 at 15:46

score 0 · Answer 1 · answered Aug 16 '19 at 15:58

Options:

Convert the collection of text documents to a matrix of token counts with scikit-learn's CountVectorizer.
Construct a bag of words with Gensim or NLTK.
Load massive files in chunks with pandas.

ramm


roganjosh · Answer 2 · 2019-08-16T16:07:35.833

You want a counter of some kind here. Don't keep traversing the entire DataFrame for each word you're looking for; traverse it once and collect the word counts. I'm not gonna lie, I suspect there is a better pandas method for this, but you can build a counter this way:

import random
import string

from collections import defaultdict

import pandas as pd


df = pd.DataFrame({'a': [''.join(random.choices(list(string.ascii_lowercase),
                                                k=10))
                    for x in range(10000)]})

counts = defaultdict(int)  # missing words start at 0

for text in df['a']:
    # split() is a no-op here because the random strings contain no whitespace;
    # with real article text it yields the individual words
    for item in text.split():
        counts[item] += 1

Normally, iterative approaches and pandas don't mix well. This seems like a corner case that I can't see how to improve without Python-level iteration.
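The same single-pass idea can be pushed one step further to give per-article counts for just the keywords, which is the shape of the count_matrix in the question. The following is only a rough sketch under some assumptions: the `articles` and `keywords` lists are made-up stand-ins for df['article'] and the real keyword list, the helper `keyword_counts` is a hypothetical name, and a "word" is assumed to be a lowercase run of `\w` characters. Note that this counts whole words, whereas str.count() counts regex matches anywhere in the string (so "cat" would also match inside "category"), so the numbers can differ from the original loop.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for df['article'] and the keyword list.
articles = [
    "the cat sat on the mat",
    "a dog and a cat met another dog",
    "no pets here",
]
keywords = ["cat", "dog", "mat"]

# Assumption: a "word" is a run of letters, digits or underscores.
TOKEN = re.compile(r"\w+")


def keyword_counts(text, wanted):
    """Count every token of `text` in one pass, then keep only the keywords."""
    tokens = Counter(TOKEN.findall(text.lower()))
    # A Counter returns 0 for words it has never seen.
    return {word: tokens[word] for word in wanted}


# One dict per article, each keyed by keyword.
rows = [keyword_counts(text, keywords) for text in articles]
for row in rows:
    print(row)
```

Each article is tokenized once, no matter how many keywords there are, and the keyword lookup is a dictionary access. If a DataFrame is needed at the end, passing `rows` to pd.DataFrame should give one column per keyword. Multi-word keywords would not be matched by this tokenization and would need a different approach.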
