
I have a list of keywords and I would like to count how many times each keyword appears in each article. I have more than half a million articles (in a DataFrame), and I already have code that produces the desired result. However, it takes around 40-50 seconds to count the instances of all keywords across the articles. I am looking for something more efficient.

I have been using the str.count() method, along with a for loop:

# One column per keyword; every iteration scans all articles again.
count_matrix = pd.DataFrame()
for word in keywords:
    count_matrix[str(word)] = df['article'].str.count(word)

The output is exactly what I want; the only problem is the 40-50 second runtime, given that df['article'] holds more than half a million articles. Any suggestions for making it more efficient would be highly appreciated.

roganjosh
  • As much as this trashes the pandas ethos, I suspect `collections.Counter` will be faster – roganjosh Aug 16 '19 at 15:44
  • Thanks for the reply, rogan. Since I am new to Python, could you please elaborate on how to achieve the same results with collections.Counter? – Muzammil123 Aug 16 '19 at 15:46

2 Answers


Options:

  1. Convert the collection of text documents to a matrix of token counts with scikit-learn's CountVectorizer.

  2. Construct a bag of words with Gensim or NLTK.

  3. Load massive files in chunks with pandas (the chunksize parameter of read_csv).
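
For the first option, here is a minimal sketch of how CountVectorizer could be restricted to a fixed keyword list. The sample articles and keywords are made up for illustration; in the question they would be df['article'] and keywords. Note that this counts whole-word tokens (by default, tokens of two or more characters, lowercased), so the numbers can differ from str.count, which counts regex matches anywhere in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data standing in for df['article'] and the keyword list.
articles = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "no pets here",
]
keywords = ["cat", "dog", "the"]

# A fixed vocabulary restricts counting to the keywords, one column per keyword,
# in the order given.
vectorizer = CountVectorizer(vocabulary=keywords)

# With a fixed vocabulary there is nothing to learn, so transform() is enough.
counts = vectorizer.transform(articles)  # SciPy sparse matrix, shape (3, 3)

# Dense nested list, only sensible for small inputs like this one.
dense = counts.toarray().tolist()
```

The result is sparse, which is probably what you want with half a million articles; converting everything to a dense array may use a lot of memory.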

ramm

You want a counter of some kind here. Don't traverse the entire DataFrame once per keyword; traverse it once and collect the word counts. I'm not gonna lie, I suspect there is a better pandas method for this, but you can build a counter this way:

import random
import string

from collections import defaultdict

import pandas as pd


# Dummy data: 10,000 random 10-letter strings standing in for the articles.
df = pd.DataFrame({'a': [''.join(random.choices(string.ascii_lowercase, k=10))
                         for _ in range(10000)]})

counts = defaultdict(int)  # missing words start at 0

for text in df['a']:
    # split() does nothing useful on this dummy data (no whitespace), but on
    # real articles it breaks the text into words to iterate over.
    for item in text.split():
        counts[item] += 1

Normally, iterative approaches and pandas don't mix at all. This seems like a corner case that I can't see how to improve without Python iteration.
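
The loop above tallies words across the whole DataFrame. If you need one row of keyword counts per article, as in the question, a rough sketch along the same lines might look like the following. The function name keyword_counts, the \w+ tokenizer and the lowercasing are my assumptions, not something from the question, and it only works for single-word keywords matched as whole words.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of word characters; adjust to your data.
TOKEN_RE = re.compile(r"\w+")


def keyword_counts(articles, keywords):
    """Return one {keyword: count} dict per article, tokenizing each article once."""
    wanted = list(keywords)
    rows = []
    for text in articles:
        # Count every token in the article in a single pass.
        tokens = Counter(TOKEN_RE.findall(text.lower()))
        # A Counter returns 0 for missing keys, so absent keywords need no special case.
        rows.append({kw: tokens[kw.lower()] for kw in wanted})
    return rows


# Hypothetical sample data standing in for df['article'] and the keyword list.
rows = keyword_counts(
    ["The cat sat on the mat", "the dog chased the cat", "no pets here"],
    ["cat", "dog", "the"],
)
```

Passing df['article'] as articles and wrapping the result in pd.DataFrame(...) should give a matrix shaped like count_matrix. Whether it is actually faster than str.count depends on how many keywords you have, so time it on your own data.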

roganjosh