Count all occurrences of each word from a list that appear in several thousand records in Python

Question

I have a list of reviews and a list of keywords, and I am trying to count how many times each keyword appears in each review. The keyword list has roughly 30 entries and could grow or change. There are currently about 5000 reviews, ranging from 3 to several hundred words each, and that number will definitely grow. Right now the keyword list is static and the number of reviews will not grow too much, so any solution that gets the keyword counts for each review will work. Ideally, though, it would be one without a major performance issue if the number of reviews increases drastically or if the keywords change and all the reviews have to be reanalyzed.

I have been reading through different methods on Stack Overflow and haven't been able to get any to work. I know you can use scikit-learn to get the count of each word, but I haven't figured out whether there is a way to count a phrase. I have also tried various regular expressions. If the keyword list were all single words, I know I could very easily use scikit-learn, a loop, or regex, but I am having issues when a keyword has multiple words. Two links I have tried:

Python - Check If Word Is In A String

Phrase matching using regex and Python

The solution here is close, but it doesn't count all occurrences of the same word: How to return the count of words from a list of words that appear in a list of lists?

Both the list of keywords and the reviews are pulled from a MySQL DB. All keywords are in lowercase. All review text has been made lowercase, and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but not knowing how to handle counting a phrase, I switched. I am currently attempting this with loops and regex, but I am open to any solution.

# Example of what I am currently attempting with regex
import re

keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']

for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b',review)  #this returns no results, the variable word is not getting picked up
        #--also tried variations of this to no avail
        #--tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern,review)  #this errors with the msg: sre_constants.error: multiple repeat at position 9


#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1

@user1767754 I tried the various regexes shown above in the code, along with re.iterall. For re.findall(r'\bword\b',review), I am not sure why no value is getting passed to the variable word – soccergal_66, Nov 25 '17 at 00:33

score 0 · Answer 1 · answered Nov 25 '17 at 00:36

None of the options you tried search for the value of word:

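results = re.findall(r'\bword\b', review) searches for the literal text "word" in the string, not for the value of the variable word.
When you try pattern = "r'\\b" + word + "\\b'", you search for the literal string r'\b[value of word]\b', including the leading r and the quote characters.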

You can use the first option, but the pattern should be r'\b%s\b' % word. That will search for the value of word.
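As a minimal sketch of that fix applied to the data from the question: the helper name count_keyword and the use of re.escape are additions for illustration, not part of the original answer. re.escape is there on the assumption that a keyword might contain regex metacharacters such as + or *, which can otherwise trigger errors like "multiple repeat".

```python
import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]


def count_keyword(keyword, text):
    """Count whole-word occurrences of keyword in text (hypothetical helper)."""
    # Interpolate the escaped keyword between word boundaries
    pattern = r'\b%s\b' % re.escape(keyword)
    return len(re.findall(pattern, text))


# One dict of keyword counts per review, in the same order as reviews
counts_per_review = [
    {keyword: count_keyword(keyword, review) for keyword in keywords}
    for review in reviews
]

for number, counts in enumerate(counts_per_review, start=1):
    print('review%d:' % number, counts)
```

Because of the \b word boundaries, "testing" in the first review is not counted as "test", which matches the expected results in the question.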

Ignacio Pascual

Thanks, that worked. Does %s in a regex work like $1 in a shell script, where you take the first parameter passed to the command? – soccergal_66 Nov 25 '17 at 00:47
%s is the format specifier for a string. When you write '%s' % variable, you create a string containing the value of the variable. It is useful when you need to use many variables: 'Here goes a string %s. Here goes an int %d. And here goes a float %f' % (str_var, int_var, float_var). – Ignacio Pascual Nov 25 '17 at 00:53

user1767754 · Accepted Answer · 2017-11-25T01:12:12.550

I would first do it by brute force rather than overcomplicating it, and try to optimize it later.

keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']

# Total occurrences of each keyword across all reviews
results = dict()
for i in keywords:
    for j in reviews:
        # .get returns 0 the first time a keyword is seen, avoiding a KeyError
        # str.count matches substrings, so 'testing' also counts as 'test'
        results[i] = results.get(i, 0) + j.count(i)

print(results)
# {'test': 6, 'blue sky': 1, 'grass is green': 1}

It's important that we query the dict with .get: if a key has not been set yet, .get returns the default (0 here) instead of raising a KeyError.

If you want to go the complicated route, you can build your own trie and counter structure to do searches in large text files.

Parsing one terabyte of text and efficiently counting the number of occurrences of each word
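Short of building a trie, one intermediate step is to compile all keywords into a single alternation pattern so that each review is scanned once. This is a swapped-in technique, not the approach from the linked question, and the names pattern and count_keywords are hypothetical. It assumes no keyword is contained inside another keyword, because the regex only reports non-overlapping matches and the longer phrase would win.

```python
import re
from collections import Counter

keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

# Longest keywords first, so a longer phrase is preferred over a shorter
# keyword that it starts with
alternatives = sorted(keywords, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:%s)\b' % '|'.join(re.escape(keyword) for keyword in alternatives)
)


def count_keywords(text):
    """Return {keyword: count} for one review using a single regex pass."""
    found = Counter(pattern.findall(text))
    # Counter returns 0 for keywords that never matched
    return {keyword: found[keyword] for keyword in keywords}


counts_per_review = [count_keywords(review) for review in reviews]
print(counts_per_review)
```

The pattern has to be recompiled whenever the keyword list changes, but it is built once and reused across all reviews.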

Great, this worked well, and yes, I was definitely trying to overcomplicate it. I flipped the loops as I needed keyword counts by review. I will read through the link you sent for future reference – soccergal_66, Nov 25 '17 at 01:09
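The asker's actual code is not shown, so the following is only a guess at what the flipped-loop version might look like. The key format 'review1', 'review2', ... is an assumption borrowed from the expected output in the question.

```python
keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

# Outer loop over reviews, inner loop over keywords: one dict per review
results = {}
for number, review in enumerate(reviews, start=1):
    per_review = {}
    for keyword in keywords:
        # str.count counts non-overlapping substring matches
        per_review[keyword] = review.count(keyword)
    results['review%d' % number] = per_review

for name, counts in results.items():
    print(name, counts)
```

Note that str.count matches substrings, so "testing" in the first review is counted as "test" and review1 reports 3 rather than the 2 expected in the question. A word-boundary regex avoids that if exact word matches matter.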
