
I'm trying to get a list of every word, 2-word, and 3-word phrase used in a bunch of product reviews (200K+ reviews). The reviews are provided to me as JSON objects. I have attempted to remove as much data from memory as possible by using generators, but I'm still running out of memory and don't quite know where to go next. I reviewed the use of generators/iterators and a very similar problem here: repeated phrases in the text Python, but I still can't get it to work for a large dataset (my code works fine if I take a subset of the reviews).

The format (or at least intended format) of my code is as follows:

- Read in the text file containing JSON objects line by line
- Parse the current line to a JSON object and pull out the review text (there is other data in the dict which I do not need)
- Break the review into component words, clean the words, and then add them to my master list, or increment the counter of that word/phrase if it already exists

Any assistance would be greatly appreciated!

import json
import nltk
import collections

#define set of "stopwords", those that are removed
s_words=set(nltk.corpus.stopwords.words('english')).union(set(["it's", "us", " "]))

#load tokenizer, which will split text into words, and stemmer - which stems words
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = nltk.SnowballStemmer('english')
master_wordlist = collections.defaultdict(int)
#open the raw data and read it in by line
allReviews = open('sample_reviews.json')
lines = allReviews.readlines()
allReviews.close()


#Get all of the words, 2 and 3 word phrases, in one review
def getAllWords(jsonObject):
    all_words = []
    phrase2 = []
    phrase3 = []

    sentences=tokenizer.tokenize(jsonObject['text'])
    for sentence in sentences:
        #split up the words and clean each word
        words = sentence.split()

        for word in words:
            adj_word = str(word).translate(None, '"""#$&*@.,!()-+?/[]1234567890\'').lower()
            #filter out stop words
            if adj_word not in s_words:

                all_words.append(str(stemmer.stem(adj_word)))

                #add all 2 word combos to list
                phrase2.append(str(word))
                if len(phrase2) > 2:
                    phrase2.remove(phrase2[0])
                if len(phrase2) == 2:
                    all_words.append(tuple(phrase2))

                #add all 3 word combos to list
                phrase3.append(str(word))
                if len(phrase3) > 3:
                    phrase3.remove(phrase3[0])
                if len(phrase3) == 3:
                    all_words.append(tuple(phrase3))

    return all_words
#end of getAllWords

#parse each line from the txt file to a json object
for c in lines:
    review = (json.loads(c))
    #count instances of each unique word/phrase in wordlist
    for phrase in getAllWords(review):
        master_wordlist[phrase] += 1

1 Answer


I believe calling readlines() loads the whole file into memory; there should be less overhead if you just iterate over the file object line by line:

#parse each line from the txt file to a json object
with open('sample_reviews.json') as f:
    for line in f:
        review = json.loads(line)
        #count instances of each unique word/phrase in wordlist
        for phrase in getAllWords(review):
            master_wordlist[phrase] += 1
  • Thanks for the reply. I will rewrite to remove the readlines. My assumption was that, since the code doesn't bomb out until it's been running for a while, I had some runaway memory issue beyond that. I'll try fixing that first. – flyingmeatball May 16 '13 at 18:08
  • @flyingmeatball Where are your generators? Can `getAllWords` `yield` results instead of building and returning a list? – dm03514 May 16 '13 at 18:09
  • When I try to implement yield for getAllWords I get an 'unhashable type: list' error on the line master_wordlist[phrase] += 1. If you have any suggestions I'd be happy to hear them. – flyingmeatball May 16 '13 at 18:47
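
For what it's worth, the 'unhashable type: list' error mentioned in that last comment is usually caused by yielding the rolling phrase lists themselves, since lists can't be used as dictionary keys; converting them to tuples before yielding avoids it. Below is a minimal sketch (not the poster's actual code) of what a generator version of getAllWords could look like, keeping the original Python 2 cleaning and windowing logic and simply replacing the appends with yields:

def getAllWords(jsonObject):
    phrase2 = []
    phrase3 = []

    for sentence in tokenizer.tokenize(jsonObject['text']):
        for word in sentence.split():
            #same cleaning as the original version
            adj_word = str(word).translate(None, '"""#$&*@.,!()-+?/[]1234567890\'').lower()
            if adj_word in s_words:
                continue

            #yield the stemmed single word instead of appending it to a list
            yield str(stemmer.stem(adj_word))

            #rolling window of the last 2 words
            phrase2.append(str(word))
            if len(phrase2) > 2:
                phrase2.pop(0)
            if len(phrase2) == 2:
                #tuples are hashable and work as dict keys; yielding the
                #list itself is what raises "unhashable type: list"
                yield tuple(phrase2)

            #rolling window of the last 3 words
            phrase3.append(str(word))
            if len(phrase3) > 3:
                phrase3.pop(0)
            if len(phrase3) == 3:
                yield tuple(phrase3)

The calling loop from the answer works unchanged, since `for phrase in getAllWords(review)` iterates over the generator exactly as it did over the returned list.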