
I am trying to take a CSV file and find the common phrases and their counts using Python 2.7. Currently I can only get individual words and their counts, but I need common phrases.

Here's my code so far:

import csv
from sys import argv
from collections import defaultdict
from collections import Counter

script, filename = argv
data = defaultdict(list)

with open(filename, 'rb') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for row in reader:
        data[row[2]].append(row[3])

# write the collected data once, after reading, instead of on every row
with open("output.txt", "w") as text_file:
    text_file.write("%r" % data)

print(data)

# count individual words across everything that was collected
c = Counter()
for texts in data.itervalues():
    for text in texts:
        c.update(text.split())
print c.most_common(10)
2 Answers


If you are going to be doing this for more than one file or for large files, I suggest using an indexing engine like Lucene.

You can index n-grams (phrases of n words) into Lucene and then use Lucene's query and search API to rank and find the phrases with the highest occurrence.

Lucene is supported in Python with pylucene.
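
For reference, an n-gram is just a sliding window of n consecutive words. Here is a minimal plain-Python sketch of how you could generate them before feeding them to the index; it is not part of Lucene or pylucene, and the function name make_ngrams is only for this illustration:

def make_ngrams(text, n):
    """Return the list of n-word phrases (n-grams) found in text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print make_ngrams("the quick brown fox jumps", 3)
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]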

Manvendra Gupta
  • 406
  • 5
  • 9

First, extract the phrases using a natural language tokenizer. Even the simplest language has an enormous number of subtleties in the definition of a sentence, so trying to parse phrases with a regex is probably going to drive you crazy.

From there, apply your current approach to count the frequency of phrases instead of words, treating "common phrases" as those that appear more than once. If that is not what you mean by "common phrases", you should clarify it further in your question.
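
As a minimal sketch of both steps, assuming NLTK is installed and its "punkt" tokenizer data has been downloaded with nltk.download('punkt'); the function name common_phrases is only for illustration:

from collections import Counter
from nltk.tokenize import sent_tokenize

def common_phrases(texts):
    """Count sentence-level phrases and keep those that appear more than once."""
    counts = Counter()
    for text in texts:
        counts.update(sent_tokenize(text))
    return [(phrase, n) for phrase, n in counts.most_common() if n > 1]

# e.g. common_phrases(sum(data.values(), [])) on the dict built in the question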

Eduardo