
I am trying to take a CSV file and find the common phrases and their counts using Python 2.7. Currently I can only get individual words and their counts, but I need common phrases.

Here's my code so far:

import csv
from sys import argv
from collections import defaultdict
from collections import Counter

script, filename = argv
data = defaultdict(list)

with open(filename, 'rb') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for row in reader:
        data[row[2]].append(row[3])

# write the collected data once, after reading, instead of on every row
with open("output.txt", "w") as text_file:
    text_file.write("%r" % data)

print(data)

# count individual words across everything that was collected
c = Counter()
for texts in data.itervalues():
    for text in texts:
        c.update(text.split())
print c.most_common(10)
2 Answers


If you are going to be doing this for more than one file or for large files, I suggest using an indexing engine like Lucene.

You can index n-grams (phrases of n words) into Lucene and then use Lucene's query and search API to rank and find the phrases with the highest occurrence.

Lucene is supported in Python with pylucene.
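
For reference, an n-gram is just a sliding window of n consecutive words. Here is a minimal plain-Python sketch of how you could generate them before feeding them to the index; it is not part of Lucene or pylucene, and the function name make_ngrams is only for this illustration:

def make_ngrams(text, n):
    """Return the list of n-word phrases (n-grams) found in text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print make_ngrams("the quick brown fox jumps", 3)
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]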

Manvendra Gupta
  • 406
  • 5
  • 9

First, extract the phrases using a natural language tokenizer. Even the simplest language has an enormous number of subtleties in the definition of a sentence, so trying to parse phrases with a regex is probably going to drive you crazy.

From there, apply your current approach to count the frequency of phrases instead of words, treating "common phrases" as those that appear more than once. If that is not what you mean by "common phrases", you should clarify it further in your question.
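
As a minimal sketch of both steps, assuming NLTK is installed and its "punkt" tokenizer data has been downloaded with nltk.download('punkt'); the function name common_phrases is only for illustration:

from collections import Counter
from nltk.tokenize import sent_tokenize

def common_phrases(texts):
    """Count sentence-level phrases and keep those that appear more than once."""
    counts = Counter()
    for text in texts:
        counts.update(sent_tokenize(text))
    return [(phrase, n) for phrase, n in counts.most_common() if n > 1]

# e.g. common_phrases(sum(data.values(), [])) on the dict built in the question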

Eduardo