I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on
I want to find out the most frequently occurring word-pairs e.g.
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
The two words could be in any order and at any distance from each other
Can someone suggest a possible solution in python? This is a very large data set.
Any suggestion is highly appreciated
So this is what I tried after suggestions from @275365
@275365 I tried the following with input read from a file
def collect_pairs(file):
pair_counter = Counter()
for line in open(file):
unique_tokens = sorted(set(line))
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
print pair_counter
file = ('myfileComb.txt')
p=collect_pairs(file)
text file has same number of lines as the original one but has only unique tokens in a particular line. I don't know what am I doing wrong since when I run this it splits the words in letters rather than giving output as combinations of words. When I run this file it outputs split letters rather than combinations of words as expected. I dont know where I am making a mistake.