
I have a data set as follows:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

and so on

I want to find the most frequently occurring word pairs, e.g.

(Statistics,Estimation:2)
(Statistics,Narnia:2)
(MachineLearning,AI:3)

The two words can be in any order and at any distance from each other, as long as they are on the same line, so (Narnia,Statistics) counts as the same pair as (Statistics,Narnia).

Can someone suggest a possible solution in Python? This is a very large data set.

Any suggestion is highly appreciated.

Following the suggestions from @275365, this is what I tried, with input read from a file:

    from collections import Counter
    from itertools import combinations

    def collect_pairs(file):
        pair_counter = Counter()
        for line in open(file):
            unique_tokens = sorted(set(line))
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
            print pair_counter

    file = 'myfileComb.txt'
    p = collect_pairs(file)

The text file has the same number of lines as the original one, but only unique tokens on each line. I don't know what I am doing wrong: when I run this, it splits the words into individual letters rather than producing combinations of words.
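For example, here is what `sorted(set(line))` gives me on a line like the first one in my sample data:

    >>> line = '"485","AlterNet","Statistics"'
    >>> sorted(set(line))[:8]
    ['"', ',', '4', '5', '8', 'A', 'N', 'S']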

ajaanbaahu
  • is `'two and a half men'` one token, or is it five tokens? – roippi Jan 23 '14 at 01:53
  • it is one token, anything in " " is a single token – ajaanbaahu Jan 23 '14 at 01:57
  • What have you attempted? – RyPeck Jan 23 '14 at 02:41
  • Using brute force is an option, where I calculate the possible permutations of unique tokens and run through the entire data set. But this will fail even if the data set is only moderately large. – ajaanbaahu Jan 23 '14 at 02:58
  • What do you mean "could be...at any distance from each other"? In your example data set, is `(Statistics,TopGear)` a pair? Or can we assume that only words from the same line can be paired? – steveha Jan 23 '14 at 03:09
  • @steveha At any distance means that, let's say, a few lines have "Statistics","regression","logistics","convex",.....,"Estimation","TopGear" (or any permutation of these words). Now (Statistics, Estimation) and (Statistics, TopGear) are not placed consecutively but are still on the same line, hence they form a pair. So it doesn't matter at what position or in what order the words appear within the line. But yes, all the words to be paired need to be on the same line. – ajaanbaahu Jan 23 '14 at 17:31
  • This previous question covers the same topic: http://stackoverflow.com/questions/5560479/get-sorted-combinations – Tom Morris Jan 24 '14 at 17:51

2 Answers


You might start with something like this, depending on how large your corpus is:

>>> from itertools import combinations
>>> from collections import Counter

>>> def collect_pairs(lines):
...     pair_counter = Counter()
...     for line in lines:
...         unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before the other
...         combos = combinations(unique_tokens, 2)
...         pair_counter += Counter(combos)
...     return pair_counter

The result:

>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]

Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
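If you would rather exclude them, here is a minimal variation, under the assumption (mine, not stated in the question) that the IDs are purely numeric strings like "485":

    from collections import Counter
    from itertools import combinations

    def collect_pairs_no_ids(lines):
        pair_counter = Counter()
        for line in lines:
            # drop purely numeric tokens such as "485" (assumes IDs are all digits)
            tokens = sorted(t for t in set(line) if not t.isdigit())
            pair_counter += Counter(combinations(tokens, 2))
        return pair_counter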

EDIT: Working with a file object

The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval:

from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # these lines are each simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # eval will convert each line into a tuple before converting the tuple to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object

Now you can test it like this:

file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10)  # for example
Justin O Barber
  • I will try this solution as soon as possible. How do you expect it to scale for a large data set of, let's say, 10K unique tokens and a file 1TB in size with multiple pairs on the same line? – ajaanbaahu Jan 23 '14 at 03:02
  • @user3197086 Wow, 1TB, eh? It might take a while to find all the combos present in 1 TB of data, with multiple combos per line. You might also explore set intersections as a way to avoid exploring every combo individually. – Justin O Barber Jan 23 '14 at 03:07
  • 10K unique tokens is only 100M combinations and you won't have that many since the combinations will be sparse. In the code above, you should instead use `combinations(set(line),2)` and then drop the `sorted`. 1TB, on the other hand, may call for you to parallelize things on a cluster (depending on whether this is a one-time task or an ongoing thing). – Tom Morris Jan 23 '14 at 19:18
  • @Tom Morris made some excellent suggestions. I changed my answer accordingly. – Justin O Barber Jan 23 '14 at 19:37
  • @Tom Morris I had to retract one of the changes. I believe you have to use `sorted` after all, since `combinations` does not do the kind of sorting required here. For example, `list(combinations(set(['z', 't', 'a']), 2))` can output `[('t', 'z'), ('t', 'a'), ('z', 'a')]`. That means that for any given tuple, one word might occur before the other and would thus confuse the Counter. The other suggestions were excellent. `combinations` just does not sort alphabetically, although it will preserve sorting. – Justin O Barber Jan 24 '14 at 03:55
  • @275365 Please see the code that I posted and tell me where I am wrong – ajaanbaahu Jan 25 '14 at 18:30
  • @user3197086 I edited my answer with a corrected version of your function. Let me know if you cannot trust the data and need to use a regex instead. – Justin O Barber Jan 25 '14 at 18:52
  • user @275365, you nailed it my friend, thanks a bunch! I just tested it on my sample data set and it seems to work fine. I am now gonna run it on an 800MB data set. Any idea how to use Hadoop streaming to build a MapReduce task, since for a bigger data set the number of combinations is going to be huge? – ajaanbaahu Jan 26 '14 at 00:42
  • @user3197086 I'm glad it's working for you so far. Unfortunately, I'm not quite sure how to do a hadoop mapreduce program, but [this website looks promising](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/), especially [here](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/#improved-mapper-and-reducer-code-using-python-iterators-and-generators). After you make a first attempt, you could always pose that as another question here on StackOverflow. I'm sure someone could help you out. – Justin O Barber Jan 26 '14 at 01:04
  • @user275365 Awesome, I will do that – ajaanbaahu Jan 26 '14 at 01:16

There is not much you can do except count all pairs.

Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to count each pair only once, i.e. only count pairs (a,b) where a<b (in your example, count either statistics,narnia or narnia,statistics, but not both!); a sketch of this follows below.
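A minimal sketch of the canonical-ordering idea (my illustration; the names are made up):

    from collections import Counter
    from itertools import combinations

    def count_canonical_pairs(lines):
        counts = Counter()
        for tokens in lines:
            for a, b in combinations(set(tokens), 2):
                # store (min, max) so (a, b) and (b, a) land on the same key
                counts[(a, b) if a < b else (b, a)] += 1
        return counts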

If you run out of memory, perform two passes. In the first pass, use one or more hash functions to build a candidate filter; in the second pass, only count pairs that pass this filter (MinHash / LSH style filtering). See the sketch below.
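A rough sketch of the two-pass idea, assuming the input can be read twice (`read_lines` is a hypothetical callable returning an iterable of token lists); the bucket count and threshold are illustrative choices, not prescribed:

    from collections import Counter
    from itertools import combinations

    NUM_BUCKETS = 10 ** 7  # illustrative; size to available memory
    THRESHOLD = 100        # pairs below this frequency don't count as "frequent"

    def two_pass_count(read_lines):
        # pass 1: cheap approximate counts per hash bucket
        buckets = [0] * NUM_BUCKETS
        for tokens in read_lines():
            for pair in combinations(sorted(set(tokens)), 2):
                buckets[hash(pair) % NUM_BUCKETS] += 1
        # pass 2: exact counts, restricted to surviving buckets;
        # a pair can only reach THRESHOLD if its bucket count did
        exact = Counter()
        for tokens in read_lines():
            for pair in combinations(sorted(set(tokens)), 2):
                if buckets[hash(pair) % NUM_BUCKETS] >= THRESHOLD:
                    exact[pair] += 1
        return Counter({p: c for p, c in exact.items() if c >= THRESHOLD})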

It is an embarrassingly parallel problem, so it is also easy to distribute to multiple threads or computers; a minimal sketch follows.
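For example, a multiprocessing sketch, assuming the big file has already been split into chunk files (the chunk paths are hypothetical; the splitting itself is not shown):

    from multiprocessing import Pool
    from collections import Counter
    from itertools import combinations
    import ast

    def count_chunk(path):
        # exact per-chunk counts; same line parsing as in the other answer
        counts = Counter()
        for line in open(path):
            counts.update(combinations(sorted(set(ast.literal_eval(line))), 2))
        return counts

    def count_parallel(chunk_paths):
        pool = Pool()
        try:
            # Counters merge with +, so the reduce step is a simple sum
            return sum(pool.map(count_chunk, chunk_paths), Counter())
        finally:
            pool.close()
            pool.join()

    if __name__ == '__main__':
        total = count_parallel(['chunk0.txt', 'chunk1.txt'])  # hypothetical chunks
        print total.most_common(10)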

Has QUIT--Anony-Mousse