How to count the documents which contain a specific word?

Question

Assuming I have a dict like this:

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}

And what I want to do is like calculating "Document Frequency": assuming each dictionary item is a document, and I have a specific word, so how many documents contain that word?

I've seen many posts telling me how to calculate frequency, but here if "a" appears twice in document "alpha", I just need the count to be 1. So the "frequency" of "a" should be 1, and "c" should be 2.

I know I can iterate the whole documents dictionary, and add the counter when finding the word in a document. Or I can firstly make the words in every document unique, then combine all the documents and count the word.

But I think there's a better way, a more effective way. Any ideas?

BTW, is there any way I can keep the structure of the dict? In this example, I'd like to get a result of {"alpha": {'c': 2, 'b': 2, 'a': 1}, "bravo": {'c': 2, 'b': 2, 'd': 1}

Update

If here I have just a list (something like [["a", "b", "c", "a", "b"], ["b", "c", "d", "c", "d"]]), how can I get a result list like [[1, 2, 2, 0], [0, 2, 2, 1]].

I've got no idea. The point is to expand each list and assure the order of the terms. Thoughts?

I do not understand your goal. You want to count how many times a character/word is in a document? But you do not want the actual count? Why does `"a"` have a count of `1` and `"c"` have a count of `2`? That doesn't make sense to me. — Cory Kramer, Apr 01 '14 at 12:38
@Cyber I want to count how many documents contain that word. In my example, only `"alpha"` contains `"a"`, so it should be 1 (although twice in `"alpha"`), and `"c"` is in both `"alpha"` and `"bravo"`, so it's 2 (although there're 3 `"c"`s in total). — Melkor, Apr 03 '14 at 15:05

Ashwini Chaudhary · Accepted Answer · 2014-04-03T19:10:09.493

6

I'd go with your second way using collections.Counter and set.

>>> from collections import Counter
>>> sum((Counter(set(x)) for x in docDict.itervalues()), Counter())
Counter({'c': 2, 'b': 2, 'a': 1, 'd': 1})

Update 1:

>>> c = sum((Counter(set(x)) for x in docDict.itervalues()), Counter())
>>> {k: {k1:c[k1] for k1 in set(v)} for k, v in docDict.iteritems()}
{'alpha': {'a': 1, 'c': 2, 'b': 2}, 'bravo': {'c': 2, 'b': 2, 'd': 1}}

update 2::

If performance is an concern then don't use Counter with sum, here another way to do it. Note that unlike @user2931409 answer I am not keeping a set of words in memory just to get their length, so this is much more memory efficient but slightly slower than their answer.

result = Counter()
for v in docDict.itervalues():
    result.update(set(v))
return result

Timing comparison:

def func1():
    #http://stackoverflow.com/a/22787509/846892
    result = defaultdict(set)
    for k, vlist in docDict.items():
        for v in vlist:
            result[v].add(k)
    return dict(zip(result.keys(), map(lambda x:len(x), result.values())))

def func2():

    result = Counter()
    for v in docDict.itervalues():
        result.update(set(v))
    return result

In [94]: docDict = {''.join(random.choice(lis) for _ in xrange(8)): random.sample(lis, 25)
    ...:   for _ in xrange(70000)}

In [95]: %timeit func1(docDict)
1 loops, best of 3: 380 ms per loop

In [96]: %timeit func2(docDict)
1 loops, best of 3: 591 ms per loop

In [97]: docDict = {''.join(random.choice(lis) for _ in xrange(8)): random.sample(lis, 25)
    ...:   for _ in xrange(10**5)}

In [98]: %timeit func1(docDict)
1 loops, best of 3: 529 ms per loop

In [99]: %timeit func2(docDict)
1 loops, best of 3: 848 ms per loop

In [101]: func1(docDict) == func2(docDict)
Out[101]: True

edited Apr 03 '14 at 19:10

answered Apr 01 '14 at 12:40

Ashwini Chaudhary

244,495
58
464
504

Thanks! I never thought it could be so compact and elegant! – Melkor Apr 03 '14 at 15:06
But... Is there any way I can keep the structure of the dict? Like... `{"alpha": {'c': 2, 'b': 2, 'a': 1}, "bravo": {'c': 2, 'b': 2, 'd': 1}}` – Melkor Apr 03 '14 at 16:01
@Melkor Check the updated answer, you'll need an additional loop for that. – Ashwini Chaudhary Apr 03 '14 at 16:11
Hi, I've tested it, but it runs really slow. When I processed more than 70 thousands words it took me nearly 300 seconds. But the way @user2931409 said is really fast. – Melkor Apr 03 '14 at 17:52
@Melkor I've another answer. – Ashwini Chaudhary Apr 03 '14 at 19:10
Yeah you win... It's looks really simpler. – Melkor Apr 04 '14 at 05:58
And comparing with `dict.items()`, will `dict.iteritems()` be slower? Theoretically it should be (as it doesn't store everything in memory), but in my test they're generally the same in execution time. BTW, python3 has change the behavier of `dict.items()` to something just like `dict.iteritems()`, but I found python3.3 runs about 20% to 30% slower than python2.7. – Melkor Apr 04 '14 at 06:02
@Melkor That depends on size actually, `dict.items()` will have to create the list in memory first and then start the iteration over it, OTOH `dict.iteritems()` calls the `__next__` method during the loop to get one value at a time. For small to mid-sized lists `.items()` can easily beat `.iteritems()`, but as the size increased further `dict.iteritems` is going to be a better option due to O(1) memory and no list creation in the first place. This might help: http://stackoverflow.com/q/11964130/846892 – Ashwini Chaudhary Apr 04 '14 at 08:11
I've updated the question, proposing another similar problem. Do you have some good ideas? – Melkor Apr 04 '14 at 17:08

Kei Minagawa · Answer 2 · 2014-04-04T15:36:59.433

2

This is not special one, very ordinary way.

from collections import defaultdict

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}
result = defaultdict(set)

for k, vlist in docDict.items():
    for v in vlist:
        result[v].add(k)

#Now the result looks like this.
#{'a': set(['alpha']), 'c': set(['alpha', 'bravo']), 'b': set(['alpha', 'bravo']), 'd': set(['bravo'])})

print dict(zip(result.keys(), map(lambda x:len(x), result.values())))
#{'a': 1, 'c': 2, 'b': 2, 'd': 1}

update

Another way... just counting. And changed to use iterator. So it's more faster than above code.

from collections import defaultdict
def func3(docDict):
    result = defaultdict(int)
    for vlist in docDict.itervalues():
        for i in set(vlist):
            result[i] += 1
    return dict(result)

edited Apr 04 '14 at 15:36

answered Apr 01 '14 at 13:33

Kei Minagawa

4,395
3
25
43

I have to say it's really magic. It only took about 2 seconds processing more than 70 thousands words of more than 3 thousands lines! As for keeping the structure of the dict, I just create a new dict and iterate the original dict to map the result from this counter. Still really fast. – Melkor Apr 03 '14 at 17:50
@Melkor: I didn't know that `set` function and `for-loop` is such fast. Thanks for telling me. Anyway I upload more faster one.:) – Kei Minagawa Apr 04 '14 at 15:38

score 1 · Answer 3 · answered Apr 01 '14 at 12:40

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}
revDict = {v : sum(1 for l in docDict.values() if v in l)  
        for v in set(x for y in docDict.values() for x in y) }
print revDict

Gives:

{'a': 1, 'c': 2, 'b': 2, 'd': 1}

score 1 · Answer 4 · answered Apr 01 '14 at 12:46

You can use set to unify the characters in a single document. Then simply Counter() them.

from collections import Counter

docDict = {"alpha": ["a", "b", "c", "a", "b"], "bravo": ["b", "c", "d", "c", "d"]}

result = reduce(lambda x, y: x + Counter(set(y)), docDict.itervalues(), Counter([]))

How to count the documents which contain a specific word?

4 Answers4