
I've got a list of bookmarks. Each bookmark has a list of keywords (stored as a HashSet). I also have a set of all possible keywords ("universe").

I want to find the keyword that appears in the most bookmarks.

I have 1,356 bookmarks with a combined total of 698,539 keywords, of which 187,358 are unique.

If I iterate through every keyword in the universe and count the number of bookmarks it appears in, I'm doing 254,057,448 checks. This takes 35 seconds on my machine.

The algorithm is pretty simple:

var biggest = universe.MaxBy(kw => bookmarks.Count(bm => bm.Keywords.Contains(kw)));

Using Jon Skeet's MaxBy.
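
For reference, a MaxBy extension along these lines looks roughly like this (a sketch, not Jon Skeet's actual implementation; it just returns the element whose selector value is largest):

// Requires: using System; using System.Collections.Generic;
public static class EnumerableExtensions
{
    public static TSource MaxBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> selector)
    {
        var comparer = Comparer<TKey>.Default;
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");

            TSource max = e.Current;
            TKey maxKey = selector(max);

            while (e.MoveNext())
            {
                TKey key = selector(e.Current);
                if (comparer.Compare(key, maxKey) > 0)
                {
                    max = e.Current;
                    maxKey = key;
                }
            }

            return max;
        }
    }
}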

I'm not sure it's possible to speed this up much, but is there anything I can do? Perhaps parallelize it somehow?
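
One thing I've since tried, as suggested in the comments below, is PLINQ. Something roughly like this (the exact shape of the query is my own sketch; the expensive part, counting bookmarks per keyword, is what gets parallelized):

// Rough PLINQ variant -- counts each keyword's bookmarks in parallel,
// then picks the keyword with the highest count. Not guaranteed to be faster.
var biggestParallel = universe
    .AsParallel()
    .Select(kw => new
    {
        Keyword = kw,
        Count = bookmarks.Count(bm => bm.Keywords.Contains(kw))
    })
    .OrderByDescending(x => x.Count)
    .First()
    .Keyword;

That only shaved off about a second for me.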


Update: dtb's solution takes under 200 ms to both build the universe and find the biggest element. So simple.

var freq = new FreqDict();
foreach(var bm in bookmarks) {
    freq.Add(bm.Keywords);
}
var biggest2 = freq.MaxBy(kvp => kvp.Value);

FreqDict is just a little class I made built on top of a Dictionary<string,int>.
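
For completeness, a minimal sketch of what such a FreqDict might look like (illustrative only; the real class just counts keywords in a Dictionary<string,int> and exposes the pairs so MaxBy can enumerate them):

// Illustrative sketch of a frequency-counting dictionary.
// Requires: using System.Collections.Generic;
public class FreqDict : IEnumerable<KeyValuePair<string, int>>
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Increment the count for every keyword in the given collection.
    public void Add(IEnumerable<string> keywords)
    {
        foreach (var kw in keywords)
        {
            int count;
            _counts.TryGetValue(kw, out count);
            _counts[kw] = count + 1;
        }
    }

    public IEnumerator<KeyValuePair<string, int>> GetEnumerator()
    {
        return _counts.GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}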

mpen
  • How much stress does it put on your CPU while you're waiting for those 35 seconds to pass? – IneedHelp Aug 12 '12 at 07:06
  • @IneedHelp: Watching the Performance tab in Task Manager (Win7), the CPU usage jumps from 1% to about 25% then settles around 12%. I have 4 cores, hyperthreaded to 8. – mpen Aug 12 '12 at 07:09
  • @Mark: have you considered counting the number of occurrences of each keyword when creating the universe? – dtb Aug 12 '12 at 07:13
  • @dtb: No...that didn't occur to me. Creating the universe only takes 100ms; it's just doing a bunch of `UnionWiths`. If I used a `Dictionary` instead and did some counting, I imagine it wouldn't be much slower, and it would give me more information. Curious. Regardless though, I wanted to apply this recursively, with a "shrinking universe" -- i.e., I'm implementing a greedy solution to the set cover problem. I guess I can still apply this solution though, I just won't add any new entries the second time around. Will try tomorrow morning. Thanks! – mpen Aug 12 '12 at 07:19
  • @Mark Apart from the counting, for parallelization you could always try and benchmark the `universe.AsParallel()...`, from the [`PLINQ`](http://msdn.microsoft.com/en-us/magazine/cc163329.aspx). Mind you, it doesn't *have* to be faster... – Patryk Ćwiek Aug 12 '12 at 07:21
  • @Trustme-I'maDoctor: That saved me almost a full second. Better than nothing I guess. – mpen Aug 12 '12 at 07:22
  • Looks like all 4 answers I've got are variants of the same thing dtb suggested, and what is now included in my question. – mpen Aug 12 '12 at 17:25

4 Answers


You can get all keywords, group them, and get the biggest group. This uses more memory, but should be faster.

I tried this, and in my test it was about 80 times faster:

string biggest =
  bookmarks
  .SelectMany(m => m.Keywords)
  .GroupBy(k => k)
  .OrderByDescending(g => g.Count())
  .First()
  .Key;

Test run:

1536 bookmarks
153600 keywords
74245 unique keywords

Original:
12098 ms.
biggest = "18541"

New:
148 ms.
biggest = "18541"
Guffa
  • This is pretty slick. Try `.MaxBy(g => g.Count())` after grouping instead of using a sort. – erisco Aug 12 '12 at 07:59
  • @erisco: Thanks. Using `MaxBy` should be faster than sorting, but I can't see the difference in the test. I guess that step is such a small part of the whole operation that it doesn't make a practical difference. – Guffa Aug 12 '12 at 08:04
  • For timing, maybe you should run them multiple times, e.g. 1000 times, because too much time will be spent on initialization. Also, I'm pretty sure a plain for-loop approach runs faster than LINQ. – Saeed Amiri Aug 12 '12 at 10:25
  • @SaeedAmiri: Looping the items into a dictionary is basically the same thing as GroupBy does. There is always some overhead when using Linq, so you can always make it slightly faster by rewriting it yourself, but on the other hand you reduce the risk for bugs by using the methods in the framework. – Guffa Aug 12 '12 at 10:42
  • Yes, we could avoid the overhead and improve performance at the cost of increased risk and reduced code readability and maintainability, but it all depends on the project; I'd only suggest this if the OP is highly focused on performance. – Saeed Amiri Aug 12 '12 at 11:18
  • Using the `MaxBy` variant of this it takes about 295ms, using `OrderBy`, it takes about 380ms. Custom dictionary solution is still faster :) This is cool though. – mpen Aug 12 '12 at 17:35

You don't need to iterate through the whole universe. The idea is to build a lookup and track the max as you go.

    public Keyword GetMaxKeyword(IEnumerable<Bookmark> bookmarks)
    {
        int max = 0;
        Keyword maxkw = null;

        Dictionary<Keyword, int> lookup = new Dictionary<Keyword, int>();

        foreach (var item in bookmarks)
        {
            foreach (var kw in item.Keywords)
            {
                int val = 1;

                if (lookup.ContainsKey(kw))
                {
                    val = ++lookup[kw];
                }
                else
                {
                    lookup.Add(kw, 1);
                }

                if (max < val)
                {
                    max = val;
                    maxkw = kw;
                }
            }
        }

        return maxkw;
    }
Ankush
  • Smart. I was using a dictionary but I didn't think to store the max at the same time. Let me see how much time that saves... – mpen Aug 12 '12 at 17:37
  • Nope. Doesn't save any time. Might even cost a few ms. I think the extra operations on each add outweigh doing 1 iteration to find the max at the end. – mpen Aug 12 '12 at 17:44

I don't have your sample data, nor have I done any benchmarking, but I'll take a stab. One problem that could be improved upon is that most of the bm.Keywords.Contains(kw) checks are misses, and I think those can be avoided. The most constraining set is the set of keywords any one bookmark has (i.e., it will typically be much smaller than the universe), so we should iterate from that direction instead of the other way around.

I'm thinking something along these lines. The memory requirement is much higher and since I haven't benchmarked anything, it could be slower, or not helpful, but I'll just delete my answer if it doesn't work out for you.

Dictionary<string, int> keywordCounts = new Dictionary<string, int>(universe.Count);  // pre-size to the number of unique keywords
foreach (var keyword in universe)
{
    keywordCounts.Add(keyword, 0);
}

foreach (var bookmark in bookmarks)
{
    foreach (var keyword in bookmark.Keywords)
    {
        keywordCounts[keyword] += 1;
    }
}

var mostCommonKeyword = keywordCounts.MaxBy(x => x.Value).Key;
erisco
  • Accepting this one as it appears to be the fastest solution, although they're all quite similar. – mpen Aug 13 '12 at 00:19

50 ms in Python:

>>> import random

>>> universe = set()
>>> bookmarks = []
>>> for i in range(1356):
...     bookmark = []
...     for j in range(698539//1356):
...         key_word = random.randint(1000, 1000000000)
...         universe.add(key_word)
...         bookmark.append(key_word)
...     bookmarks.append(bookmark)
...
>>> key_word_count = {}
>>> for bookmark in bookmarks:
...     for key_word in bookmark:
...         key_word_count[key_word] = key_word_count.get(key_word, 0) + 1
...

>>> print max(key_word_count, key=key_word_count.__getitem__)
408530590

>>> print key_word_count[408530590]
3
>>>
Rusty Rob
  • I'm curious why this is faster in Python. I'm essentially doing the same thing now. Maybe because my keys are strings? – mpen Aug 12 '12 at 17:51
  • TBH the 50 ms figure isn't very accurate at all; it depends on the machine, and yeah, integers may be faster? Not sure. – Rusty Rob Aug 12 '12 at 22:58