3

I have a use case where I need to store a large number of entries with the unique set sizes for each one. If we simplify this to contacts (which the problem very much is not). We would have a problem like :

Given a user know how many friends they have:

Joe - Mary, John, Bob, Tom

Mary - Carol, Suzy, Mike, Fred, Robert

Thus friends(Joe) = 4 - the only operation supported is addFriend(Joe, Sam). While Mary might be friends with Joe there is no need to store any of that related information.

What I would rather not do is store all of the entries in each set, but yet a bloom filter doesn't quite feel right. Is there other alternatives?

Update: The challenge is that I've got 20M Joe/Mary/... in the top level sets with 4M semi-distinct members in each set. The quick code example is below (python for simplicity) - but at scale + persistent storage the universe comes to an end.

class World:
    def __init__(self):
        self.friends = defaultdict(set)

    def addFriend(self, id, member):
        self.friends[id].add(member)

    def friends(self, id):
        return len(friends[id])
Ashwin Nanjappa
  • 76,204
  • 83
  • 211
  • 292
koblas
  • 25,410
  • 6
  • 39
  • 49
  • which programming language are you working on? – Raptor Apr 05 '13 at 04:27
  • Pretty much any - making a quick update. – koblas Apr 05 '13 at 04:35
  • What is the output of friends(Mary) ? – sowrov Apr 05 '13 at 04:41
  • Isn't `friends(Mary) = 6` since Joe and Mary are friends? – Bernhard Barker Apr 05 '13 at 06:55
  • What is the total size of members? Is it 20m or 24m or 80m? Ie, are the 4m members random/unique values or are they all contained in the 20m set headers? Also, what do you want to favour, small space, fast insert or fast retrieval? – rlb Apr 05 '13 at 09:59
  • @dukeling - sorry the sets are distinct in their members, that's an error in my made up names. – koblas Apr 05 '13 at 13:23
  • @rlb The total space of members is around 500m, which need to be attached to around 100m objects. Most sets will have 100K members, with outliers in the range of 5m. – koblas Apr 05 '13 at 13:25

1 Answers1

1

Since you're considering a Bloom filter, it sounds as though approximate answers would be OK. Use a small-space cardinality estimator like HyperLogLog in place of self.friends.

Community
  • 1
  • 1
David Eisenstat
  • 64,237
  • 7
  • 60
  • 120