Set sizes for large number of sets

Question

I have a use case where I need to store a large number of entries with the unique set sizes for each one. If we simplify this to contacts (which the problem very much is not). We would have a problem like :

Given a user know how many friends they have:

Joe - Mary, John, Bob, Tom

Mary - Carol, Suzy, Mike, Fred, Robert

Thus friends(Joe) = 4 - the only operation supported is addFriend(Joe, Sam). While Mary might be friends with Joe there is no need to store any of that related information.

What I would rather not do is store all of the entries in each set, but yet a bloom filter doesn't quite feel right. Is there other alternatives?

Update: The challenge is that I've got 20M Joe/Mary/... in the top level sets with 4M semi-distinct members in each set. The quick code example is below (python for simplicity) - but at scale + persistent storage the universe comes to an end.

class World:
    def __init__(self):
        self.friends = defaultdict(set)

    def addFriend(self, id, member):
        self.friends[id].add(member)

    def friends(self, id):
        return len(friends[id])

What is the total size of members? Is it 20m or 24m or 80m? Ie, are the 4m members random/unique values or are they all contained in the 20m set headers? Also, what do you want to favour, small space, fast insert or fast retrieval? — rlb, Apr 05 '13 at 09:59
@dukeling - sorry the sets are distinct in their members, that's an error in my made up names. — koblas, Apr 05 '13 at 13:23
@rlb The total space of members is around 500m, which need to be attached to around 100m objects. Most sets will have 100K members, with outliers in the range of 5m. — koblas, Apr 05 '13 at 13:25

score 1 · Accepted Answer · edited May 23 '17 at 11:56

1

Since you're considering a Bloom filter, it sounds as though approximate answers would be OK. Use a small-space cardinality estimator like HyperLogLog in place of self.friends.

edited May 23 '17 at 11:56

Community

1
1

answered Apr 05 '13 at 12:38

David Eisenstat

64,237
7
60
120

Set sizes for large number of sets

1 Answers1