0

Facebook has the ability to order users (e.g. in the search) according to the mutual friend count. Another example is the friend finder. The order is more or less the same thing.

My Question is: How can they keep track of the mutual friend counts, since you have friends of friends? How can they order friends in such a short time?

If we just assume that every user has 100 friends, simply that would in a worst case mean that for every person there must be n^2 = 10'000 entries per user in such an index.

There must be some indexing technique, but I really wonder how they do that on the database level.

Dave Halter
  • 15,556
  • 13
  • 76
  • 103
  • At work we have a pretty huge person database. Some connections of these persons exist (like facebooks friendships). It would be really awesome to display the most **relevant** "mutual" friends. And to answer your question, I haven't tried very much. Because I know that such a query with millions of users and connections would be painfully slow. – Dave Halter Mar 18 '12 at 15:34
  • if you want to ask how you could implement the query to order friends by the number of mutual friends, ask so.. instead of asking how is Facebook able to do it.. – Aprillion Mar 18 '12 at 16:14
  • I know how to write the query, but as I mentioned below on your "answer", I want to know how such a search works. – Dave Halter Mar 18 '12 at 23:01
  • have you read http://stackoverflow.com/questions/1108/how-does-database-indexing-work? – Aprillion Mar 19 '12 at 00:02
  • I do understand indexes. – Dave Halter Mar 25 '12 at 13:57

4 Answers4

1

Most likely they pre-calculate the results and store them in a distributed KV data base. Here is an explanation of how digg does something similar: http://nosqleast.com/2009/slides/sarkissian-cassandra.pdf

In a nutshell. For each pair of users they store a count of their common friends. Every time a user adds a new friend they increment the mutual friends count for all the appropriate pairs(notice how all the work is being done on the DB write, as opposed to read read). You consume a lot of memory, but the reads are really fast.

EugeneMi
  • 3,475
  • 3
  • 38
  • 57
1

Facebook stores users and relations in a graph database (see https://developers.facebook.com/docs/opengraph/). I'm not whether that's their primary internal data storage solutions (as far as I know they use Apache Cassandra which is NoSQL but column oriented similar to Google's BigTable), but at least they do have access to a graph of all users on Facebook. Graphs allow for interesting traversal techniques, which are far more powerful and performant for such data than conventional SQL queries.

Using for example the shortest path algorithm, it's very easy to find all friends of friends: See How to calculate mutual friends with neo4j?

Here is also an interesting blogpost by Emil Eifrem (one of the creators of Neo4j) concerning Facebook's Open Graph: http://blogs.neotechnology.com/emil/2010/04/on-the-facebook-open-graph-and-graph-databases.html

Community
  • 1
  • 1
Danilo Bargen
  • 18,626
  • 15
  • 91
  • 127
0

They can do so because they own that data and have direct access to getting it whereas we developers are funneled thru their API which has limitations (as well it should in most cases). They have teams of people assigned to ensure the data is indexed, stored, paginated and cached at just the right places to make the user experience the way it is.

DMCS
  • 31,720
  • 14
  • 71
  • 104
0

i don't see the n^2 index, i'm affraid... let's say table friendships has 100 entries per user with 100 friends - like this:

user_id friend_id
1       2
1       3
2       1
2       ...

then i would select the count like this + store the result to a cached variable in my profile...

with my_friends_view (friend_id) as (
  select friend_id
  from friendship
  where user_id = @my_user_id
)
select user_id "my_friend_id", count(*) "mutual_friends_count"
from friendship
where user_id in my_friends_view
and friend_id in my_friends_view
Aprillion
  • 21,510
  • 5
  • 55
  • 89
  • A view is not really what I'm searching for. The n^2 index is there, you just did it with your view, and that is no solution for this problem, because it's just an SQL Query. I suspect something like spatial indexing. But I'm no expert in that field. – Dave Halter Mar 17 '12 at 11:09
  • obviously, i don't understand why there has to be 1 `n^2` index instead of 2 `n` indexes?? care to enlighten me? – Aprillion Mar 17 '12 at 11:41
  • if you have n friends, and you have to do a lookup for every friend, you end up having n^2 lookups. Counting makes no difference, because DBMS like postgres have to fetch the rows and then return the count. – Dave Halter Mar 18 '12 at 15:38
  • I have really big problems to fully understand your query, sorry about that, but I tried now too long :-). I think the query is to complicated and has still a n^2 in it (through the in statement and the select on friendship). A 2n index would be something like selecting 1 id and then selecting entries from that id. But what you do is really selecting n entries and then make an other select on **each** of those entries - which equals **n^2**. – Dave Halter Mar 18 '12 at 16:09
  • **a)** the index size does not depend on the query efficiency, 2 indexes for the 2 columns are completelly sufficient and **b)** the time complexity of the query - `Θ(m * log n)`, where **m** (number of my friends, in 100s) **<< n** (number of all users, in 1 000 000s) - is far from `Θ(n^2)` (it will access an index for every of my 1-m friends in 'log n' time, not in 'n') – Aprillion Mar 18 '12 at 16:28
  • at least Oracle will access the index for `in` operator, i'm not so sure about Postgres :( – Aprillion Mar 18 '12 at 16:32
  • Sorry, I think I really forgot what the question was originally all about. As mentioned in the question above, I wanted to know (a year ago), how the facebook search works. The special thing is, that they are actually ordering with the mutual friend count. That means, there is either an index or a query. You are right about the O(n*log(n)), but that's just the proper optimization of the index querying. In the question I was not talking about the index, but rather O-Notation, but rather the index size. – Dave Halter Mar 18 '12 at 23:16
  • As you may guess, doing a query for every user in the facebook database is not really an option, even if it is just O(log(n)). – Dave Halter Mar 18 '12 at 23:16
  • oh, i didn't notice there this is so old an question ;)) i think we just disagree about the time to access 1 of 1 000 000 000 frienships - using an index of the size required for 5 000 000 unique ids (each having 200 friends in average) - it would require less than 30 block accesses - and `O(m*log(n))` is far from `O(n*log(n))` in practical terms for m = 200 and n = 5 000 000 => log2(n) = 29,897 – Aprillion Mar 18 '12 at 23:59
  • are you familiar with [index only scans](http://rhaas.blogspot.com/2010/11/index-only-scans.html)? count(*) in Postres 9.1 [does not seem to use indexes](http://postgresql.1045698.n5.nabble.com/COUNT-and-index-only-scans-td4888930.html) so you might experience O(n) instead of O(log(n)) – Aprillion Mar 19 '12 at 01:32
  • Didn't know that 9.1 has index only scans. That's a huge thing, thank you for the hint. But now to your answer: The O(m*log(n) might be right, but it's just a different question. In the facebook search people there is a sort algorithm, that also includes more or less the whole world (or at least it tries to, I've seen errors). And then it orders those with mutual friends in the front (in the meantime they have changed much, but people with many mutual friends are just in front of your search results). This is much more complicated than what you describe. – Dave Halter Mar 25 '12 at 14:27