144

I have always wondered how Facebook designed the friend <-> user relation.

I figure the user table is something like this:

user_email PK
user_id PK
password 

I figure the table with user's data (sex, age etc connected via user email I would assume).

How does it connect all the friends to this user?

Something like this?

user_id
friend_id_1
friend_id_2
friend_id_3
friend_id_N 

Probably not. Because the number of users is unknown and will expand.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
Marin
  • 12,531
  • 17
  • 56
  • 80
  • 14
    There is a Facebook Engineering page that has a lot of this type of information, but not quite what you are asking. You may want to ask there and see if you can get an answer. http://www.facebook.com/FacebookEngineering – John Meagher Jun 17 '09 at 19:42
  • 1
    Google `graph database`. It for sure is **not** an RDBMS. –  Jan 11 '16 at 00:15

11 Answers11

91

Keep a friend table that holds the UserID and then the UserID of the friend (we will call it FriendID). Both columns would be foreign keys back to the Users table.

Somewhat useful example:

Table Name: User
Columns:
    UserID PK
    EmailAddress
    Password
    Gender
    DOB
    Location

TableName: Friends
Columns:
    UserID PK FK
    FriendID PK FK
    (This table features a composite primary key made up of the two foreign 
     keys, both pointing back to the user table. One ID will point to the
     logged in user, the other ID will point to the individual friend
     of that user)

Example Usage:

Table User
--------------
UserID EmailAddress Password Gender DOB      Location
------------------------------------------------------
1      bob@bob.com  bobbie   M      1/1/2009 New York City
2      jon@jon.com  jonathan M      2/2/2008 Los Angeles
3      joe@joe.com  joseph   M      1/2/2007 Pittsburgh

Table Friends
---------------
UserID FriendID
----------------
1      2
1      3
2      3

This will show that Bob is friends with both Jon and Joe and that Jon is also friends with Joe. In this example we will assume that friendship is always two ways, so you would not need a row in the table such as (2,1) or (3,2) because they are already represented in the other direction. For examples where friendship or other relations aren't explicitly two way, you would need to also have those rows to indicate the two-way relationship.

TheTXI
  • 37,429
  • 10
  • 86
  • 110
  • 14
    think of how inefficient this is though - you have to do a disjunctive query on the columns of the many-to-many, doubling search time on average. – Anthony Bishopric Apr 04 '11 at 18:29
  • 2
    Personally, I wouldn't want those two fields to make a composite primary key. A unique key, absolutely. The clustered index on that unique key, definitely. But I'd also put some sort of non-composite identity as the PK with a nonclustered index. That would allow other tables that need a "friend relationship ID" FK to easily tie to this table and various triggers could fire to cascade events of friending, defriending, etc. – Jesse C. Slicer May 14 '12 at 22:05
  • 1
    It said that Facebook has around 1'000'000'000 users. If the average user has 100 friends, that means the table would contain 100'000'000'000 rows. MySQL partitioning? – veidelis Jun 04 '14 at 07:30
  • Forget this approach. If you get a serious amount of users it will definitely become *very* slow. See my answer and try the to benchmark it yourself. I've done some benchmarking with 10k users and 2.5 million friendship connections and the result was disappointing. If you run a small community it will work fine but there are performance issues to consider. – floriank Mar 21 '15 at 09:44
  • 13
    **you can be sure that facebook does not use a RDBMS for this, it is common knowledge that they, twitter and everyone else that needs to run queries like this use a graph database of some flavor.** there is at least 69 people that have never worked at any kind of scale or do not know how to do math at scale. –  Jan 11 '16 at 00:17
63

TL;DR:

They use a stack architecture with cached graphs for everything above the MySQL bottom of their stack.

Long Answer:

I did some research on this myself because I was curious how they handle their huge amount of data and search it in a quick way. I've seen people complaining about custom made social network scripts becoming slow when the user base grows. After I did some benchmarking myself with just 10k users and 2.5 million friend connections - not even trying to bother about group permissions and likes and wall posts - it quickly turned out that this approach is flawed. So I've spent some time searching the web on how to do it better and came across this official Facebook article:

I really recommend you to watch the presentation of the first link above before continue reading. It's probably the best explanation of how FB works behind the scenes you can find.

The video and article tells you a few things:

  • They're using MySQL at the very bottom of their stack
  • Above the SQL DB there is the TAO layer which contains at least two levels of caching and is using graphs to describe the connections.
  • I could not find anything on what software / DB they actually use for their cached graphs

Let's take a look at this, friend connections are top left:

enter image description here

Well, this is a graph. :) It doesn't tell you how to build it in SQL, there are several ways to do it but this site has a good amount of different approaches. Attention: Consider that a relational DB is what it is: It's thought to store normalised data, not a graph structure. So it won't perform as good as a specialised graph database.

Also consider that you have to do more complex queries than just friends of friends, for example when you want to filter all locations around a given coordinate that you and your friends of friends like. A graph is the perfect solution here.

I can't tell you how to build it so that it will perform well but it clearly requires some trial and error and benchmarking.

Here is my disappointing test for just findings friends of friends:

DB Schema:

CREATE TABLE IF NOT EXISTS `friends` (
`id` int(11) NOT NULL,
  `user_id` int(11) NOT NULL,
  `friend_id` int(11) NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8;

Friends of Friends Query:

(
        select friend_id
        from friends
        where user_id = 1
    ) union (
        select distinct ff.friend_id
        from
            friends f
            join friends ff on ff.user_id = f.friend_id
        where f.user_id = 1
    )

I really recommend you to create you some sample data with at least 10k user records and each of them having at least 250 friend connections and then run this query. On my machine (i7 4770k, SSD, 16gb RAM) the result was ~0.18 seconds for that query. Maybe it can be optimized, I'm not a DB genius (suggestions are welcome). However, if this scales linear you're already at 1.8 seconds for just 100k users, 18 seconds for 1 million users.

This might still sound OKish for ~100k users but consider that you just fetched friends of friends and didn't do any more complex query like "display me only posts from friends of friends + do the permission check if I'm allowed or NOT allowed to see some of them + do a sub query to check if I liked any of them". You want to let the DB do the check on if you liked a post already or not or you'll have to do in code. Also consider that this is not the only query you run and that your have more than active user at the same time on a more or less popular site.

I think my answer answers the question how Facebook designed their friends relationship very well but I'm sorry that I can't tell you how to implement it in a way it will work fast. Implementing a social network is easy but making sure it performs well is clearly not - IMHO.

I've started experimenting with OrientDB to do the graph-queries and mapping my edges to the underlying SQL DB. If I ever get it done I'll write an article about it.

How can I create a well performing social network site?

Update 2021-04-10: I'll probably never ever write the article ;) but here are a few bullet points how you could try to scale it:

  • Use different read and write repositories
  • Build specific read repositories based on faster non-relational DB systems made for that purpose, don't be afraid of denormalizing data. Write to a normalized DB but read from specialized views.
  • Use eventual consistence
  • Take a look at CQRS
  • For a social network graphs based read repositories might be also good idea.
  • Use Redis as a read repository in which you store whole serialized data sets

If you combine the points from the above list in a smart way you can build a very well performing system. The list is not a "todo" list, you'll still have to understand, think and adept it! https://microservices.io/ is a nice site that covers a few of the topics I mentioned before.

What I do is to store events that are generated by aggregates and use projects and handlers to write to different DBs as mentioned above. The cool thing about this is, I can re-build my data as needed at any time.

floriank
  • 25,546
  • 9
  • 42
  • 66
  • 1
    so.. did you ever get around to write the article? – Just a coder Apr 15 '16 at 00:27
  • 1
    No, I'm pretty busy besides doing programming and haven't had the time and mood to do so. The answer here contains everything you need to know if you want to implement performant friend associations. Either cache the friendlists per user or map your relational DB in parts or the whole thing to a graph and query the graph DB. You can use OrientDB or Neo4j for that. I would love to write my own open source social networking software but there is a ton of other things to do as well. Whatever you do: Do benchmarks. :) – floriank Apr 15 '16 at 11:58
  • Still no. But the OrientDB documentation explains the friend connections and everything else can be modelled once the basics are understood. http://orientdb.com/docs/2.1/Tutorial-Working-with-graphs.html If you want to use a relational DB as foundation then you just need to add some code in your "after save" and "after delete" callbacks to update your graph DB (which you would use for reading data). If you don't have such callbacks implement them but I guess almost all kind of ORM implementations and frameworks have something like that. Actually OrientDB can store documents as well. – floriank Nov 02 '16 at 15:28
  • 1
    Still no but we do something similar at work: We map our relational data to an Elastic Search index, as I wrote in my comment before, it's simply a matter of getting the data you want to store in the index or graph after a certain action (afterSave() / afterDelete() callback in our case) and then updating the index or graph. Pretty simple? :) Same could be done with the friend lists by the way, it doesn't really matter if you store them in ES, a graph or a memory based cache (as long as you have enough RAM). It's really not hard, the hard part is to make the whole thing scale when you grow. – floriank Sep 08 '17 at 11:24
55

Have a look at the following database schema, reverse engineered by Anatoly Lubarsky:

Facebook Schema

Brad Larson
  • 170,088
  • 45
  • 397
  • 571
  • 9
    This is a class diagram, not a database schema – PeakGen Jan 13 '15 at 10:34
  • 2
    So would each "User" have there own dedicated database? Like the one above? How would it work? E.g When the user logs on FB checks to see if it's a valid User + Pass and then if it's valid facebook will redirect them to there database which then displays everything from the above database – James111 May 02 '15 at 04:58
  • This Store only the information related to the user, I'm specifically searching for the Post and its audience? – Waseem Ahmad Naeem Apr 05 '19 at 06:37
37

My best bet is that they created a graph structure. The nodes are users and "friendships" are edges.

Keep one table of users, keep another table of edges. Then you can keep data about the edges, like "day they became friends" and "approved status," etc.

belgariontheking
  • 1,351
  • 1
  • 10
  • 15
  • 49
    I have a feeling you're going to have to explain that a bit more for some people here. – TheTXI Jun 17 '09 at 19:22
  • 4
    I think a more interesting question would be how to persist such a huge structure (we're talking about 200 million nodes and billions of edges) in a way that it can easily be searched and updated. – Dirk Vollmar Jun 17 '09 at 19:42
21

It's most likely a many to many relationship:

FriendList (table)

user_id -> users.user_id
friend_id -> users.user_id
friendVisibilityLevel

EDIT

The user table probably doesn't have user_email as a PK, possibly as a unique key though.

users (table)

user_id PK
user_email
password
Nathan Koop
  • 24,803
  • 25
  • 90
  • 125
  • 5
    While this certainly makes the most sense, I would think the performance would be horrendous given how many users Facebook has and how many friends each Facebook user has. – Kevin Pang Jun 17 '09 at 21:42
19

Take a look at these articles describing how LinkedIn and Digg are built:

There's also "Big Data: Viewpoints from the Facebook Data Team" that might be helpful:

http://developer.yahoo.net/blogs/theater/archives/2008/01/nextyahoonet_big_data_viewpoints_from_the_fac.html

Also, there's this article that talks about non-relational databases and how they're used by some companies:

http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php

You'll see that these companies are dealing with data warehouses, partitioned databases, data caching and other higher level concepts than most of us never deal with on a daily basis. Or at least, maybe we don't know that we do.

There are a lot of links on the first two articles that should give you some more insight.

UPDATE 10/20/2014

Murat Demirbas wrote a summary on

  • TAO: Facebook's distributed data store for the social graph (ATC'13)
  • F4: Facebook's warm BLOB storage system (OSDI'14)

http://muratbuffalo.blogspot.com/2014/10/facebooks-software-architecture.html

HTH

Adrian J. Moreno
  • 14,350
  • 1
  • 37
  • 44
12

It's not possible to retrieve data from RDBMS for user friends data for data which cross more than half a billion at a constant time so Facebook implemented this using a hash database (no SQL) and they opensourced the database called Cassandra.

So every user has its own key and the friends details in a queue; to know how cassandra works look at this:

http://prasath.posterous.com/cassandra-55

BenMorel
  • 34,448
  • 50
  • 182
  • 322
user362541
  • 121
  • 1
  • 3
6

Its a type of graph database: http://components.neo4j.org/neo4j-examples/1.2-SNAPSHOT/social-network.html

Its not related to Relational databases.

Google for graph databases.

zain
  • 61
  • 1
  • 1
5

You're looking for foreign keys. Basically you can't have an array in a database unless it has it's own table.


Example schema:

    Users Table
        userID PK
        other data
    Friends Table
        userID   -- FK to users's table representing the user that has a friend.
        friendID -- FK to Users' table representing the user id of the friend
Community
  • 1
  • 1
Malfist
  • 31,179
  • 61
  • 182
  • 269
0

Probably there is a table, which stores the friend <-> user relation, say "frnd_list", having fields 'user_id','frnd_id'.

Whenever a user adds another user as a friend, two new rows are created.

For instance, suppose my id is 'deep9c' and I add a user having id 'akash3b' as my friend, then two new rows are created in table "frnd_list" with values ('deep9c','akash3b') and ('akash3b','deep9c').

Now when showing the friends-list to a particular user, a simple sql would do that: "select frnd_id from frnd_list where user_id=" where is the id of the logged-in user (stored as a session-attribute).

deep9c
  • 17
  • 1
-1

Regarding the performance of a many-to-many table, if you have 2 32-bit ints linking user IDs, your basic data storage for 200,000,000 users averaging 200 friends apiece is just under 300GB.

Obviously, you would need some partitioning and indexing and you're not going to keep that in memory for all users.

Cade Roux
  • 88,164
  • 40
  • 182
  • 265