What is the best (scalable, fast, reliable) approach to implementing an activity feed: a message queue, an RDBMS, or a NoSQL database?

Question

I need to build an activity feed (a "lifestream", to be more accurate) for a system very similar to many popular social networking platforms. My initial attempt used an RDBMS, but I quickly dropped the idea because of the large number of JOINs needed. Looking for other, better-suited approaches, I came across the following post:

How do social networking websites compute friend updates?

Taking the advice to use a message queue, I have spent some time studying RabbitMQ and the PubSubHubbub protocol, and I came up with the following approach:

1) Each user has a "topic".
2) Other users subscribe to the topic.
3) When the user performs some action, a message is published, which a PHP script then relates (resolves references), formats (human-friendly language, links, etc.) and aggregates (e.g. "X, Y and Z have commented on post P").
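The aggregation part of step 3 can be sketched roughly as follows. This is only a minimal illustration in Python rather than PHP; the `(actor, verb, target)` event shape and the function names `aggregate`, `format_group` and `render_feed` are assumptions made for the example, not part of any library.

```python
from collections import OrderedDict


def aggregate(events):
    """Group (actor, verb, target) events by (verb, target), keeping first-seen order."""
    groups = OrderedDict()
    for actor, verb, target in events:
        actors = groups.setdefault((verb, target), [])
        if actor not in actors:  # an actor is listed once per group
            actors.append(actor)
    return groups


def format_group(verb, target, actors):
    """Render one group as a human-friendly sentence."""
    if len(actors) == 1:
        who, have = actors[0], "has"
    else:
        who, have = ", ".join(actors[:-1]) + " and " + actors[-1], "have"
    return f"{who} {have} {verb} {target}"


def render_feed(events):
    """Aggregate raw events and format each group."""
    return [format_group(v, t, a) for (v, t), a in aggregate(events).items()]


# Hypothetical events, as they might arrive from the published messages.
events = [
    ("X", "commented on", "post P"),
    ("Y", "commented on", "post P"),
    ("Z", "liked", "photo Q"),
    ("Z", "commented on", "post P"),
]
feed = render_feed(events)
```

With the sample events, the three comments on post P collapse into a single feed entry, while the like stays separate.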

However, I would still have to go through each message and process it (unless my approach is completely wrong). So what would the difference be between storing everything in an RDBMS and using a message queue (other than the implementation of the PubSubHubbub protocol)?

Are there more efficient ways to build such a system? (If so, please specify)

Comments / Suggestions / Criticisms are welcome. :)

Thank you in advance!

P.S.: There is an interesting article on how FriendFeed implements it ( http://bret.appspot.com/entry/how-friendfeed-uses-mysql ). However, I feel the "hackery" pushes MySQL out of its comfortable domain, which is simply relational data. What would be the point of using an RDBMS without relational data?

P.P.S.: Another issue I see with using a message queue (perhaps because I am new to this technology) is that once a message is fetched by the "consumer", it is removed from the queue. However, I want it to persist for an arbitrary amount of time.

score 2 · Accepted Answer · answered Jan 19 '11 at 21:40


Some tips I would like to give you:

Don't use an RDBMS; use a fast in-memory database such as Redis instead. As I hope you will agree after looking at the Redis benchmarks, Redis is pretty fast. As a side note, installing Redis is child's play :).

make

There is a Redis client for PHP that is written in C, so that is also going to be very fast.

If I understand you correctly, you think that PubSubHubbub is the same as a message queue, but they aren't:

Parties (servers) speaking the PubSubHubbub protocol can get near-instant notifications (via webhook callbacks) when a topic (feed URL) they're interested in is updated.

Versus message queue:

In computer science, message queues and mailboxes are software-engineering components used for interprocess communication, or for inter-thread communication within the same process. They use a queue for messaging – the passing of control or of content.

You might think they are the same (they do have some similarities), but they aren't. For my message queue I would use Redis (Redis is very powerful because it also has a basic message queue :)). You can put a message (a unit of work) onto a queue using rpush.

rpush <name of queue> <message>

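Then your worker processes can receive messages from the queue using brpop (a blocking pop :)). Note that brpop takes from the same end that rpush appends to, so use blpop instead if messages must be processed in arrival order.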

brpop <name of queue> 0

The worker processes are started from the CLI and stay in memory, so there is no overhead from loading PHP into memory again and again.

php worker.php
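The producer/worker pattern described above can be sketched like this. It is only an illustration under stated assumptions: `InMemoryQueues` is a hypothetical in-memory stand-in for the Redis list commands (it is not a Redis client and does not block), the sketch is in Python rather than PHP, and the queue name `activity` is made up.

```python
from collections import deque


class InMemoryQueues:
    """Hypothetical stand-in for Redis lists; mimics RPUSH, RPOP and LPOP only."""

    def __init__(self):
        self._lists = {}

    def rpush(self, name, message):
        """Append to the tail of the list, like RPUSH; returns the new length."""
        queue = self._lists.setdefault(name, deque())
        queue.append(message)
        return len(queue)

    def rpop(self, name):
        """Take from the tail (newest first), like RPOP; None when empty."""
        queue = self._lists.get(name)
        return queue.pop() if queue else None

    def lpop(self, name):
        """Take from the head (oldest first), like LPOP; None when empty."""
        queue = self._lists.get(name)
        return queue.popleft() if queue else None


def run_worker(queues, name, handle, pop="lpop"):
    """Drain the queue, applying `handle` to each message.

    A real worker would block in BLPOP/BRPOP instead of stopping when empty.
    """
    take = getattr(queues, pop)
    processed = []
    while True:
        message = take(name)
        if message is None:
            break
        processed.append(handle(message))
    return processed


queues = InMemoryQueues()
for message in ["a", "b", "c"]:
    queues.rpush("activity", message)
oldest_first = run_worker(queues, "activity", str.upper, pop="lpop")

for message in ["a", "b", "c"]:
    queues.rpush("activity", message)
newest_first = run_worker(queues, "activity", str.upper, pop="rpop")
```

The two runs show why the pop side matters: pushing on the right and popping on the left processes messages in arrival order, while popping on the right processes the newest message first.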

I hope this is helpful, and if you have any questions I am very willing to answer them ;)

answered Jan 19 '11 at 21:40

Alfred


Alfred, firstly thank you for your reply and suggestions! Much appreciated. "If I understand you correctly you think that pubsubhubbub is the same as a message queue but they aren't": Yes, I do understand the difference (if you read my post, you can see that I am using RabbitMQ as the message queue for the PubSubHubbub protocol). I have, as you suggested, been reading about Redis (and its excellent tutorial: http://redis.io/topics/twitter-clone). However, sending all the updates to all the subscribers (with a loop): isn't that a little resource-intensive (considering, say, a million records)? – shachibista Jan 20 '11 at 13:14
... adding to my previous comment: shouldn't the model be reversed, such that subscribers only fetch records when they need them, i.e. not in advance? I hope I was able to clarify. – shachibista Jan 20 '11 at 13:15
@a110y With Redis everything is in memory (non-blocking), so it is going to be lightning fast. You could do that loop from the worker process(es) without any effort. You use the message queue to add a pointer (a reference to the KEY) to the tweet (message) for every user, instead of doing massively expensive JOINs (SQL!!!). Also, I would like to point out this excellent tutorial from Simon explaining Redis: http://simonwillison.net/static/2010/redis-tutorial/. I hope this makes sense too ;) – Alfred Jan 20 '11 at 13:46
+1 for the helpful link. I will try and implement a prototype before going full speed. However, could you clarify your implementation a bit more? What would be the best way to achieve my goal; using pub/sub in Redis or the "inbox" method? Apologies in advance if my questions sound too obvious, I am extremely new to Redis (only 1 day of exposure :p) and I am having a hard time wrapping my RDBMS head around other database paradigms (K/V, Document-based, etc.) – shachibista Jan 20 '11 at 22:50
@a110y Sorry, I was on vacation. I would start with pub/sub (it scales very well) because it is very easy to implement (1 second ;)). I think "inbox" will scale better, but that is for the future, when you have a lot of users ;). You should first get it working and then think about the scaling issues, I think... That said, the inbox method could also be written pretty quickly (within a day, maybe a couple ;)). I will try to write something up when I get some spare time... – Alfred Jan 24 '11 at 01:03
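
For readers wondering what the "inbox" method discussed in these comments might look like, here is a rough sketch. It is not the answerer's implementation: `FeedStore` and its methods are hypothetical names, plain Python dicts and lists stand in for Redis keys and lists, and the inbox cap of 100 is an arbitrary assumption. Each activity is stored once, and only its id is pushed onto every follower's inbox at publish time, so reading a feed needs no JOINs.

```python
from collections import defaultdict


class FeedStore:
    """Hypothetical in-memory model of fan-out on write (the "inbox" method)."""

    def __init__(self, inbox_limit=100):
        self.followers = defaultdict(set)  # author -> users following them
        self.activities = {}               # activity id -> stored activity
        self.inboxes = defaultdict(list)   # user -> activity ids, newest first
        self.inbox_limit = inbox_limit
        self._next_id = 1

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def publish(self, author, text):
        """Store the activity once, then push only its id to each follower."""
        activity_id = self._next_id
        self._next_id += 1
        self.activities[activity_id] = {"author": author, "text": text}
        for user in self.followers[author]:
            inbox = self.inboxes[user]
            inbox.insert(0, activity_id)   # comparable to LPUSH in Redis
            del inbox[self.inbox_limit:]   # comparable to LTRIM: cap the inbox
        return activity_id

    def read_feed(self, user, count=10):
        """Resolve the newest `count` ids in the inbox (comparable to LRANGE)."""
        return [self.activities[i] for i in self.inboxes[user][:count]]


store = FeedStore(inbox_limit=2)
store.follow("bob", "alice")
store.follow("carol", "alice")
for text in ["first", "second", "third"]:
    store.publish("alice", text)
bob_feed = [activity["text"] for activity in store.read_feed("bob")]
```

The trade-off raised in the comments is visible here: publishing costs one write per follower, while reading is a cheap slice of a precomputed list. The reverse model (fetching on read) would move that cost to every feed view instead.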
