I need to build a simple analytics back-end for capturing user behaviour. The data will be captured via a JavaScript snippet on a webpage, much like Google Analytics or Mixpanel.
The system needs to capture close-to-realtime browser data (scroll position of the page, mouse position, etc.). It will record the state of the user's page every 5 seconds. There are only three attributes in each measurement, but they have to be taken frequently.
The data doesn't necessarily need to be sent every 5 seconds; it could be batched and sent less frequently. However, it's imperative that I get all of the data while the user is on the page, i.e. I can't send it once per minute and lose the last 59 seconds of data for someone who leaves after 119 seconds.
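The requirement above (send less often than every 5 seconds, but never lose the tail of a session) can be sketched as a small client-side buffer that ships a batch when it fills up and is flushed explicitly when the page goes away. This is only a minimal sketch under stated assumptions: `Sample`, `SampleBuffer`, the `Transport` callback, the three attribute names and the `/collect` URL are all hypothetical, and the transport is injected so the buffering logic can be exercised outside a browser.

```typescript
// Hypothetical shape of one measurement: a timestamp plus three attributes.
interface Sample {
  t: number;       // capture time in ms since epoch
  scrollY: number; // assumed attribute: vertical scroll position
  mouseX: number;  // assumed attribute: mouse x
  mouseY: number;  // assumed attribute: mouse y
}

// Anything that can ship a batch to the server (hypothetical callback type).
type Transport = (batch: Sample[]) => void;

class SampleBuffer {
  private pending: Sample[] = [];
  private readonly send: Transport;
  private readonly maxBatch: number;

  // maxBatch = 6 means one request per 30 s at one sample per 5 s.
  constructor(send: Transport, maxBatch: number = 6) {
    this.send = send;
    this.maxBatch = maxBatch;
  }

  // Record one sample and ship the batch once it is full.
  record(sample: Sample): void {
    this.pending.push(sample);
    if (this.pending.length >= this.maxBatch) {
      this.flush();
    }
  }

  // Send whatever is buffered, e.g. when the page is being hidden or closed.
  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.send(batch);
  }
}

// In a browser this might be wired up roughly like so:
//   const buffer = new SampleBuffer(
//     (b) => { navigator.sendBeacon("/collect", JSON.stringify(b)); });
//   setInterval(() => buffer.record({ t: Date.now(), scrollY: window.scrollY,
//                                     mouseX: lastX, mouseY: lastY }), 5000);
//   document.addEventListener("visibilitychange", () => {
//     if (document.visibilityState === "hidden") buffer.flush();
//   });
```

Flushing on `visibilitychange` with `navigator.sendBeacon` is one common way to get the final partial batch out as the visitor leaves; whether it is reliable enough for this use case would need to be tested across the target browsers.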
If possible, I'd like to build a system that will scale for the foreseeable future, which means supporting 1,000 sites, each with 100 concurrent visitors, i.e. 100,000 concurrent users each sending one event every 5 seconds.
I'm not worried about querying the data; that can be done using a separate system. I'm most interested in how to handle the capture of the data itself.
Requirements
Based on the estimate above, the system needs to handle 20,000 events per second coming from a pool of 100,000 users.
I'd like to host this service on Heroku. However, while I've done a lot of work with Rails, I'm completely new to high-throughput systems (other than knowing you don't process them using Rails).
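The arithmetic behind that figure, and the effect batching would have on the request rate, can be checked with a few lines. This is a back-of-envelope sketch only: `ratesPerSecond` is a hypothetical helper, and the batch size of 6 samples per request (one request every 30 seconds) is an assumed example value, not a requirement.

```typescript
// Hypothetical helper: events and HTTP requests per second for a given load.
function ratesPerSecond(
  users: number,
  sampleIntervalSec: number,
  samplesPerRequest: number,
): { eventsPerSec: number; requestsPerSec: number } {
  const eventsPerSec = users / sampleIntervalSec;
  const requestsPerSec = eventsPerSec / samplesPerRequest;
  return { eventsPerSec, requestsPerSec };
}

// One request per sample: 100,000 users / 5 s.
const unbatched = ratesPerSecond(100_000, 5, 1);
console.log(unbatched.eventsPerSec); // 20000

// Assumed batch of 6 samples per request: same event volume, fewer requests.
const batched = ratesPerSecond(100_000, 5, 6);
console.log(Math.round(batched.requestsPerSec)); // 3333
```

The event volume that must be stored stays at 20,000 per second either way; batching only reduces how many requests the front-end servers have to accept.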
Questions
- Is there a commercial system that would be good for doing this (like Pusher but for data capture as well as distribution)?
- Should I be looking to do this using HTTP requests or websockets?
- Is Node.js the right choice for this or just trendy?
- If I were to choose a socket-based solution, how many sockets can a dyno on Heroku handle for each webserver?
- What are the pertinent considerations for choosing between Mongo / Redis etc. for storage?
- Is this the type of problem which actually requires two solutions - the first to get you to reasonable scale quickly and inexpensively and the second to take you past that scale on lower incremental cost but with more development effort required upfront?