
I have a bookmarklet that, when used, submits all of the URLs on the current browser page to a Rails 3 app for processing. Behind the scenes I'm using Typhoeus to check that each URL returns a 2XX status code. Currently I initiate this process via an AJAX request to the Rails server and simply wait while it processes and returns the results. For a small set, this is very quick, but when the number of URLs is quite large, the user can be waiting for up to, say, 10-15 seconds.

I've considered using Delayed Job to process this outside the user's thread, but this doesn't seem like quite the right use-case. Since the user needs to wait until the processing is finished to see the results and Delayed Job may take up to five seconds before the job is even started, I can't guarantee that the processing will happen as soon as possible. This wait time isn't acceptable in this case unfortunately.

Ideally, what I think should happen is this:

  • User hits bookmarklet
  • Data is sent to the server for processing
  • A waiting page is instantly returned while spinning off a thread to do the processing
  • The waiting page periodically polls via AJAX for the results of the processing and updates itself (e.g. "4 of 567 URLs processed...")
  • The waiting page is updated with the results once they are ready

Some extra details:

  • I'm using Heroku (long running processes are killed after 30 seconds)
  • Both logged in and anonymous users can use this feature

Is this a typical way to do this, or is there a better way? Should I just roll my own off-thread processing that updates the DB during processing or is there something like Delayed Job that I can use for this (and that works on Heroku)? Any pushes in the right direction would be much appreciated.

markquezada
  • What did you do in the end? – Ari Aug 07 '13 at 01:13
  • @Ari it's been a long time since I worked on this, but in general I used a background processor (I'd use Sidekiq today) along with a state machine that tracked progress. Then I just polled using XHR on the frontend until the state was "complete" or whatever you need. – markquezada Aug 07 '13 at 02:17
  • Thanks. So I guess Thread.new wouldn't work on its own? – Ari Aug 07 '13 at 02:46

1 Answer


I think your latter idea makes the most sense. I would just offload the processing of each URL check to its own thread (so all the URL checks run concurrently -- which should be a lot faster than sequential checks anyway). As each finishes, it updates the database (making sure the threads don't step on each other's writes). An AJAX endpoint -- which, as you said, you poll regularly on the client side -- will grab and return the count of completed checks from the database. This is a simple enough method that I don't really see the need for any extra components.
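As a rough illustration of the thread-per-check idea, here is a minimal sketch in plain Ruby. Everything in it is an assumption made for illustration: `ProgressStore` is a hypothetical in-memory stand-in for the database, `start_checks` is a made-up helper, and `fake_checker` replaces the real HTTP request so the sketch runs without network access. It also assumes the URL list contains no duplicates.

```ruby
# Hypothetical in-memory progress store standing in for the database
# that the polling endpoint would read from.
class ProgressStore
  def initialize(total)
    @total = total
    @results = {}
    @lock = Mutex.new
  end

  # Called by worker threads; the mutex keeps concurrent writes from colliding.
  def record(url, ok)
    @lock.synchronize { @results[url] = ok }
  end

  # What a polling endpoint could return, e.g. to render "4 of 567 URLs processed...".
  def status
    @lock.synchronize do
      { done: @results.size, total: @total, complete: @results.size == @total }
    end
  end

  def results
    @lock.synchronize { @results.dup }
  end
end

# Starts one thread per URL and returns the threads without waiting for them.
# `checker` is any callable that returns an HTTP status code for a URL.
def start_checks(urls, store, checker)
  urls.map do |url|
    Thread.new do
      code = checker.call(url)
      store.record(url, (200..299).cover?(code))
    end
  end
end

# Stand-in checker (an assumption) so the sketch needs no network access.
fake_checker = ->(url) { url.include?("missing") ? 404 : 200 }

urls = [
  "http://example.com/a",
  "http://example.com/b",
  "http://example.com/missing"
]
store = ProgressStore.new(urls.size)
threads = start_checks(urls, store, fake_checker)

# The join is only here so the sketch finishes deterministically; in the flow
# described above the request would return and the client would poll `status`.
threads.each(&:join)
puts store.status.inspect
```

In a real app the `record` and `status` calls would go to whatever shared store the web processes can all see, since an in-process hash is only visible to the process that owns it.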

Ben Lee
  • Luckily Typhoeus processes the URLs in parallel, so that's much quicker than doing it serially. It also provides an on_complete callback that I can hook into. (Currently, I'm using it to cache the results in memcache.) I guess what I can't get my head around is this: How do I attach this data to a user? Especially if the user is anonymous. Session ID I guess? I kind of don't want this data to be stored in my DB if it's an anonymous user. – markquezada Nov 09 '10 at 22:27
  • It looks like you already have the system in place. Just add a session ID to the key(s) that you set in your Typhoeus on_complete handlers. The polling endpoint, which looks up these memcache keys by session ID, can then purge the relevant keys from memcache once everything has been processed and returned to the user. But based on your comment, I'm sure you already thought that through and have some issue with it -- but I'm not really following what that issue is. – Ben Lee Nov 09 '10 at 22:58
  • Ah, I guess I just didn't think to use memcache directly as a temporary store for the completed result data. I'm only using it right now to cache the result of the individual url crawl. (Not tied to a specific user.) But you're right, I could totally use memcache to store the complete result of a specific user's request temporarily. That way, it won't junk up the DB for anonymous users since it's not critical data. (It'll be saved persistently for registered users.) Great idea. Thanks for helping me think this through. – markquezada Nov 10 '10 at 00:09
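The session-keyed temporary store discussed in these comments could look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeCache` is a hypothetical stand-in for a memcache client (only `read`, `write`, and `delete` are used), `SessionResults` and the `url_check:` key prefix are invented names, and `record` is meant to be called from an on_complete-style callback.

```ruby
# Hypothetical stand-in for a memcache client.
class FakeCache
  def initialize
    @data = {}
  end

  def read(key)
    @data[key]
  end

  def write(key, value)
    @data[key] = value
  end

  def delete(key)
    @data.delete(key)
  end
end

# Stores one request's results under a key derived from the session ID.
class SessionResults
  def initialize(cache, session_id, total)
    @cache = cache
    @key = "url_check:#{session_id}" # key format is an assumption
    @total = total
    @lock = Mutex.new
    @cache.write(@key, {})
  end

  # Intended to be called from an on_complete-style callback for each URL.
  def record(url, status_code)
    @lock.synchronize do
      results = @cache.read(@key) || {}
      @cache.write(@key, results.merge(url => status_code))
    end
  end

  # Polling logic: report progress, and once every URL is in,
  # hand back the results and purge the key.
  def poll
    @lock.synchronize do
      results = @cache.read(@key) || {}
      if results.size < @total
        { done: results.size, total: @total }
      else
        @cache.delete(@key)
        { done: results.size, total: @total, results: results }
      end
    end
  end
end

cache = FakeCache.new
job = SessionResults.new(cache, "abc123", 2)
job.record("http://example.com/a", 200)
first = job.poll
job.record("http://example.com/b", 404)
final = job.poll
puts first.inspect
puts final.inspect
```

The mutex here only coordinates threads inside one process. With several server processes sharing a real cache, the read-modify-write in `record` would need some atomic alternative, such as one key per URL.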