
I'm returning A LOT (500k+) of documents from a MongoDB collection in Node.js. It's not for display on a website, but rather for some number crunching on the data. If I grab ALL of those documents, the system freezes. Is there a better way to grab them all?

I'm thinking pagination might work?

Edit: This is already outside the main Node.js server event loop, so "the system freezes" does not mean "incoming requests are not being processed".

Shamoon
  • why do you need it? Nodejs is not for data crunching. Would you consider using MongoDB map/reduce functionality? – Eldar Djafarov Nov 28 '11 at 15:44
  • Unfortunately we're locked into `Node.js` – Shamoon Nov 28 '11 at 15:48
  • Unless you tell us what exactly you are trying to do and how you're currently doing it, this is impossible to answer. – Tomalak Nov 28 '11 at 15:52
  • Do you have a code snippet to illustrate how you are fetching these docs from mongo? What exactly do you mean by "freezing" in this case - is it your node process choking or mongo itself? Make sure you are not using .toArray or anything that will try to force node to allocate huge blocks of memory as it exhausts the cursor for your query. You could do pagination with skip() and limit() but this shouldn't be relied on for queries that will be executed frequently, as it gets expensive. A better way to paginate might be to use $gt with the val of the last record of the page on an indexed field. – mpobrien Nov 28 '11 at 16:56
  • I'd like to second mpobrien's comment about skip/limit. At 500k documents, limit/skip is not an option. You should have some monotonic key you can use w/ `$gt` such as a timestamp (make sure it's indexed). Even with much less documents, skipping is too slow. – mnemosyn Nov 28 '11 at 19:04

3 Answers


I would put your big fetch+process task on a worker queue, background process, or forking mechanism (there are a lot of different options here).

That way you do your calculations outside of your main event loop and keep it free to process other requests. Even if the Mongo lookup itself happens in a callback, the calculations that follow are synchronous and can take a long time, which is what "freezes" node - you never give it a break to process other requests.
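As a rough sketch of the forking option (the file names and message shape here are invented for illustration, not taken from the question), you can hand the crunching to a child process with Node's built-in `child_process` module:

```js
// main.js - hand the heavy work to a separate process so the server's
// event loop stays responsive.
const { fork } = require('child_process');
const path = require('path');

function startCrunchJob(query) {
  const worker = fork(path.join(__dirname, 'crunch-worker.js'));

  worker.send({ query });              // pass the job parameters to the child
  worker.on('message', (result) => {   // the child reports back when done
    console.log('crunching finished:', result);
  });
  worker.on('exit', (code) => {
    if (code !== 0) console.error('worker exited with code', code);
  });
}

// crunch-worker.js - runs in its own process, so a long synchronous loop
// here cannot block the web server.
process.on('message', ({ query }) => {
  // ...fetch the documents for `query` and do the number crunching...
  process.send({ ok: true });
  process.exit(0);
});
```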

RyanWilcox

After learning more about your situation, I have some ideas:

  1. Do as much as you can in a Map/Reduce function in Mongo - if you throw less data at Node, that alone might be the solution.

  2. Perhaps this much data is eating all the memory on your system. Your "freeze" could be V8 stopping the system to do a garbage collection (see this SO question). You could use the V8 flag --trace-gc to log GCs and prove this hypothesis (thanks to another SO answer about V8 and garbage collection).

  3. Pagination, like you suggested, may help. Perhaps even splitting your data up further into worker queues (create one worker task with references to records 1-10, another with references to records 11-20, etc.), depending on how well your calculation can be partitioned. See the batching sketch after this list.

  4. Perhaps pre-processing your data - i.e. somehow returning much smaller data for each record, or not using an ORM for this particular calculation if you're using one now. Making sure each record has only the data you need in it means less data to transfer and less memory your app needs.
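A minimal sketch of points 3 and 4 combined, along the lines of the $gt approach from the comments. It assumes the official `mongodb` Node driver (4.x or newer); the collection name, field names, and batch size are placeholders:

```js
// Pull the records in indexed batches using $gt on _id instead of skip/limit,
// and project only the fields the calculation needs.
const { MongoClient } = require('mongodb');

async function processInBatches(uri, batchSize = 5000) {
  const client = new MongoClient(uri);
  await client.connect();
  const coll = client.db('mydb').collection('readings');

  let lastId = null;
  while (true) {
    const filter = lastId ? { _id: { $gt: lastId } } : {};
    const batch = await coll
      .find(filter, { projection: { value: 1, ts: 1 } }) // only what we need
      .sort({ _id: 1 })                                   // _id is always indexed
      .limit(batchSize)
      .toArray();                                         // small, bounded array

    if (batch.length === 0) break;
    crunch(batch);                 // number crunching on one small chunk at a time
    lastId = batch[batch.length - 1]._id;
  }
  await client.close();
}

function crunch(docs) { /* ... */ }
```

Filtering with $gt on an indexed field keeps every batch query cheap, whereas skip() has to walk past all the skipped documents on each page.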

RyanWilcox

Since you don't need them all at the same time (that's what I've deduced from you asking about pagination), perhaps it's better to separate those 500k documents into smaller chunks and process them on the next tick?
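A small sketch of that idea (the chunk size and the crunch callback are placeholders). It uses setImmediate rather than the nextTick mentioned above, since in current Node process.nextTick callbacks run before pending I/O and would not actually give the loop a break:

```js
// Process a slice of the documents, then yield back to the event loop
// before taking the next slice.
function processInChunks(docs, chunkSize, crunch, done) {
  let index = 0;

  function nextChunk() {
    const end = Math.min(index + chunkSize, docs.length);
    for (; index < end; index++) {
      crunch(docs[index]);          // synchronous work on one document
    }
    if (index < docs.length) {
      setImmediate(nextChunk);      // let pending I/O and timers run first
    } else {
      done();
    }
  }

  nextChunk();
}

// usage: processInChunks(allDocs, 500, doMath, () => console.log('finished'));
```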

You could also use something like Kue to queue the chunks and process them later (so that not everything runs at the same time).

alessioalex