
At the moment, I'm trying to request a very large JSON object from an API (particularly this one) which, depending on various factors, can be upwards of a few MB. The problem, however, is that Node.js takes forever to do anything and then just runs out of memory: the first line of my response callback never executes.

I could request each item individually, but that is a tremendous number of requests. To quote a dev behind the new API:

Until now, if you wanted to get all the market orders for Tranquility you had to request every type per region individually. That would generally be 50+ regions multiplied by upwards of 13,000 types. Even if it was just 13,000 types and 50 regions, that is 650,000 requests required to get all the market information. And if you wanted to get all the data in the 5-minute cache window, it would require almost 2,200 requests per second.

Obviously, that is not a great idea.

I'm trying to get the array items into redis for use later, then follow the next url and repeat until the last page is reached. Is there any way to do this?

EDIT: Here's the problem code. Visiting the URL works fine in-browser.

    // ...
    REGIONS.forEach((region) => {
      LOG.info(' * Grabbing data for `' + region.name + '#' + region.id + '`');
      var href = url + region.id + '/orders/all/', next = href;
      var page = 1;
      while (!!next) {
        https.get(next, (res) => {
          LOG.info(' *  * Page ' + page++ + ' responded with ' + res.statusCode);
      // ...

The first LOG.info line executes, while the second does not.

CynicalBusiness
  • If the response is only a few MB, why are you running out of memory? I think you'd want to start with that question. I just measured the JSON response and it's 6.23MB. – jfriend00 Jun 04 '16 at 19:52
  • The documentation warns the page can be "several" MB in size, which could mean anything. Either way, there's still a problem with memory and time taken to execute. It does not take this long to simply visit the link in a browser. – CynicalBusiness Jun 04 '16 at 19:56
  • Please show us your node.js code. It works fine even in the browser here: https://jsfiddle.net/jfriend00/qscyqt7d/ – jfriend00 Jun 04 '16 at 19:56
  • I added the code to the post. – CynicalBusiness Jun 04 '16 at 20:02
  • We need to see more context of your server code. At first blush, I'm guessing that the `while (!!next)` is wrong and you're infinite-looping, waiting for a variable to be set, which can never happen until you stop looping. As long as your `while` loop is running, the callback to `https.get()` can never be called. You probably need to change the way you are iterating because your networking calls are async, not synchronous. – jfriend00 Jun 04 '16 at 20:04
  • While I don't believe it's the while loop alone, I have traced it to some strange behavior between the loop and the `async` module. I'll work on investigating that. – CynicalBusiness Jun 04 '16 at 20:13
  • Why don't you just include more of the relevant code so we could actually answer the question you asked with some reasonable detail? As it is now, it's unanswerable, since your original hypothesis about large JSON is not actually the issue. If even the browser can handle this JSON, it's not a large-JSON issue in node.js unless your node.js environment is really low-memory hardware. – jfriend00 Jun 04 '16 at 20:15

3 Answers


It appears that you are running a `while (!!next)` loop, which is the cause of your problem. If you show more of the server code, we could advise more precisely and even suggest a better way to code it.

JavaScript runs your code single-threaded. That means one thread of execution runs to completion before any other events can be processed.

So, if you do:

    while (!!next) {
        https.get(..., (res) => {
            // hoping this will run
        });
    }

Then, your callback to https.get() will never get called. Your while loop just keeps running forever, and as long as it is running, the callback from https.get() can never be called. That request has likely long since completed and there's an event sitting in the internal JS event queue waiting to call the callback, but until your while() loop finishes, that event can't be processed. So you have a deadlock: the while() loop is waiting for something else to run and change its condition, but nothing else can run until the while() loop is done.

There are several other ways to do serial async iterations. In general, you can't use .forEach() or while().

Here are several schemes for async looping:

Node.js: How do you handle callbacks in a loop?

While loop with jQuery async AJAX calls

How to synchronize a sequence of promises?

How to use after and each in conjunction to create a synchronous loop in underscore js

Or, the async library, which you mentioned, also has functions for doing async looping.
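To make the shape of such a loop concrete, here is a minimal sketch that fetches pages serially by recursing from inside the response callback, so each request only starts after the previous one has finished. The paging-by-query-string scheme and the empty-page stop condition are assumptions for illustration, not this API's actual behavior:

    // Sketch only: serial paging by recursing from the response callback
    // instead of looping with while(). Assumes ?page=N paging and that an
    // empty array means the last page has been reached.
    const https = require('https');

    function getPage(pageUrl, callback) {
      https.get(pageUrl, (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => callback(null, JSON.parse(body)));
      }).on('error', callback);
    }

    function fetchAllPages(baseUrl, page, done) {
      getPage(baseUrl + '?page=' + page, (err, orders) => {
        if (err) return done(err);
        // ... push `orders` into redis here ...
        if (!orders.length) return done(null);  // hypothetical "last page" check
        fetchAllPages(baseUrl, page + 1, done); // next request starts only now
      });
    }

    fetchAllPages(url + region.id + '/orders/all/', 1, (err) => {
      if (err) LOG.error(err);
    });

The same serial structure can also be expressed with promises, or with helpers such as whilst() and eachSeries() from the async library.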

jfriend00
  • Added several references about iterating async operations. – jfriend00 Jun 04 '16 at 20:19
  • This should be the accepted answer. I am leaving mine just in case someone finds this question through google actually looking for the problem of handling large json payloads. – lorefnon Jun 04 '16 at 20:29

First of all, a few MB of JSON payload is not exactly huge, so the code handling the response might require some close scrutiny.

However, to actually deal with huge amounts of JSON, you can consume your request as a stream. JSONStream (along with many other similar libraries) allows you to do this in a memory-efficient way. You can specify the paths you need to process using JSONPath (an XPath analog for JSON) and then subscribe to the stream for matching data sets.

The following example from the JSONStream README illustrates this succinctly:

    var request = require('request')
      , JSONStream = require('JSONStream')
      , es = require('event-stream')

    request({url: 'http://isaacs.couchone.com/registry/_all_docs'})
      .pipe(JSONStream.parse('rows.*'))
      .pipe(es.mapSync(function (data) {
        console.error(data)
        return data
      }))
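Applied to the question's endpoint, and assuming the response body is a top-level JSON array of order objects, the same pattern might look roughly like this (the `'*'` path, the Redis key, and the storage scheme are assumptions for illustration):

    var request = require('request')
      , JSONStream = require('JSONStream')
      , es = require('event-stream')
      , redis = require('redis')
      , client = redis.createClient()

    // '*' matches every element of a top-level array, so each order is
    // handled as it arrives instead of buffering the whole multi-MB body.
    request({url: url + region.id + '/orders/all/'})
      .pipe(JSONStream.parse('*'))
      .pipe(es.mapSync(function (order) {
        client.rpush('orders:' + region.id, JSON.stringify(order))
        return order
      }))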
lorefnon
  • It doesn't seem like large JSON is actually the issue here at all. It's more likely an issue with an infinite loop and async looping. – jfriend00 Jun 04 '16 at 20:14

Use the stream functionality of the request module to process large amounts of incoming data. As data comes through the stream, parse it into chunks that can be worked with, push each chunk through the pipe, and pull in the next chunk.

You might create a transform stream to manipulate a chunk of data that has been parsed and a write stream to store the chunk of data.

For example:

    var stream = request({ url: your_url }).pipe(parseStream)
        .pipe(transformStream)
        .pipe(writeStream);

    stream.on('finish', () => {
        setImmediate(() => process.exit(0));
    });
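As a minimal sketch of what `transformStream` and `writeStream` might look like, assuming the parser emits one order object per chunk (the field names and the Redis storage are hypothetical):

    const { Transform, Writable } = require('stream');

    // Transform stream: reshape each parsed chunk before it is stored.
    const transformStream = new Transform({
        objectMode: true,
        transform(order, encoding, callback) {
            // Keep only the fields we care about (hypothetical field names).
            callback(null, { id: order.order_id, price: order.price });
        }
    });

    // Write stream: persist each transformed chunk, e.g. into a Redis list.
    const writeStream = new Writable({
        objectMode: true,
        write(order, encoding, callback) {
            client.rpush('orders', JSON.stringify(order), callback); // assumes a redis client
        }
    });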

For more info on creating streams, see https://bl.ocks.org/joyrexus/10026630

Cmaddux