144

I have a file which stores many JavaScript objects in JSON form and I need to read the file, create each of the objects, and do something with them (insert them into a db in my case). The JavaScript objects can be represented in one of two formats:

Format A:

[{name: 'thing1'},
....
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function() {
   console.log("Woot, imported objects into the database!");
});

Note, I wish to avoid reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak. I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

I can choose to use FormatA or FormatB or maybe something else, just please specify in your answer. Thanks!

Amol M Kulkarni
dgh
  • For format B you could parse through the chunk for new lines, and extract each whole line, concatenating the rest if it cuts off in the middle. There may be a more elegant way though. I haven't worked with streams too much. – travis Aug 08 '12 at 22:39

11 Answers

101

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

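var fs = require('fs');

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

stream.on('end', function() {
    if (buf.length > 0) processLine(buf); // flush a final line that has no trailing newline
    buf = '';
});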

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line=line.substr(0,line.length-1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices off the buffer from the beginning to the newline and hands it off to processLine. It then checks again if there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings – LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.
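To make the strictness concrete, here is a small sketch. The helper name tryParseLine is hypothetical (it is not part of the answer's code); it simply wraps JSON.parse so that a malformed line yields null instead of throwing.

```javascript
// Hypothetical helper: returns the parsed value, or null when the line
// is not valid JSON. Any non-syntax error is re-thrown unchanged.
function tryParseLine(line) {
  try {
    return JSON.parse(line);
  } catch (err) {
    if (err instanceof SyntaxError) return null;
    throw err;
  }
}

// Double-quoted key and value: accepted.
console.log(tryParseLine('{"name":"thing1"}'));

// Unquoted key and single-quoted value: rejected by JSON.parse.
console.log(tryParseLine("{name:'thing1'}"));
```

Whether to skip, log, or abort on a bad line is a design choice; returning null just keeps one bad record from stopping the whole import.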

Because the buffer never holds more than the current chunk plus any incomplete trailing line, this is extremely memory efficient. It is also fast: in a quick test, I processed 10,000 rows in under 15ms.
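The chunk-boundary behaviour can be checked without touching the file system. The sketch below restates the same buffering idea as a small object with write and end methods and feeds it chunks that deliberately cut objects in half. The name createLineParser and the sample data are made up for illustration.

```javascript
// Hypothetical restatement of the buffering approach: feed arbitrary
// chunks to write(), call end() once, receive one object per line.
function createLineParser(onObject) {
  let buf = '';
  return {
    write(chunk) {
      buf += chunk;
      let pos;
      while ((pos = buf.indexOf('\n')) >= 0) {
        const line = buf.slice(0, pos).trim(); // trim() also drops a trailing \r
        buf = buf.slice(pos + 1);              // keep only the unprocessed tail
        if (line.length > 0) onObject(JSON.parse(line));
      }
    },
    end() {
      const line = buf.trim(); // last line may lack a trailing newline
      buf = '';
      if (line.length > 0) onObject(JSON.parse(line));
    },
  };
}

// Chunks split in the middle of objects, mixing LF and CRLF endings.
const seen = [];
const parser = createLineParser((obj) => seen.push(obj));
parser.write('{"name":"thi');
parser.write('ng1"}\n{"name":"thing2"}\r\n{"na');
parser.write('me":"thing3"}');
parser.end();
console.log(seen.map((o) => o.name)); // [ 'thing1', 'thing2', 'thing3' ]
```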

Kevin B
josh3736
  • 15
    This answer is now redundant. Use JSONStream, and you have out of the box support. – arcseldon Jul 12 '14 at 05:45
  • 3
    The function name 'process' is bad. 'process' should be a system variable. This bug confused me for hours. – Zhigong Li Apr 29 '15 at 07:36
  • 1
    Please consider editing and adding a note that dedicated libraries now exist to do this, and may be preferable to this hand-rolled solution. See @arcseldon's answer at http://stackoverflow.com/a/24710073/500207 – Ahmed Fasih Apr 29 '15 at 13:33
  • 37
    @arcseldon I don't think the fact that there's a library that does this makes this answer redundant. It's certainly still useful to know how this can be done without the module. – Kevin B Aug 27 '15 at 18:00
  • 4
    I am not sure if this would work for a minified json file. What if the whole file was wrapped up in a single line, and using any such delimiters wasn't possible? How do we solve this problem then? – SLearner Aug 31 '15 at 12:47
  • 21
    Third party libraries are not made of magic you know. They are just like this answer, elaborated versions of hand-rolled solutions, but just packed and labeled as a program. Understanding how things work is much more important and relevant than blindly throwing data into a library expecting results. Just saying :) – zanona Mar 27 '16 at 14:00
  • Doesn't `buf += data` mean that everything coming back from the large file's stream will be stored in memory anyway? Doesn't this defeat the purpose of using a read stream? It seems like `fs.readFile` would be just as memory-inefficient. – Dan Jan 24 '21 at 19:26
  • 1
    @Dan Yes, the data is continually stored in the buffer as its read to be processed, but you'll notice at the end of `while` loop in `pump()`, we slice off the processed data after it's sent to `processLine()` for parsing. – Griffin Mar 04 '21 at 17:55
48

Just as I was thinking that it would be fun to write a streaming JSON parser, I also thought that maybe I should do a quick search to see if there's one already available.

Turns out there is.

Since I just found it, I've obviously not used it, so I can't comment on its quality, but I'll be interested to hear if it works.

It does work. Consider the following JavaScript, which uses _.isString to check the type of each parsed item:

stream.pipe(JSONStream.parse('*'))
  .on('data', (d) => {
    console.log(typeof d);
    console.log("isString: " + _.isString(d))
  });

This will log objects as they come in if the stream is an array of objects. Therefore the only thing being buffered is one object at a time.
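For readers curious how one object at a time can be pulled out of Format A without a line break between records, here is a minimal hand-rolled sketch. It is not how JSONStream is implemented; it only tracks brace depth and string state. It assumes the top-level array contains objects only, and the name createObjectSplitter is hypothetical.

```javascript
// Hypothetical sketch: emit each top-level object of a JSON array as soon
// as its closing brace arrives, regardless of where chunks are cut.
function createObjectSplitter(onObject) {
  let depth = 0;        // nesting depth of { ... }
  let inString = false; // currently inside a string literal
  let escaped = false;  // previous character was a backslash in a string
  let current = '';     // text of the object being accumulated

  return function write(chunk) {
    for (const ch of chunk) {
      if (inString) {
        current += ch;
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
      } else if (ch === '"' && depth > 0) {
        inString = true;
        current += ch;
      } else if (ch === '{') {
        depth += 1;
        current += ch;
      } else if (ch === '}') {
        depth -= 1;
        current += ch;
        if (depth === 0) {           // a complete top-level object
          onObject(JSON.parse(current));
          current = '';
        }
      } else if (depth > 0) {
        current += ch;               // content inside an object
      }
      // At depth 0, '[', ']', ',' and whitespace are ignored.
    }
  };
}

const found = [];
const write = createObjectSplitter((obj) => found.push(obj));
write('[{"a":"x}{"},{"b":');   // braces inside a string must not count
write('{"c":[1,2]}}]');        // second object completed by a later chunk
console.log(found.length);     // 2
```

Only the object currently being assembled is held in memory, which matches the "one object at a time" behaviour described above. A real parser also validates the input, which this sketch does not.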

Nick Bull
  • Just the thing I was searching for since 2 days! Thanks a lot :) – Atharva Kulkarni Aug 11 '21 at 14:51
  • 1
    @AtharvaKulkarni: [JSONstream hasn't been maintained since 2018](https://github.com/dominictarr/JSONStream/issues). You may want to evaluate [stream-json](https://www.npmjs.com/package/stream-json) or [@streamparser/json](https://www.npmjs.com/package/@streamparser/json). – Dan Dascalescu May 24 '23 at 22:43
41

As of October 2014, you can just do something like the following (using JSONStream) - https://www.npmjs.org/package/JSONStream

var fs = require('fs'),
    JSONStream = require('JSONStream');

var getStream = function () {
    var jsonData = 'myData.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err) {
    // handle any errors
});

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream()
    .pipe(es.mapSync(function (data) {
        console.log(data);
    }));

Running it:

$ node hello.js
// hello world
phwt
arcseldon
  • 2
    This is mostly true and useful, but I think you need to do `parse('*')` or you won't get any data. – John Zwinck Oct 02 '14 at 02:42
  • @JohnZwinck Thank you, have updated the answer, and added a working example to demonstrate it fully. – arcseldon Oct 02 '14 at 11:23
  • in the first code block, the first set of parentheses `var getStream() = function () {` should be removed. – givemesnacks Jul 30 '15 at 16:20
  • 4
    This failed with an out of memory error with a 500mb json file. – Keith John Hutchison Aug 16 '16 at 10:14
  • As of May 2023, [JSONstream hasn't been maintained since 2018](https://github.com/dominictarr/JSONStream/issues). You may want to evaluate [stream-json](https://www.npmjs.com/package/stream-json) or [@streamparser/json](https://www.npmjs.com/package/@streamparser/json). – Dan Dascalescu May 24 '23 at 22:46
29

I had a similar requirement: I needed to read a large JSON file in Node.js, process the data in chunks, call an API, and save the results in MongoDB. inputFile.json looks like this:

{
 "customers":[
       { /*customer data*/},
       { /*customer data*/},
       { /*customer data*/}....
      ]
}

I used JSONStream and event-stream to achieve this synchronously (one customer at a time).

var fs = require("fs");
var JSONStream = require("JSONStream");
var es = require("event-stream");

var fileStream = fs.createReadStream(filePath, { encoding: "utf8" });
fileStream.pipe(JSONStream.parse("customers.*")).pipe(
  es.through(
    function write(data) {
      console.log("printing one customer object read from file ::");
      console.log(data);
      this.pause(); // hold the stream until this customer has been saved
      processOneCustomer(data, this);
    },
    function end() {
      // second argument to es.through: called once the input is exhausted
      console.log("stream reading ended");
      this.emit("end");
    }
  )
);

function processOneCustomer(data, stream) {
  // DataModel is the application's own model and is not defined in this snippet
  DataModel.save(function(err, dataModel) {
    stream.resume(); // continue with the next customer
  });
}
1252748
karthick N
  • Thank you so much for adding your answer, my case also needed some synchronous handling. However after testing it was not possible for me to call "end()" as a callback after the pipe is finished. I believe the only thing which could be done is adding an event, what should happen after the stream is 'finished' / 'close' with `fileStream.on('close', ... )`. – nonNumericalFloat Dec 09 '19 at 22:32
  • 1
    Hey - this was a great solution BUT there's a typo in your code. You have a parenthesis closing BEFORE `function end ()` - but you need to move it afterward - otherwise end () is not included in the es.through(). – remed.io Mar 04 '21 at 15:11
  • [JSONstream hasn't been maintained since 2018](https://github.com/dominictarr/JSONStream/issues). You may want to evaluate [stream-json](https://www.npmjs.com/package/stream-json) or [@streamparser/json](https://www.npmjs.com/package/@streamparser/json). – Dan Dascalescu May 24 '23 at 22:43
28

I realize that you want to avoid reading the whole JSON file into memory if possible, however if you have the memory available it may not be a bad idea performance-wise. Using node.js's require() on a json file loads the data into memory really fast.

I ran two tests to see what the performance looked like when printing out an attribute from each feature in an 81MB geojson file.

In the 1st test, I read the entire geojson file into memory using var data = require('./geo.json'). That took 3330 milliseconds and then printing out an attribute from each feature took 804 milliseconds for a grand total of 4134 milliseconds. However, it appeared that node.js was using 411MB of memory.

In the second test, I used @arcseldon's answer with JSONStream + event-stream. I modified the JSONPath query to select only what I needed. This time the memory never went higher than 82MB, however, the whole thing now took 70 seconds to complete!
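A comparison like this can be reproduced with Node's built-in timers and memory counters. The sketch below is not the script used for the numbers above; measure is a hypothetical helper, and it reports resident memory at the moment the task finishes, which may be lower than the peak reached during the run.

```javascript
// Hypothetical helper: time a (possibly async) task and report the
// process's resident set size once it completes.
async function measure(label, task) {
  const start = process.hrtime.bigint();              // nanosecond clock
  const result = await task();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  const rssMb = process.memoryUsage().rss / (1024 * 1024);
  console.log(`${label}: ${elapsedMs.toFixed(1)} ms, RSS ${rssMb.toFixed(1)} MB`);
  return { result, elapsedMs, rssMb };
}

// Example: time parsing of a small in-memory document.
measure('JSON.parse', () => JSON.parse('[1, 2, 3]'));
```

To compare approaches fairly, run each one in a fresh process, since memory retained by an earlier test would inflate the reading for a later one.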

Evan Siroky
  • JSONstream had some [memory issues and hasn't been maintained since 2018](https://github.com/dominictarr/JSONStream/issues). You may want to evaluate [stream-json](https://www.npmjs.com/package/stream-json) or [@streamparser/json](https://www.npmjs.com/package/@streamparser/json). – Dan Dascalescu May 24 '23 at 22:44
10

I wrote a module that can do this, called BFJ. Specifically, the method bfj.match can be used to break up a large stream into discrete chunks of JSON:

const bfj = require('bfj');
const fs = require('fs');

const stream = fs.createReadStream(filePath);

bfj.match(stream, (key, value, depth) => depth === 0, { ndjson: true })
  .on('data', object => {
    // do whatever you need to do with object
  })
  .on('dataError', error => {
    // a syntax error was found in the JSON
  })
  .on('error', error => {
    // some kind of operational error occurred
  })
  .on('end', error => {
    // finished processing the stream
  });

Here, bfj.match returns a readable, object-mode stream that will receive the parsed data items, and is passed 3 arguments:

  1. A readable stream containing the input JSON.

  2. A predicate that indicates which items from the parsed JSON will be pushed to the result stream.

  3. An options object indicating that the input is newline-delimited JSON (this is for format B from the question; it's not required for format A).

Upon being called, bfj.match will parse JSON from the input stream depth-first, calling the predicate with each value to determine whether or not to push that item to the result stream. The predicate is passed three arguments:

  1. The property key or array index (this will be undefined for top-level items).

  2. The value itself.

  3. The depth of the item in the JSON structure (zero for top-level items).

Of course a more complex predicate can also be used as necessary according to requirements. You can also pass a string or a regular expression instead of a predicate function, if you want to perform simple matches against property keys.

Phil Booth
5

If you have control over the input file, and it's an array of objects, you can solve this more easily. Arrange to output the file with each record on one line, like this:

[
   {"key": value},
   {"key": value},
   ...
]

This is still valid JSON.

Then, use the node.js readline module to process them one line at a time.

var fs = require("fs");

var lineReader = require('readline').createInterface({
    input: fs.createReadStream("input.txt")
});

lineReader.on('line', function (line) {
    line = line.trim();

    if (line.charAt(line.length-1) === ',') {
        line = line.substr(0, line.length-1);
    }

    if (line.charAt(0) === '{') {
        processRecord(JSON.parse(line));
    }
});

function processRecord(record) {
    // Process the records one at a time here! 
}
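
Since the question is about database inserts, a variant that waits for each insert before handling the next line may be useful. The sketch below uses readline's async iterator (available in current Node.js versions) instead of the 'line' event. The names importLines and insertRecord are hypothetical, and insertRecord stands in for whatever async database call you use.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical sketch: read one record per line and await an async
// insert for each, so that pending inserts do not pile up.
async function importLines(filePath, insertRecord) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let count = 0;
  for await (const rawLine of rl) {
    let line = rawLine.trim();
    if (line.endsWith(',')) line = line.slice(0, -1); // drop the array separator
    if (!line.startsWith('{')) continue;              // skip '[', ']' and blanks
    await insertRecord(JSON.parse(line));             // wait before the next line
    count += 1;
  }
  return count; // number of records handed to insertRecord
}
```

A call might look like importLines('input.txt', (record) => collection.insertOne(record)), where collection is assumed to be a database handle of your own. The same loop works for format B, since lines without a trailing comma pass through unchanged.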
Steve Hanov
4

I solved this problem using the split npm module. Pipe your stream into split, and it will "Break up a stream and reassemble it so that each line is a chunk".

Sample code:

var fs = require('fs')
  , split = require('split')
  ;

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var lineStream = stream.pipe(split());
lineStream.on('data', function(chunk) {
    if (chunk.length === 0) return; // skip blank lines, e.g. the empty chunk after a trailing newline
    var json = JSON.parse(chunk);
    // ...
});
Brian Leathem
0

Using the @josh3736 answer, but for ES2021 and Node.js 16+ with async/await + AirBnb rules:

import fs from 'node:fs';

const file = 'file.json';

/**
 * @callback itemProcessorCb
 * @param {object} item The current item
 */

/**
 * Process each data chunk in a stream.
 *
 * @param {import('fs').ReadStream} readable The readable stream
 * @param {itemProcessorCb} itemProcessor A function to process each item
 */
async function processChunk(readable, itemProcessor) {
  let data = '';
  let total = 0;

  // eslint-disable-next-line no-restricted-syntax
  for await (const chunk of readable) {
    // join with last result, remove CR and get lines
    const lines = (data + chunk).replace(/\r/g, '').split('\n');

    // clear last result
    data = '';

    // process lines
    let line = lines.shift();
    const items = [];

    while (line !== undefined) {
      // check that it isn't an empty line or an array definition
      if (line !== '' && !/[\[\]]+/.test(line)) {
        try {
          // remove the last comma and parse json
          const json = JSON.parse(line.replace(/\s?(,)+\s?$/, ''));
          items.push(json);
        } catch (error) {
          // last line gets only a partial line from chunk
          // so we add this to join at next loop
          data += line;
        }
      }

      // continue
      line = lines.shift();
    }

    total += items.length;

    // Process items in parallel
    await Promise.all(items.map(itemProcessor));
  }

  console.log(`${total} items processed.`);
}

// Process each item
async function processItem(item) {
  console.log(item);
}

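// Init
const readable = fs.createReadStream(file, {
  flags: 'r',
  encoding: 'utf-8',
});

// processChunk is async, so failures arrive as a rejected promise
// rather than as an exception a surrounding try/catch could see
processChunk(readable, processItem).catch((error) => {
  console.error(error.message);
});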

For a JSON like:

[
  { "name": "A", "active": true },
  { "name": "B", "active": false },
  ...
]
Gabriel Anderson
-2
https.get(url1 , function(response) {
  var data = ""; 
  response.on('data', function(chunk) {
    data += chunk.toString(); 
  }) 
  .on('end', function() {
    console.log(data)
  });
});
Saeed Zhiany
  • Please edit your answer and describe how this code resolves the problem of *parsing a large **JSON file***. – trincot Aug 16 '22 at 12:33
  • Please read [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). While this code block may answer the OP's question, this answer would be much more useful if you explain how this code is different from the code in the question, what you've changed, why you've changed it and why that solves the problem without introducing others. – Saeed Zhiany Aug 16 '22 at 13:24
-8

I think you need to use a database. MongoDB is a good choice in this case because it is JSON compatible.

UPDATE: You can use mongoimport tool to import JSON data into MongoDB.

mongoimport --collection collection --file collection.json
Vadim Baryshev