I have to parse a large (100+ MB) JSON file with the following format:
{
    "metadata": {
        "account_id": 1234
        // etc.
    },
    "transactions": [
        {
            "transaction_id": 1234,
            "amount": 2
        },
        // etc. for (potentially) 1000's of lines
    ]
}
The output of this parsing is a JSON array with the account_id appended to each of the transactions:
[
    {
        "account_id": 1234,
        "transaction_id": 1234,
        "amount": 2
    },
    // etc.
]
I'm using the stream-json library to avoid loading the whole file into memory at once. stream-json lets me pick individual properties and then stream them one at a time, depending on whether they're an array or an object.
I'm also trying to avoid reading the file twice, by piping the same read stream of the JSON file into two separate pipelines, which Node.js allows.
I'm using a Transform stream to generate the output, setting a property on the Transform stream object that stores the account_id.
Pseudo code (with obvious race condition) below:
const fs = require('fs');
const { parser } = require('stream-json');
const { pick } = require('stream-json/filters/Pick');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { streamObject } = require('stream-json/streamers/StreamObject');
const Chain = require('stream-chain');
const { Transform } = require('stream');

let createOutputObject = new Transform({
    writableObjectMode: true,
    readableObjectMode: true,
    transform(chunk, enc, next) {
        if (createOutputObject.account_id !== null) {
            // generate the output object
        } else {
            // Somehow store the chunk until we get the account_id...
        }
        next();
    }
});
createOutputObject.account_id = null;
let jsonRead = fs.createReadStream('myJSON.json');
let metadataPipeline = new Chain([
    jsonRead,
    parser(),
    pick({filter: 'metadata'}),
    streamObject(),
]);
metadataPipeline.on('data', data => {
    if (data.key === 'account_id') {
        createOutputObject.account_id = data.value;
    }
});
let generatorPipeline = new Chain([
    jsonRead, // Note same Readable stream as above
    parser(),
    pick({filter: 'transactions'}),
    streamArray(),
    createOutputObject,
    transformToJSONArray(),
    fs.createWriteStream('myOutput.json')
]);
To resolve this race condition (i.e. converting to JSON array before account_id
is set), I've tried:
- Using createOutputObject.cork() to hold data up until account_id is set.
  - The data just passes through to transformToJSONArray().
- Keeping the chunks in an array in createOutputObject until account_id is set (roughly sketched after this list).
  - Can't figure out how to re-add the stored chunks after account_id is set.
- Using setImmediate() and process.nextTick() to call createOutputObject.transform later on, hoping that account_id is set by then.
  - Overloaded the stack so that nothing could get done.
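For reference, here is roughly what the buffering attempt looked like. The pending array and the flush inside the metadata 'data' handler are my own guesses at how the re-add should work; I'm not confident that calling push() from outside transform() is safe or preserves ordering, which is part of what I'm asking.

// Rough sketch of the buffering attempt (second bullet above).
// The pending array and the flush below are guesses at how re-adding
// the stored chunks should work.
let createOutputObject = new Transform({
    writableObjectMode: true,
    readableObjectMode: true,
    transform(chunk, enc, next) {
        if (createOutputObject.account_id !== null) {
            // account_id already known: emit the combined object directly
            this.push({ account_id: createOutputObject.account_id, ...chunk.value });
        } else {
            // account_id not known yet: buffer the chunk
            createOutputObject.pending.push(chunk);
        }
        next();
    }
});
createOutputObject.account_id = null;
createOutputObject.pending = [];

metadataPipeline.on('data', data => {
    if (data.key === 'account_id') {
        createOutputObject.account_id = data.value;
        // Attempt to re-emit the buffered chunks; not sure push() from
        // outside transform() keeps them ordered ahead of later chunks.
        for (const buffered of createOutputObject.pending) {
            createOutputObject.push({ account_id: data.value, ...buffered.value });
        }
        createOutputObject.pending = [];
    }
});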
I've considered using stream-json's streamValues function, which would allow me to do a pick of metadata and transactions. But the documentation leads me to believe that all of transactions would be loaded into memory, which is what I'm trying to avoid:
As every streamer, it assumes that individual objects can fit in memory, but the whole file, or any other source, should be streamed.
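For what it's worth, this is roughly what I had in mind with streamValues (names are placeholders, and this is just my reading of the API): pick both top-level keys with a single filter, then let streamValues assemble each picked value. As I understand it, that would make the entire transactions array a single in-memory value.

const { streamValues } = require('stream-json/streamers/StreamValues');

let combinedPipeline = new Chain([
    fs.createReadStream('myJSON.json'),
    parser(),
    pick({filter: /^(metadata|transactions)$/}),
    streamValues(),
]);

combinedPipeline.on('data', data => {
    // data.value would be either the whole metadata object
    // or (I believe) the whole transactions array
});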
Is there something else that can resolve this race condition? Is there any way I can avoid parsing this large JSON stream twice?