
I currently have a csv file that is 1.3 million lines long. I'm trying to parse this file line by line and run a process on each line. The issue I am running into is that I run out of heap memory. I've read up online and tried a bunch of solutions to avoid storing the entire file in memory, but nothing seems to work. Here is my current code:

// Assumed imports (ESM, so the top-level `await` below works):
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import { once } from 'events';
import { parse as parse2 } from 'csv-parse/sync'; // assumed CSV parser

let index = 0;

// `file` is the path to the 1.3-million-line csv
const readLine = createInterface({
  input: createReadStream(file),
  crlfDelay: Infinity
});

readLine.on('line', async (line) => {
  let record = parse2(line, {
    delimiter: ',',
    skip_empty_lines: true,
    skip_lines_with_empty_values: false
  });

  // Do something with record

  index++;
  if (index % 1000 === 0) {
    console.log(index);
  }
});

// halts the process until all lines have been emitted
await once(readLine, 'close');

This starts off strong, but the heap slowly fills up until I run out of memory and the program crashes. I'm using a read stream, so I don't understand why the file is filling the heap.
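
One pattern I've seen suggested is to iterate the interface with for await...of, which pauses the underlying stream between lines so only one record is being processed at a time. Here is a minimal sketch of that (the csv-parse/sync import and the processRecord helper are placeholders for illustration, not my real code):

import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import { parse as parse2 } from 'csv-parse/sync'; // assumed CSV parser

const readLine = createInterface({
  input: createReadStream(file), // `file` is the csv path, as above
  crlfDelay: Infinity
});

let index = 0;

// The async iterator pulls one line at a time and applies backpressure,
// so only one record's processing is in flight at any moment.
for await (const line of readLine) {
  const record = parse2(line, {
    delimiter: ',',
    skip_empty_lines: true
  });

  await processRecord(record); // hypothetical per-record async work

  index++;
  if (index % 1000 === 0) {
    console.log(index);
  }
}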

Gary Holiday
  • Maybe this is what you're looking for: https://stackoverflow.com/a/23695940 – Juhil Somaiya Feb 11 '20 at 09:41
  • Did my solution help you? Are you still facing the same issue? – Sagar Chilukuri Feb 14 '20 at 13:53
  • Turns out my current solution actually does work; the issue was that what I was doing with each record was filling the heap. – Gary Holiday Feb 14 '20 at 16:13
  • @GaryHoliday How did you resolve it? – Mitanshu Jul 30 '22 at 06:20
  • @Mitanshu It turns out part (or maybe all) of the problem was what I was doing inside of the `// Do something with record` section. The process I was running was using all the heap memory. To fix that issue, at the start of my program I looked at how much heap memory was available, then I estimated how much memory each record's process would take, then I did `maxConcurrentProcesses = heapMemory/memoryOfProcess` and kept track of how many processes I had running. If I got close to the max, I'd stop reading from the csv until the in-flight processes went down, then I'd start again (see the sketch below). – Gary Holiday Jul 31 '22 at 03:11
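
A rough sketch of the approach described in that last comment, reusing the `readLine` and `parse2` from the question; `availableHeapBytes`, `estimatedBytesPerRecord`, and `processRecord` are hypothetical names standing in for the estimates and per-record work described above:

// Cap estimated from available heap and per-record memory, as described above
const maxConcurrentProcesses = Math.floor(availableHeapBytes / estimatedBytesPerRecord);
let inFlight = 0;

readLine.on('line', (line) => {
  const record = parse2(line, {
    delimiter: ',',
    skip_empty_lines: true
  });

  inFlight++;
  if (inFlight >= maxConcurrentProcesses) {
    // Stop pulling more lines from the csv; a few already-buffered
    // lines may still be emitted before the pause takes effect.
    readLine.pause();
  }

  processRecord(record) // hypothetical async work on each record
    .catch((err) => console.error(err))
    .finally(() => {
      inFlight--;
      if (inFlight < maxConcurrentProcesses) {
        readLine.resume(); // enough work has drained; keep reading
      }
    });
});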

2 Answers


Try using the csv-parser library: https://www.npmjs.com/package/csv-parser

const csv = require('csv-parser');
const fs = require('fs');

fs.createReadStream('data.csv')
  .pipe(csv())
  .on('data', (row) => {
    console.log(row);
  })
  .on('end', () => {
    console.log('CSV file successfully processed');
  });

Taken from: https://stackabuse.com/reading-and-writing-csv-files-with-node-js/

Daniel B.
  • Tried it, that exact code. Still runs out of heap memory. That's why I tried switching to readline. Getting worse performance now. – Gary Holiday Feb 11 '20 at 09:41

I tried something similar for a ~2 GB file and it worked without any issue using event-stream:

var fs = require('fs');
var eventStream = require('event-stream');

fs.createReadStream('veryLargeFile.txt')
  .pipe(eventStream.split()) // split the stream into individual lines
  .pipe(
    eventStream
      .mapSync(function (line) {
        // Do something with record `line`
      })
      .on('error', function (err) {
        console.log('Error while reading file.', err);
      })
      .on('end', function () {
        // On end
      })
  );

Please try it and let me know if it helps.

Sagar Chilukuri