
Although I have found many examples of reading a text file line by line, or reading the Nth line, I cannot find anything on how to read from the Nth to the Mth line.

The file is somewhat big, ~5 Gigabytes (~10 million lines).

EDIT: The lines don't have fixed length.

plexus

1 Answer


You can use the readline module to read the file as a stream, without loading it into RAM as a whole. Here is an example of how it can be done:

const fs = require('fs');
const readline = require('readline');

function readFromN2M(filename, n, m, func) {
  const lineReader = readline.createInterface({
    input: fs.createReadStream(filename),
  });

  let lineNumber = 0;

  lineReader.on('line', function(line) {
    lineNumber++;
    if (lineNumber >= n && lineNumber < m) {
      func(line, lineNumber);
    }
  });
}

Let's try it:

// whatever you would like to do with those lines
const fnc = (line, number) => {
  // e.g. print them to console like this:
  console.log(`--- number: ${number}`);
  console.log(line);
};

// read from this very file, lines from 4 to 7 (excluding 7):
readFromN2M(__filename, 4, 7, fnc);

This gives the output:

//  --- number: 4
//  function readFromN2M(filename, n, m, func) {
//  --- number: 5
//    const lineReader = readline.createInterface({
//  --- number: 6
//      input: fs.createReadStream(filename),

Lines are numbered starting from 1. To start from 0, just adjust the numbering slightly.

UPDATE:

I've just realized that this approach is not 100% safe, in the sense that if a file does not end with a newline character, its very last line will not be read this way. That is how readline is designed... To overcome this, I plan to prepare the file streams in a slightly more sophisticated way, by appending a newline character to the stream when required. This makes the solution somewhat longer, but it is all possible.

UPDATE 2

As you've mentioned in the comment, the lineReader continues to walk through the file even after the desired lines have already been found, which slows down the application. I think we can stop it like this:

lineReader.on('line', function(line) {
  lineNumber++;
  if (lineNumber >= n && lineNumber < m) {
    func(line, lineNumber);
  }
  // the next three lines should stop lineReader "soon", but not
  // immediately, as explained in the official docs:
  if (lineNumber >= m) {
    lineReader.close();
  }
});

I believe this should do the trick.

Hero Qu
  • After reading the needed lines I noticed a delay, I guess because lineReader continues till the end of the file. Is there a way to break out of this useless iteration? That would make it perfect. – plexus Dec 07 '19 at 20:19
  • @plexus I've just added the "Update 2" section to the answer. I hope this works, but I haven't checked it myself. Please try it out. – Hero Qu Dec 07 '19 at 22:26
  • I'm afraid there is no difference. For reading 500 lines positioned in the middle of the file, both methods required ~35 seconds. – plexus Dec 08 '19 at 11:12
  • I would measure the time taken by the "lines' consumption" part, maybe it is not optimized? Otherwise I don't think that reading a file with readline is overly slow. I would try with a no-op `func = () => {}` and see if it is still as slow... – Hero Qu Dec 08 '19 at 15:33
  • @plexus Have you figured it out? If the slowness is really due to how readline functions, then the only thing I can currently think of to make it all faster is to 1) find the position where the Nth line starts and the (M+1)th line starts (or the Mth ends), doing it in the most streamlined way possible, then 2) create a read stream with an offset equal to the start of the Nth line, thus skipping the emission of all previous lines by readline altogether, and then 3) consume such a shortened stream the same way as we did before, with readline. – Hero Qu Dec 09 '19 at 10:14
  • But how can I skip all previous lines if their length is not fixed? I cannot figure this out, so I'm thinking of splitting the large dataset into multiple subsets. – plexus Dec 10 '19 at 11:27
  • What I meant was to first find the absolute byte position inside the file where the Nth line starts. Then one would initiate the read stream with an offset equal to that number of bytes. It is as if we had a file that starts right at the Nth line, so we don't waste time going through lines 1 to (N-1) with readline. This would save some time, provided that finding this offset is faster than walking from the beginning of the file with readline. That was the idea. – Hero Qu Dec 10 '19 at 22:59
  • Have you tried using `func = () => {}` to see if it then works fast? – Hero Qu Dec 10 '19 at 23:06
  • I think the problem is that the line listener is not destroyed instantly. func is not called outside the needed line range. – plexus Dec 11 '19 at 11:08