
I am iterating through the lines in a file using Node.js with CoffeeScript and the following function:

fs = require 'fs'

each_line_in = (stream, func) ->
    fs.stat stream.path, (err, stats) ->
        previous = []
        stream.on 'data', (d) ->
            start = cur = 0
            for c in d
                cur++
                if c == 10  # 10 is the byte value of '\n'
                    previous.push(d.slice(start, cur))
                    func previous.join('')
                    previous = []
                    start = cur
            previous.push(d.slice(start, cur)) if start != cur

Is there a better way to do this without reading the entire file into memory? And by "better" I mean more succinct, built into Node.js, faster, or more correct. If I were writing Python, I would do something like this:

def each_line_in(file_obj, func):
    [ func(l) for l in file_obj ]

I saw this question which uses Peteris Krumins's "lazy" module, but I would like to accomplish this without adding an external dependency.

aaronstacy
  • When you say "iterate through lines," do you mean "keep reading until you hit a `\n`" or "keep reading until you've read 10 characters" (as your example code does)? – Trevor Burnham Jun 12 '11 at 17:14
  • The code above certainly does not stop after reading just 10 characters. If you tried running the code you would probably see that 10 is the ASCII code for a newline. – aaronstacy Jun 13 '11 at 04:06

2 Answers


Here's a fairly efficient approach:

fs = require 'fs'

eachLineIn = (filePath, func) ->

  blockSize = 4096
  buffer = new Buffer(blockSize)
  fd = fs.openSync filePath, 'r'
  lastLine = ''

  callback = (err, bytesRead) ->
    throw err if err

    # Split this block into lines; the first piece completes the previous
    # block's partial line, and the last piece is carried over as lastLine.
    lines = buffer.toString('utf8', 0, bytesRead).split '\n'
    lines[0] = lastLine + lines[0]
    [completeLines..., lastLine] = lines
    func(line) for line in completeLines

    if bytesRead is blockSize
      # Read the next block only after this one has been processed, so the
      # shared buffer isn't overwritten while we're still parsing it.
      fs.read fd, buffer, 0, blockSize, null, callback
    else
      # Final (short) block: emit any trailing line without a '\n' and clean up.
      func lastLine if lastLine.length
      fs.closeSync fd
    return

  fs.read fd, buffer, 0, blockSize, 0, callback
  return

You should benchmark this on your hardware and OS to find the optimal `blockSize` for large files.
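
For example, a rough timing sketch (the './big.log' path is a placeholder, and it assumes the eachLineIn function above is already defined):

start = Date.now()
count = 0
eachLineIn './big.log', (line) -> count++
process.on 'exit', ->
  console.log "#{count} lines in #{Date.now() - start} ms"

Re-running this with different `blockSize` values gives a quick comparison.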

Note that this assumes that lines are separated by `\n` only. If you're not sure what line endings your files use, split on a regex instead, e.g.:

.split(/\r\n|\r|\n/)
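
A quick sanity check of that split, on a made-up string with mixed line endings:

console.log "a\r\nb\rc\nd".split(/\r\n|\r|\n/)   # [ 'a', 'b', 'c', 'd' ]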
Trevor Burnham
  • This does not address my question; it is less succinct, likely slower since the "toString" and "split" methods require reading through the length of the buffer, and no more correct, since you've incorrectly interpreted my code (you would see this if you tried running it). Because of this I am down-voting your answer. – aaronstacy Jun 13 '11 at 04:07
  • Fair enough, but I just ran a benchmark: Your function took 11,101ms to read a file, while mine took 1,647ms on the same file. Granted, its behavior differs from your code slightly: It takes a file path rather than a `ReadableStream`, and the lines it provides don't include the trailing `\n`. But I wouldn't dismiss the overall approach. It's also no less succinct, if you discard the extra whitespace, the error-throwing, and the explicit `return`s. – Trevor Burnham Jun 13 '11 at 17:48

This is a succinct version using a `ReadStream`, e.g. `stream = fs.createReadStream(filepath)`:

for_each_line = (stream, func) ->
  last = ""
  stream.on('data', (chunk) ->
    # Prepend the leftover from the previous chunk, split into lines, and
    # carry the final (possibly partial) piece over as the new leftover.
    lines = (last + chunk).split("\n")
    [lines..., last] = lines
    for line in lines
      func(line)
  )
  stream.on('end', () ->
    # Emit whatever remains after the last chunk.
    func(last)
  )

Options to `createReadStream` can set the buffer size and encoding as needed.
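
For example (a sketch; the './app.log' path is made up, and the chunk-size option is `highWaterMark` in current Node, `bufferSize` in older versions):

fs = require 'fs'

stream = fs.createReadStream './app.log', encoding: 'utf8', highWaterMark: 64 * 1024
for_each_line stream, (line) -> console.log line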

This strips the `\n`, but that can be added back if needed. It also handles the final line, though that line will be empty if the file ends with a `\n`.
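
For example, to put the `\n` back before handing each line to your handler (handle_line is a made-up placeholder):

for_each_line stream, (line) -> handle_line(line + '\n')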

I don't see much difference in the timing of these three versions.

hpaulj