
I want to select the n-th last line from a large text file (~10GB) in a Haskell program.

I found a way to get the n-th last line from an in-memory string:

```haskell
myText :: String
myText = "one\ntwo\nthree\nfour\nfive\nsix\nseven"  -- sample contents

myLen, n :: Int
myLen = 7  -- total number of lines
n = 3      -- one-based from the end

myLines :: [String]
myLines = lines myText

idx :: Int
idx = myLen - n

theLine :: String
theLine = head (drop idx myLines)

main :: IO ()
main = do
  putStrLn theLine
```

The documentation for the readFile function says it "reads the content lazily", so once readFile has reached the n-th last line, will it have stored all the preceding lines in memory (and then blow up, because I don't have that much memory)?

So, is readFile the right approach here? And how do I get the IO String result of readFile "in a lazy way" into a list of lines, so that I can then select the n-th last line?

Sridhar Ratnakumar
halloleo
  • I believe the most efficient way would be to read the file in e.g. 64kb blocks from the end, trying to find the nth newline character from the end, and then doing a final read from that newline to the next. You could do this with `hSeek` and the handle I/O functions in Data.ByteString. Even with lazy strings, reading from the front here is going to be terribly inefficient in a 10gb file. – Dogbert Jul 02 '22 at 03:54
  • 3
    If you write code that only performs a "single pass" over the list returned by `readFile`, that should run without having to store the whole list in memory. (Note that haskell native strings are somehow inefficient, and lazy bytestrings could be a better option). Or, more precisely, the lines will be stored in memory but will be garbage collected so everything should run in constant-ish space. However, while `head (drop 10 xs)` is single pass, `head (drop (length xs - 4) xs` is double-pass, and that will surely store the full list in memory. Beware. – chi Jul 02 '22 at 07:28
  • 2
    I wish there were a library for this kind of tasks. One which would allow opening a large text file as if in a text editor, without loading the whole file in RAM, and then moving around with editor-like commands ("go up 4 lines", search for "foo", etc.). Doing this by hand is cumbersome, especially if the text file is UTF8 encoded. – chi Jul 02 '22 at 07:33
  • 1
    Somewhat related: https://stackoverflow.com/questions/41654849/hseek-and-seekfromend-in-haskell – danidiaz Jul 02 '22 at 10:59
  • @chi Thanks for chiming in! Re `drop (length xs - 4) xs`, I know the number of lines ahead of time, as indicated with the separate variable in my in-memory string demo. I don't mind _one_ full pass through the file, just not storing all lines in memory. – halloleo Jul 03 '22 at 06:24
  • No need to load the entire file into memory: the `extra` package has a function [takeEnd](https://hackage.haskell.org/package/extra-1.7.10/docs/Data-List-Extra.html#v:takeEnd) which may enlighten. – beerboy Jul 03 '22 at 21:13

2 Answers


The question has several parts:

> The documentation for the readFile function says it "reads the content lazily", so once readFile has reached the n-th last line, will it have stored all the preceding lines in memory (and then blow up, because I don't have that much memory)?

Not necessarily. If you only iterate over the contents and produce a result, then the garbage collector should deallocate the contents.
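For instance, a single pass such as counting lines runs in roughly constant space, because each chunk of the lazy string becomes unreachable (and collectable) as soon as it has been consumed. A minimal sketch; the function name is just for illustration:

```haskell
-- Count the lines of a file in one pass over its lazy contents.
-- Memory stays roughly constant: characters already consumed by
-- `length` are unreachable and can be garbage collected.
countLines :: FilePath -> IO Int
countLines path = do
  contents <- readFile path
  pure (length (lines contents))
```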

> So, is readFile the right approach here?

My opinionated answer is that for a serious tool, readFile isn't the right approach, because "lazy IO" is a can of worms.

If it's for a quick-and-dirty script, then go ahead; but if performance is important, it is probably best to use lower-level calls to read strict ByteStrings, and, for your problem, to read directly from the end of the file and process that.
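That lower-level approach might look like the following sketch, in the spirit of Dogbert's comment above. The helper name, the 64 KiB chunk size, and the complete absence of error handling are assumptions for illustration, not part of either answer:

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO

-- Return the n-th last line (1-based from the end) by reading the
-- file backwards in 64 KiB chunks, so memory use is proportional to
-- the last few lines rather than the whole file.
nthLastLine :: Int -> FilePath -> IO (Maybe B.ByteString)
nthLastLine n path = withBinaryFile path ReadMode $ \h -> do
  size <- hFileSize h
  let chunkSize = 65536 :: Int
      go pos acc
        -- More than n newlines in acc guarantees the n-th last line
        -- is complete; pos == 0 means acc holds the whole file.
        | B.count '\n' acc > n || pos == 0 =
            let ls = B.lines acc
            in pure (if length ls >= n
                       then Just (ls !! (length ls - n))
                       else Nothing)
        | otherwise = do
            let step = fromIntegral (min pos (fromIntegral chunkSize))
            let pos' = pos - fromIntegral step
            hSeek h AbsoluteSeek pos'
            bytes <- B.hGet h step
            go pos' (bytes `B.append` acc)
  go size B.empty
```

For a 10 GB file this touches only the final chunks, instead of streaming 10 GB of data through the heap from the front.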

yairchu
  • Related writing on worm cans: [Haskell Lazy ByteString + read/write progress function](https://stackoverflow.com/q/6668716/791604). – Daniel Wagner Jul 04 '22 at 04:43

The following program will require only about as much memory as the longest n lines in the file being read:

```haskell
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt"
```

The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of this as operating in "one pass" of the file, looking at subsequent chunks of n consecutive lines until it finds the last chunk. Because it does not maintain any references to the contents of the file from before the chunk it's currently looking at, all of the file contents up to the current chunk can be garbage collected (and generally is, fairly soon).
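As a quick check of that sliding-window behaviour, `takeLast` works on ordinary lists too (the answer's helpers are repeated here so the snippet stands alone):

```haskell
-- The answer's helpers, repeated so this snippet is self-contained.
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

-- takeLast 3 [1 .. 10]  ==  [8, 9, 10]
-- takeLast 3 "hello"    ==  "llo"
```

`dropUnary` never sees a first list longer than its second (since `drop n as` is at most as long as `as`), which is why the missing pattern case is never reached.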

Daniel Wagner