How does hGetContents achieve memory efficiency?

Question

I want to add Haskell to my toolbox so I'm working my way through Real World Haskell.

In the chapter in Input and Output, in the section on hGetContents, I came across this example:

import System.IO
import Data.Char(toUpper)

main :: IO ()
main = do 
    inh <- openFile "input.txt" ReadMode
    outh <- openFile "output.txt" WriteMode
    inpStr <- hGetContents inh
    let result = processData inpStr
    hPutStr outh result
    hClose inh
    hClose outh

processData :: String -> String
processData = map toUpper

Following this code sample, the authors go on to say:

Notice that hGetContents handled all of the reading for us. Also, take a look at processData. It's a pure function since it has no side effects and always returns the same result each time it is called. It has no need to know—and no way to tell—that its input is being read lazily from a file in this case. It can work perfectly well with a 20-character literal or a 500GB data dump on disk. (N.B. Emphasis is mine)

My question is: how does hGetContents or its resultant values achieve this memory efficiency without – in this example – processData "being able to tell", and still maintain all benefits that accrue to pure code (i.e. processData), specifically memoization?

<- hGetContents inh returns a string so inpStr is bound to a value of type String, which is exactly the type that processData accepts. But if I understand the authors of Real World Haskell correctly, then this string isn't quite like other strings, in that it's not fully loaded into memory (or fully evaluated, if such a things as not-fully-evaluated strings exists...) by the time of the call to processData.

Therefore, another way to ask my question is: if inpStr is not fully evaluated or loaded into memory at the time of the call to processData, then how can it be used to lookup if a memoized call to processData exists, without first fully evaluating inpStr?

Are there instances of type String that each behave differently but cannot be told apart at this level of abstraction?

score 4 · Accepted Answer · answered Oct 17 '13 at 07:41

The String returned by hGetContents is no different from any other Haskell string. In general, Haskell data is not fully evaluated unless the programmer has taken extra steps to ensure that it is (e.g. seq, deepseq, bang patterns).

Strings are defined as (essentially)

data List a = Nil | Cons a (List a) -- Nil === [], Cons === :
type String = List Char

This means that a string is either empty, or a single character (the head) and another string (the tail). Due to laziness, the tail may not exist in memory, and may even be infinite. Upon processing a String, a Haskell program will typically check if it's Nil or Cons, then branch and proceed as necessary. If the function doesn't need to evaluate the tail, it won't. This function, for example, only needs to check the initial constructor:

safeHead :: String -> Maybe Char
safeHead [] = Nothing
safeHead (x:_) = Just x

This is a perfectly legitimate string

allA's = repeat 'a' :: String

that is infinite. You can manipulate this string sensibly, however if you try to print all of it, or calculate the length, or any sort of unbounded traversal your program won't terminate. But you can use functions like safeHead without any problem whatsoever, and even consume some finite initial substring.

Your intuition that something strange is happening is correct, however. hGetContents is implemented using the special function unsafeInterleaveIO, which is essentially a compiler hook into IO behavior. The less said about this, the better.

You're correct that it would be difficult to check if a memoized call to a function exists without having the argument fully evaluated. However, most compilers don't perform this optimization. The problem is that it's very difficult for a compiler to determine when it's worthwhile to memoize calls, and very easy to consume all of your memory by doing so. Fortunately there are several memoizing libraries you can use to add memoization when appropriate.

Thanks so much for your answer. Just to satisfy my curiosity, could you tell me if the GHC performs any memoization, or does it not perform any optimizations of that kind for the reasons you've just mentioned? I'd guess that memoization of functions with atomic arguments (i.e. no data structures such as lists) would be rather more easy to implement. — Marcel, Oct 17 '13 at 10:18
GHC does not perform any automatic memoization of functions at all. It does however ensure that the values of variables are only computed once. — kqr, Oct 17 '13 at 15:29
@MarcelOomens: it's hard to answer generally, however branches on simple enumerations (e.g. `data Foo = Foo | Bar | Baz`) often turn into memoized calls. This appears to be a consequence of the usual GHC optimization transforms, I don't think there's any special support for it. — John L, Oct 18 '13 at 00:57

How does hGetContents achieve memory efficiency?

1 Answers1