Why is this simple text analysis program so slow?

Question

This is my code for counting lines and words:

import System.IO
import Data.List
main = do
        hSetBinaryMode stdin True
        interact $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n")
                   . foldl' (\(w,l) r-> w `seq` l `seq` (w+length r ,succ l) ) (0,0)
                   . lines

This takes about 10 seconds to run on a file of about 100 megabytes. I compared it to similar programs in Lua (9s), awk (20s), and wc -l -c (0.6s).

Why is this code so slow? What could be the problem?

Try using ByteString : http://stackoverflow.com/questions/9746352/parsing-large-log-files-in-haskell — , Apr 02 '12 at 14:58

score 15 · Accepted Answer · answered Apr 02 '12 at 15:25

I/O with String is known to be less than fast in Haskell. The bytes read from the handle in general have to be converted to Unicode code points, and then a linked list is built from those. That's a lot of work causing a lot of allocation. In this case, the conversion to code points is a bit simpler, since you set stdin to binary mode, but the construction of the linked list of characters still takes a lot of time.

Another small factor is that your line count is using Integer, but that's minor and only plays a significant role when the I/O is up to speed.

If you need fast I/O, you have to use a type better suited for that. One possibility is using ByteString, for example

import Data.List
import qualified Data.ByteString.Lazy.Char8 as C
main = do
        txt <- C.getContents
        putStrLn $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). foldl' (\(w,l) r-> w `seq` l `seq` (w+C.length r ,succ l) ) (0,0) . C.lines $ txt

does the job on a 94MB file in 0.12s on my box (wc -l -c takes 0.06s), while the original using String took 4.4s. It can be optimised further,

{-# LANGUAGE BangPatterns #-}
import Data.List
import qualified Data.ByteString.Lazy.Char8 as C
main = do
        txt <- C.getContents
        putStrLn $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). loop 0 0 . C.lines $ txt

loop :: Int -> Int -> [C.ByteString] -> (Int,Int)
loop !w !l (ln:lns) = loop (w + fromIntegral (C.length ln)) (l+1) lns
loop w l _ = (w,l)

takes only 0.08s, which is decent enough for me to stop optimising there (a similar change for the String version brings the time down to 3.6s for that).

If Daniel's answer helped you, you should mark it accepted by clicking the check mark beside it, so that your question is marked as solved and others who read this question can see the answer helped you :) — ehird, Apr 02 '12 at 15:49

Why is this simple text analysis program so slow?

1 Answers1