7

So I have a roughly 8 MB file of lines, each with 6 ints separated by a space.

My current method for parsing this is:

tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

toInts :: String -> (Int, Int, Int, Int, Int, Int)
toInts line =
        tuplify6 $ map read stringNumbers
        where stringNumbers = split " " line  -- 'split' is a space-splitting helper (Prelude's 'words' would work similarly here)

and mapping toInts over

liftM lines . readFile

which will return me a list of tuples. However, when I run this, it takes nearly 25 seconds to load and parse the file. Is there any way I can speed this up? The file is just plain text.
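
Roughly, the whole thing fits together like this (a sketch of what I'm doing, not the exact program; the code that consumes the tuples is omitted):

import Control.Monad (liftM)

-- read the file, split it into lines, and parse each line into a 6-tuple
parseFile :: FilePath -> IO [(Int, Int, Int, Int, Int, Int)]
parseFile path = liftM (map toInts . lines) (readFile path)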

DantheMan
  • Could you provide a bit more information: the whole working program, the input, how you run it, and whether you compile it (with optimization) or run it in `ghci`? Do you know about `Data.ByteString` and `Data.Vector`? Also, `read` is quite slow, at least that is what I have heard. – epsilonhalbe Jul 03 '12 at 21:58
  • See also http://stackoverflow.com/questions/8366093/how-do-i-parse-a-matrix-of-integers-in-haskell/8366642 – Thomas M. DuBuisson Jul 03 '12 at 22:06

1 Answer

8

You can speed it up by using ByteStrings, e.g.

module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Char

main :: IO ()
main = do
    args <- getArgs
    mapM_ doFile args

doFile :: FilePath -> IO ()
doFile file = do
    bs <- C.readFile file
    -- skip leading non-digit characters, then parse the rest into 6-tuples
    let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
    print (length tups)

buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
-- six Ints collected: emit them as a tuple (in reverse order of reading) and start over
buildTups 6 acc bs = tuplify6 acc : buildTups 0 [] bs
buildTups k acc bs
    | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
    | otherwise = case C.readInt bs of
                    Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
                    Nothing -> error ("No Int found: " ++ show (C.take 100 bs))

tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

runs pretty fast:

$ time ./fileParse IntList 
200000

real    0m0.119s
user    0m0.115s
sys     0m0.003s

for an 8.1 MiB file.

On the other hand, using Strings and your conversion (with a couple of seqs to force evaluation) also took only 0.66s, so the bulk of the time seems to be spent not parsing, but working with the result.

Oops, I missed a seq, so the reads were not actually evaluated in the String version. Fixing that, String + read takes about four seconds, and a bit over one second with the custom Int parser from @Rotsor's comment

foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0

so parsing apparently did take a significant amount of the time.
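
Roughly, plugging that parser into the String-based conversion would look like this (a sketch with illustrative names, not the exact program I timed; it only handles non-negative decimal numbers):

import Data.List (foldl')

-- accumulate decimal digits directly instead of going through the generic 'read'
parseInt :: String -> Int
parseInt = foldl' (\a c -> 10 * a + fromEnum c - fromEnum '0') 0

toIntsFast :: String -> (Int, Int, Int, Int, Int, Int)
toIntsFast = tuplify6 . map parseInt . words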

Daniel Fischer
  • Thanks. I forgot about Haskell's lazy evaluation, so I was wrong about where the timing issue came from. But thanks for the other method also! – DantheMan Jul 03 '12 at 22:24
  • Can you please show the whole program that achieves 0.66s with `read`? I've [asked a similar question](http://stackoverflow.com/questions/7510078/why-is-char-based-input-so-much-slower-than-the-char-based-output-in-haskell) before and the answer was "read is slow". Here, merely replacing `read` with `foldl (\a c -> a*10 + fromEnum c - fromEnum '0') 0` gives 6-fold improvement in speed, showing that the most time was indeed taken by parsing. How did you manage to improve on that? – Rotsor Jul 04 '12 at 12:24