how to improve this very slow and inefficient Haskell program to process binary files byte by byte?

Question

I am trying to write a hexdump like program in Haskell. I wrote the following program, I am glad that it works and gives desired output but it is very slow and inefficient. It was adapted from the program given in this answer.

I ran the program with a sample file, and it takes about 1 minute to process that less than 1MB file. The standard Linux hexdump program does the job in less about a second. All I want to do in the program is read->process->write all individual bytes in a bytestring.

Here is the question - How to efficiently read/process/write the bytestring (byte by byte, i.e. without using any other functions like getWord32le, if that's what is needed)? I want to do arithmetical and logical operations on each individual byte not necessarily on the Word32le or a group of bytes like that. I didn't find any data type like Byte.

Anyway, here is the code I wrote, which runs successfully on ghci (version 7.4) -

module Main where

import Data.Time.Clock
import Data.Char
import qualified Data.ByteString.Lazy as BIN
import Data.ByteString.Lazy.Char8
import Data.Binary.Get
import Data.Binary.Put
import System.IO
import Numeric (showHex, showIntAtBase)

main = do
  let infile = "rose_rosebud_flower.jpg"
  let outfile = "rose_rosebud_flower.hex"
  h_in  <- openFile infile ReadMode
  System.IO.putStrLn "before time: "
  t1 <- getCurrentTime >>= return . utctDayTime
  System.IO.putStrLn $ (show t1)
  process_file h_in outfile
  System.IO.putStrLn "after time: "
  t2 <- getCurrentTime >>= return . utctDayTime
  System.IO.putStrLn $ (show t2)
  hClose h_in

process_file h_in outfile = do 
  eof <- hIsEOF h_in
  if eof 
      then return ()
      else do  bin1 <- BIN.hGet h_in 1
               let str = (Data.ByteString.Lazy.Char8.unpack) bin1
               let hexchar = getHex str
               System.IO.appendFile outfile hexchar
               process_file h_in outfile

getHex (b:[]) = (tohex $ ord b) ++ " " 
getHex _ = "ERR "

tohex d = showHex d ""

When I run it on the ghci I get

*Main> main
before time: 
23254.13701s
after time: 
23313.381806s

Please provide a modified (but complete working) code as answer and not just the list of names of some functions. Also, don't provide solutions that use jpeg or other image processing libraries as I am not interested in image processing. I used the jpeg image as example non-text file. I just want to process data byte by byte. Also don't provide links to other sites (especially to the documentation (or the lack of it) on the Haskell site). I cannot understand the documentation for bytestring and for many other packages written on the Haskell site, their documentation (which is just type signatures collected on a page, in most cases) seems only meant for the experts, who already understand most of the stuff. If I could figure out the solution by reading their documentation or even the much advertised (real world haskell) RWH book, I'd not have asked this question in the first place.

Sorry for the seeming rant, but the experience with Haskell is frustrating as compared to many other languages, especially when it comes to doing even simple IO as the Haskell IO related documentation with small complete working examples is almost absent.

score 12 · Accepted Answer · answered Aug 30 '15 at 09:09

Your example code reads one byte at a time. That's pretty much guaranteed to be slow. Better still, it reads a 1-byte ByteString and then immediately converts it to a list, negating all the benefits of ByteString. Best of all, it writes to the output file by the slightly strange method of opening the file, appending a single character, and then closing the file again. So for every individual hex character written out, the file has to be completely opened, wound to the end, have a character appended, and then flushed to disk and closed again.

I'm not 100% sure what you're trying to achieve here (i.e., trying to learn how stuff works vs trying to make a specific program work), so I'm not sure exactly how best to answer your question.

If this is your very first foray into Haskell, starting with something I/O-centric is probably a bad idea. You would be better off learning the rest of the language before worrying about how to do high-performance I/O. That said, let me try to answer your actual question...

First, there is no type named "byte". The type you're looking for is called Word8 (if you want an unsigned 8-bit integer) or Int8 (if you want a signed 8-bit integer — which you probably don't). There are also types like Word16, Word32, Word64; you need to import Data.Word to get them. Similarly, Int16, Int32 and Int64 live in Data.Int. The Int and Integer types are automatically imported, so you don't need to do anything special for those.

A ByteString is basically an array of bytes. A [Word8], on the other hand, is a single-linked list of individual bytes which may or may not be computed yet — much less efficient, but far more flexible.

If literally all you want to do is apply a transformation to every single byte, independent of any other byte, then the ByteString package provides a map function which will do exactly that:

map :: (Word8 -> Word8) -> ByteString -> ByteString

If you just want to read from one file and write to another, you can do that using so-called "lazy I/O". This is a neat dodge where the library handles all the I/O chunking for you. It has a few nasty gotchas though; basically revolving around it being hard to know exactly when the input file will get closed. For simple cases, that doesn't matter. For more complicated cases, it does.

So how does it work? Well, the ByteString library has a function

readFile :: FilePath -> IO ByteString

It looks like it reads the entire file into a giant ByteString in memory. But it doesn't. It's a trick. Actually it just checks that the file exists, and opens it for reading. When you try to use the ByteString, in the background the file invisibly gets read into memory as you process it. That means you can do something like this:

main = do
  bin <- readFile "in_file"
  writeFile "out_file" (map my_function bin)

This will read in_file, apply my_function to every individual byte of the file, and save the result into out_file, automatically doing I/O in large enough chunks to give good performance, but never holding more than one chunk in RAM at once. (The my_function part must have type Word8 -> Word8.) So this is both very simple to write, and should be extremely fast.

Things get fun if you don't want to read the entire file, or want to access the file in random order, or anything complicated like that. I am told that the pipes library is the thing to look at, but personally I've never used it.

In the interests of a complete working example:

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Numeric

main = do
  bin <- BIN.readFile "in_file"
  BIN.writeFile "out_file" (BIN.concatMap my_function bin)

my_function :: Word8 -> BIN.ByteString
my_function b =
  case showHex b "" of
    c1:c2:_ -> BIN.pack [fromIntegral $ fromEnum $ c1 , fromIntegral $ fromEnum $ c2]   -- Get first two chars in hex string, convert Char to Word8.
    c2:_    -> BIN.pack [fromIntegral $ fromEnum $ '0', fromIntegral $ fromEnum $ c2]   -- Only one digit. Assume first digit is zeor.

Note that because one byte becomes two hex digits, I've used the ByteString version of concatMap, which allows my_function to return a whole ByteString rather than just a single byte.

`my_function` can be defined in a simpler way as `BL8.pack (showHex b "" ++ " ")`, where `BL8` is `Data.ByteString.Lazy.Char8` (the `ByteString` type is the same across both lazy modules). — duplode, Aug 30 '15 at 09:44
I accept it; it takes about 5 seconds for that file. your explanation is also good. I am new to Haskell but I learnt prelude functions like map etc. The trouble is I am supposed to use it for some file processing (which involves a lot of IO) and there it seems like the prelude functions are almost useless. god knows how could I have proceeded without stackoverflow. strings and IO in Haskell are quite confusing (char8, word8 and what not) and on top of it, for each of these there is a different module. The only place where I get useful beginner level _real world_ Haskell examples is SO. — mntk123, Aug 30 '15 at 10:24
@MathematicalOrchid "Best of all, it writes to the output file by the slightly strange method of opening the file, appending a single character, and then closing the file again." Did I do those things explicitly? and the documentation doesn't tell me all these things you are telling. [see this hoogle entry] (https://www.haskell.org/hoogle/?hoogle=appendFile) and [this doc entry] (http://hackage.haskell.org/package/base-4.8.1.0/docs/Prelude.html#v:appendFile) don't seem to tell anything like that about appendFile. — mntk123, Aug 30 '15 at 11:08
I'm not sure why you say prelude functions are useless. The concepts are the same (as mentioned in the answer you can find `map` which works as you expected on a `ByteString` as on a list). So if you learn the important concepts then you should be fine. I'm saying this as a newbie who did a [quite I/O heavy](https://github.com/jtcwang/HaskSplit) program as my first Haskell project too. It was frustrating (laziness, needing to learn so many things) but not being able to transfer my "Prelude" knowledge was not one of them. — Jacob Wang, Aug 30 '15 at 11:09
@mntk123, for your second question about `apendFile`, when the function takes a `FilePath` instead of an opened file handle then you can probably deduce that it opens the file using the provided `FilePath`, write to it, then close it down. (This applies to pretty much all `System.IO` functions) — Jacob Wang, Aug 30 '15 at 11:16
@jtcwang I looked at your code `HaskSplit` it's impressive, may be I will be able to understand it some day, but already a problem I am facing is you used `import GHC.Word` in `Split.hs`, why not `Data.Word`? and how to decide this? it is very confusing. again this is just one example; the string and IO world in Haskell is very scary and without IO you cannot do any significant work and the Haskell people always tell the newcomers they should keep IO away. I wonder if there are any interactive games written in Haskell or not. I learned to program by writing simple interactive games. — mntk123, Aug 30 '15 at 15:01
@jtcwang I say that _prelude functions are useless_ as most of the knowledge you get seems irrelevant e.g. pattern matching you learn on prlude String doesn't work on ByteString. there are some other issues too. — mntk123, Aug 30 '15 at 15:15
@mntk123 (1) "you used import GHC.Word in Split.hs, why not Data.Word?" `Data.Word` would have worked just as well. You never need to import the `GHC.*` modules (unless you are e.g. working on the compiler itself), as there are more user-friendly versions of them elsewhere that you can use instead (in this case, `Data.Word`). (2) "I wonder if there are any interactive games written in Haskell or not." Sure there are! :) [Random example using OpenGL](https://github.com/rtperson/Haskell-Pong). — duplode, Aug 30 '15 at 18:51
@mntk123 (3) The underlying point in everyone's comments is that to find your way around Haskell libraries you need to get used to squeezing as much information as possible from type signatures, as well as recognising common themes in them. I won't deny that the various string types can be confusing at first, but if you focus on the types of the functions you will note they all can be used in quite similar ways. (BTW, there is also `Text`, a string type which is a little slower than `ByteString`, though still much faster than `String`, and with full Unicode support for working with text data.) — duplode, Aug 30 '15 at 19:01

how to improve this very slow and inefficient Haskell program to process binary files byte by byte?

1 Answers1