
I would like a complete working example that does the following in Haskell:

Read a very large sequence (more than 1 billion elements) of 32-bit int values from a binary file into an appropriate container (certainly not a list, for performance reasons), double each number that is less than 1000 (decimal), and then write the resulting 32-bit int values to another binary file. I would rather not read the entire contents of the binary file into memory at once; I want to read it one chunk after another.

I am confused because I could find very little documentation about this. Data.Binary, ByteString, Word8 and so on just add to the confusion. There is a pretty straightforward solution to such problems in C/C++: take an array (e.g. of unsigned int) of the desired size, use the read/write library calls, and be done with it. In Haskell it didn't seem so easy, at least to me.

I'd appreciate it if your solution used the best standard packages available with mainstream Haskell (> GHC 7.10) and not obscure or obsolete ones.

I have read these pages:

https://wiki.haskell.org/Binary_IO

https://wiki.haskell.org/Dealing_with_binary_data

mntk123

3 Answers


If you're doing binary I/O, you almost certainly want ByteString for the actual input/output part. Have a look at the hGet and hPut functions it provides. (Or, if you only need strictly linear access, you can try using lazy I/O, but it's easy to get that wrong.)

Of course, a byte string is just an array of bytes; your next problem is interpreting those bytes as characters / integers / doubles / whatever else they're supposed to be. There are a couple of packages for that, but Data.Binary seems to be the most mainstream one.

The documentation for binary seems to want to steer you towards using the Binary class, where you write code to serialise and deserialise whole objects. But you can use the functions in Data.Binary.Get and Data.Binary.Put to deal with individual items. There you will find functions such as getWord32be (get Word32 big-endian) and so forth.
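As a rough illustration of using those functions on individual items, here is a minimal sketch that decodes and encodes a single little-endian `Word32`. The names `decodeOne` and `encodeOne` are made up for this example; only `runGet`, `getWord32le`, `runPut` and `putWord32le` come from the `binary` package.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import Data.Word (Word32)

-- Decode the first four bytes of a lazy ByteString as a little-endian Word32.
-- Note: runGet calls 'error' if fewer than four bytes are available.
decodeOne :: BL.ByteString -> Word32
decodeOne = runGet getWord32le

-- Encode a Word32 as four little-endian bytes.
encodeOne :: Word32 -> BL.ByteString
encodeOne = runPut . putWord32le

main :: IO ()
main = do
  let bytes = BL.pack [0xE8, 0x03, 0x00, 0x00]  -- 1000 in little-endian
  print (decodeOne bytes)                        -- prints 1000
  print (BL.unpack (encodeOne 2000))             -- prints [208,7,0,0]
```

Swap in `getWord32be` / `putWord32be` if the file stores big-endian values.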

I don't have time to write a working code example right now, but basically look at the functions I mention above and ignore everything else, and you should get some idea.

Now with working code:

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import Data.Binary.Put
import Control.Monad
import System.IO

main = do
  h_in  <- openFile "Foo.bin" ReadMode
  h_out <- openFile "Bar.bin" WriteMode
  replicateM 1000 (process_chunk h_in h_out)
  hClose h_in
  hClose h_out

chunk_size = 1000
int_size = 4

process_chunk h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  let ints1 = runGet (replicateM (chunk_size `div` int_size) getWord32le) bin1
  let ints2 = map (\ x -> if x < 1000 then 2*x else x) ints1
  let bin2 = runPut (mapM_ putWord32le ints2)
  BIN.hPut h_out bin2

This, I believe, does what you asked for. It reads 1000 chunks of chunk_size bytes, converts each one into a list of Word32 (so it only ever has chunk_size / 4 integers in memory at once), does the calculation you specified, and writes the result back out again.

Obviously if you did this "for real" you'd want EOF checking and such.
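One possible way to add the EOF handling is sketched below, under a few assumptions of mine: it reads with the strict `Data.ByteString.hGet` and stops when that returns an empty chunk (rather than calling `hIsEOF`), it takes the file names from the command line, and it copies any trailing bytes that do not form a whole 32-bit value unchanged. The names `chunkBytes`, `step`, `processChunk` and `convertFile` are illustrative only.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import Control.Monad (replicateM, unless)
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO

-- Bytes per read; a multiple of 4 so values never straddle two chunks.
chunkBytes :: Int
chunkBytes = 64 * 1024

-- Double every value below 1000, as in the question.
step :: Word32 -> Word32
step x = if x < 1000 then 2 * x else x

-- Transform one chunk. Trailing bytes that do not form a whole
-- 32-bit value are passed through unchanged (an arbitrary choice).
processChunk :: BS.ByteString -> BS.ByteString
processChunk chunk = BS.append (BL.toStrict encoded) leftover
  where
    n                 = BS.length chunk `div` 4
    (whole, leftover) = BS.splitAt (4 * n) chunk
    ints              = runGet (replicateM n getWord32le) (BL.fromStrict whole)
    encoded           = runPut (mapM_ (putWord32le . step) ints)

-- Keep reading until hGet returns an empty chunk, which signals EOF.
loop :: Handle -> Handle -> IO ()
loop hIn hOut = do
  chunk <- BS.hGet hIn chunkBytes
  unless (BS.null chunk) $ do
    BS.hPut hOut (processChunk chunk)
    loop hIn hOut

-- withBinaryFile closes both handles even if an exception is thrown.
convertFile :: FilePath -> FilePath -> IO ()
convertFile inPath outPath =
  withBinaryFile inPath ReadMode $ \hIn ->
    withBinaryFile outPath WriteMode $ \hOut ->
      loop hIn hOut

main :: IO ()
main = do
  args <- getArgs
  case args of
    [inPath, outPath] -> convertFile inPath outPath
    _                 -> hPutStrLn stderr "usage: convert INPUT OUTPUT"
```

Only one chunk is held in memory at a time, so memory use does not depend on the size of the input file.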

MathematicalOrchid
  • 6 libraries imported! "..you'd want EOF checking.." - I wonder: does EOF checking need some other packages like Data.Binary.EOF or some exception Monad? and just to add that _simple EOF check_ one needs to add another 100 lines of code ? I hope it's not so dire. I am trying to add EOF checking, once able to accomplish that _feat_, I will mark your solution accepted. Sadly, the official documentation doesn't provide a single working example. – mntk123 Aug 28 '15 at 02:47
  • another point, is this a standard solution? because somewhere I read that if you want to do any serious IO in Haskell, you should not use System.IO as it's unreliable. – mntk123 Aug 28 '15 at 04:27
  • It turns out the import of `Data.Word` isn't actually needed in this particular example. You can also remove `Control.Monad` if you stop using `replicateM`. You can check for EOF using `hIsEOF`, which is in `System.IO` (which is already imported). It's just one function call. Oh, and I have *no idea* where you got the impression that `System.IO` is "unreliable". Perhaps whoever said that was talking about a specific function *within* that module? – MathematicalOrchid Aug 28 '15 at 07:09
  • (1) "6 libraries imported!" - six modules, but only three libraries/packages: `base`, `bytestring` and `binary`. All of them are bundled with GHC, so they are as standard as you'll get. In any case, small modules and features factored out small libraries is the norm rather than the exception in Haskell, even for essential things such as what `bytestring` provides. (2) On EOF, beyond `hIsEOF` you may also find `Control.Exception` useful. It offers familiar tools such as `try`, `catch` and `finally` in convenient forms. – duplode Aug 28 '15 at 07:56
  • (3) On "Is `System.IO` unreliable?", you probably read about *lazy* IO, which refers to specific functions such as `readFile` and `hGetContents`. They aren't used in this solution or in user5402's (but they are in blaze's). Lazy IO is convenient, but it has some pitfalls. For a nuanced discussion, see [this question](http://stackoverflow.com/questions/5892653). (Note that the ecosystem has evolved since 2011. Modern alternatives to lazy IO include the `pipes`, `conduit` and `io-streams` libraries. There is no need to check them right now - being aware of the issue will do for the moment.) – duplode Aug 28 '15 at 08:15
  • "Contrast this with lazy I/O, which provides for constant memory usage, but at the expense of deterministic resource usage. ... It is highly recommended to use Data.Conduit.Binary for real-world use cases." [fpcomplete](https://www.fpcomplete.com/school/to-infinity-and-beyond/pick-of-the-week/conduit-overview) As usual, any simple to get started with example is not given :(. Maybe the lazy IO concern doesn't apply to my example (or may be it does) I don't know. – mntk123 Aug 28 '15 at 08:26
  • Okay, I got the EOF part working. So accepting your answer. Though I am facing another problem with EOF. I will ask another question for it. – mntk123 Aug 28 '15 at 08:32
  • @mntk123 This solution doesn't use lazy IO, so that is not a concern here (see also my comments above - I forgot to ping you in them). Streaming libraries (`conduit`, `pipes`, etc.) have other advantages beyond avoiding the lazy IO gotchas (e.g. processing steps are easier to compose into a complex pipeline), though I guess such advantages would only be noticeable in use cases somewhat trickier than yours. – duplode Aug 28 '15 at 08:58
  • @MathematicalOrchid Sorry to unaccept the answer, but the replicateM is too confusing for me. Maybe I am dumb, but I cannot understand how and why you put 1000 in the `replicateM 1000 (process_chunk h_in h_out)`. Does it have anything to do with chunk_size? Please explain it. – mntk123 Aug 28 '15 at 15:26
  • @mntk123 `replicateM 1000` means "read 1000 chunks". Whereas `chunk_size` is how many bytes are in a chunk. The numbers aren't special; I just picked example values. In reality, you probably wouldn't read 1000 chunks, you'd write a loop that actually checks for EOF or whatever. – MathematicalOrchid Aug 28 '15 at 15:41

The best way to work with binary I/O in Haskell is to use bytestrings. Lazy bytestrings read the file in chunks on demand, so you don't need to manage the buffering yourself.

The code below assumes that the chunk size is a multiple of 4 bytes, i.e. of one 32-bit value (which it is).

module Main where

import Data.Word
import Control.Monad
import Data.Binary.Get
import Data.Binary.Put
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString as BStrict

-- Convert one bytestring chunk to the list of integers
-- and append the result of conversion of the later chunks.
-- It actually appends only closure which will evaluate next
-- block of numbers on demand.
toNumbers :: BStrict.ByteString -> [Word32] -> [Word32]
toNumbers chunk rest = chunkNumbers ++ rest
    where
    getNumberList = replicateM (BStrict.length chunk `div` 4) getWord32le
    chunkNumbers = runGet getNumberList (BS.fromStrict chunk)

main :: IO()
main = do
    -- every operation below is done lazily, consuming input as necessary
    input <- BS.readFile "in.dat"
    let inNumbers = BS.foldrChunks toNumbers [] input
    let outNumbers = map (\x -> if x < 1000 then 2*x else x) inNumbers
    let output = runPut (mapM_ putWord32le outNumbers)
    -- Here the lazy bytestring output is evaluated and written chunk
    -- by chunk, pulling data from the input file, decoding, processing
    -- and encoding it back one chunk at a time
    BS.writeFile "out.dat" output
blaze

Here is a loop to process one line at a time from stdin:

import System.IO

loop = do b <- hIsEOF stdin
          if b then return ()
               else do str <- hGetLine stdin
                       let str' = ...process str...
                       hPutStrLn stdout str'
                       loop  -- recurse to handle the next line

Now just replace hGetLine with something that reads 4 bytes, etc.

Here is the I/O section for Data.ByteString:

https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#g:29
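As a hedged sketch of what "something that reads 4 bytes" could look like, the version below swaps `hGetLine` for `Data.ByteString.hGet h 4` and assembles the value by hand with `Data.Bits`. It assumes little-endian data; the helper names `fromBytesLE`, `toBytesLE` and `process` are hypothetical, and a short tail of fewer than 4 bytes is simply copied through.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32)
import System.IO

-- Assemble bytes (least significant first) into a Word32.
fromBytesLE :: BS.ByteString -> Word32
fromBytesLE = BS.foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- Split a Word32 into four bytes, least significant first.
toBytesLE :: Word32 -> BS.ByteString
toBytesLE w = BS.pack [fromIntegral (w `shiftR` (8 * i)) | i <- [0 .. 3]]

-- The transformation from the question.
process :: Word32 -> Word32
process x = if x < 1000 then 2 * x else x

loop :: Handle -> Handle -> IO ()
loop hIn hOut = do
  eof <- hIsEOF hIn
  if eof
    then return ()
    else do
      bytes <- BS.hGet hIn 4
      let out | BS.length bytes == 4 = toBytesLE (process (fromBytesLE bytes))
              | otherwise            = bytes  -- short tail: copy unchanged
      BS.hPut hOut out
      loop hIn hOut

main :: IO ()
main = do
  hSetBinaryMode stdin True   -- no newline or encoding translation
  hSetBinaryMode stdout True
  loop stdin stdout
```

Reading 4 bytes per call keeps the loop simple, but for a billion values it is likely to be much slower than reading larger chunks and decoding many values at once.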

ErikR
  • hGetLine seems to read a line. The files I wish to read may have only one line of a billion integers placed back to back. In other words, a `newline`, if it appears, *is* actually part of a data value (integer). – mntk123 Aug 27 '15 at 16:16
  • I said: _replace `hGetLine` with something that reads 4 bytes_ and you should be able to find such a function in the link I provided. – ErikR Aug 27 '15 at 16:21
  • Sorry, but that's precisely what I am confused about. :( Can you modify your code so that it does that magic something? – mntk123 Aug 27 '15 at 16:41
  • How about trying the [`hGet`](https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#v:hGet) function listed on that page. – ErikR Aug 27 '15 at 18:30