I want to write a simple function which splits a ByteString into [ByteString] using '\n' as the delimiter. My attempt:

```haskell
import Data.ByteString

listize :: ByteString -> [ByteString]
listize xs = Data.ByteString.splitWith (=='\n') xs
```

This fails to type-check because '\n' is a Char rather than a Word8, which is what Data.ByteString.splitWith expects.

How do I turn this simple character into a Word8 that ByteString will play with?

Xander Dunn

1 Answer

You could just use the numeric literal 10, but if you want to convert the character literal you can use fromIntegral (ord '\n') (the fromIntegral is required to convert the Int that ord returns into a Word8). You'll have to import Data.Char for ord.
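
As a minimal sketch of that conversion (the name `newline` is just an illustrative helper, and the qualified import `B` is an assumption made to avoid clashes with the Prelude), the question's `listize` could look something like this:

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)
import Data.Char (ord)
import Data.Word (Word8)

-- The newline character as a raw byte. ord '\n' is the Int 10;
-- fromIntegral converts it to the Word8 that ByteString functions expect.
-- (Hypothetical helper name, not part of any library.)
newline :: Word8
newline = fromIntegral (ord '\n')

-- Split the input at every newline byte, as in the question's attempt.
listize :: ByteString -> [ByteString]
listize = B.splitWith (== newline)
```

Note that `splitWith` treats the byte as a separator, so input ending in a newline yields a trailing empty chunk.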

You could also import Data.ByteString.Char8, which offers functions for using Char instead of Word8 on the same ByteString data type. (Indeed, it has a lines function that does exactly what you want.) However, this is generally not recommended, as ByteStrings don't store Unicode codepoints (which is what Char represents) but instead raw octets (i.e. Word8s).
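
For comparison, here is a rough sketch of the Char8 route, assuming the input really is ASCII (or at least single-byte) text. The function names `listizeLines` and `listizeSplit` are made up for illustration:

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.ByteString (ByteString)

-- Uses the ready-made lines function; like Prelude's lines, it does not
-- produce an empty final chunk when the input ends with a newline.
listizeLines :: ByteString -> [ByteString]
listizeLines = BC.lines

-- Char8's split takes a Char directly, so '\n' works without conversion.
-- Unlike lines, a trailing newline leaves an empty final chunk.
listizeSplit :: ByteString -> [ByteString]
listizeSplit = BC.split '\n'
```

Both operate on the same `ByteString` type as `Data.ByteString`, so they can be mixed freely with the Word8-based functions.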

If you're processing textual data, you should consider using Text instead of ByteString.
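
One possible shape for the Text approach, assuming the bytes are UTF-8 encoded and the `text` package is available (the name `listizeText` is hypothetical):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)

-- Decode the raw bytes as UTF-8, then split into lines.
-- decodeUtf8' returns Left instead of throwing when the input
-- is not valid UTF-8, so the caller decides how to handle bad data.
listizeText :: B.ByteString -> Either UnicodeException [Text]
listizeText = fmap T.lines . decodeUtf8'
```

With Text, the delimiter is an ordinary Char again, so no Word8 conversion is needed.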

ehird
  • Oh, wow. Excellent. I will have to dig into character representations, I guess. I have no idea what the numerical literals for the characters are. Is there a list of them somewhere? – Xander Dunn Jan 23 '12 at 01:52
  • I am writing a program that will parse protein database files, which contain strings, integers, and doubles. The strings will mostly be used to identify the right items out of a list, whereas the ints and doubles will be used in math operations. I am not sure what class I should use for this. – Xander Dunn Jan 23 '12 at 01:55
  • You could use `ord` in GHCi to find out the codepoint numbers of characters :) I generally get Unicode data from [fileformat.info](http://www.fileformat.info/info/unicode/index.htm); the [Basic Latin](http://www.fileformat.info/info/unicode/block/basic_latin/index.htm) block contains the 128 codepoints inherited from ASCII. – ehird Jan 23 '12 at 01:59
  • As for the appropriate type for your program, it depends on the specific format and what you're doing, but if they don't contain any binary data, then `Text` would work fine. However, if the strings are always pure ASCII, and you're processing a large amount of data, then `ByteString` is likely to be faster. – ehird Jan 23 '12 at 02:01
  • Yes, the files are strictly ASCII, and performance is the goal. Thank you. – Xander Dunn Jan 23 '12 at 02:03
  • How do I create a Word8 now? – peer Feb 13 '16 at 20:08