
Can anyone explain the pros and cons of using the Data.Text and Data.ByteString.Char8 data types? Does working with ASCII-only text change these pros and cons? Do their lazy variants change the story as well?

Thomas Eding

1 Answer


Data.ByteString.Char8 provides functions to treat ByteString values as sequences of 8-bit characters (each byte is read as one Char, which covers ASCII and Latin-1 but nothing beyond), while Data.Text is an independent type supporting the entirety of Unicode.

ByteString and Text are essentially the same as far as representation goes: strict, unboxed arrays with lazy variants based on lists of strict chunks. The main difference is that ByteString stores octets (i.e. Word8s), while Text stores Chars, encoded in UTF-16 (text 2.0 and later use UTF-8 internally instead).
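To make the octets-versus-characters distinction concrete, here is a minimal sketch. The sample string and the names `sample`, `truncated`, `textLength` and `utf8Bytes` are made up for illustration. It relies on the documented behaviour that `Data.ByteString.Char8.pack` truncates each Char to 8 bits.

```haskell
import qualified Data.ByteString       as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text             as T
import qualified Data.Text.Encoding    as TE
import           Data.Word             (Word8)

-- Illustrative input: U+03BB (Greek small letter lambda) followed by 'x'.
sample :: String
sample = "\955x"

-- Char8.pack keeps only the low 8 bits of each Char,
-- so U+03BB is silently stored as the single byte 0xBB.
truncated :: [Word8]
truncated = B.unpack (BC.pack sample)

-- Text counts characters, not bytes.
textLength :: Int
textLength = T.length (T.pack sample)

-- Turning Text into bytes is an explicit encoding step.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack sample))

main :: IO ()
main = do
  print truncated   -- [187,120]
  print textLength  -- 2
  print utf8Bytes   -- [206,187,120]
```

For input that really is ASCII-only, both routes produce the same bytes. The difference only shows up once a character above U+007F appears.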

If you're working with ASCII-only text, then using Data.ByteString.Char8 will probably be faster than Text, and use less memory; however, you should ask yourself whether you're really sure that you're only ever going to work with ASCII. Basically, in 99% of cases, using Data.ByteString.Char8 over Text is a speed hack — octets aren't characters, and any Haskeller can agree that using the correct type should be prioritised over raw, bare-metal speed. You should usually only consider it if you've profiled the program and it's a bottleneck. Text is well-optimised, and the difference will probably be negligible in most cases.

Of course, there are non-speed-related situations in which Data.ByteString.Char8 is warranted. Consider a file containing data that is essentially binary, not text, but separated into lines; using lines is completely reasonable. Additionally, it's entirely conceivable that an integer might be encoded in ASCII decimal in the context of a binary format; using readInt would make perfect sense in that case.
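As a rough sketch of that "almost-binary" case, the following uses `lines` and `readInt` from Data.ByteString.Char8 on a made-up payload. The names `payload` and `parseInts`, and the choice to skip malformed lines, are assumptions for the example rather than anything prescribed by the library.

```haskell
import qualified Data.ByteString.Char8 as BC
import           Data.Maybe            (mapMaybe)

-- Hypothetical payload: one ASCII decimal integer per line, plus one bad line.
payload :: BC.ByteString
payload = BC.pack "10\n20\nnot-a-number\n30\n"

-- Split on newlines and keep only lines that consist entirely of an integer.
-- readInt returns the parsed Int together with the unconsumed remainder.
parseInts :: BC.ByteString -> [Int]
parseInts = mapMaybe parseLine . BC.lines
  where
    parseLine l = case BC.readInt l of
      Just (n, rest) | BC.null rest -> Just n
      _                             -> Nothing

main :: IO ()
main = print (parseInts payload)  -- [10,20,30]
```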

So, basically:

  1. Data.ByteString.Char8: For pure ASCII situations where performance is paramount, and to handle "almost-binary" data that has some ASCII components.
  2. Data.Text: Text, including any situation where there's the slightest possibility of something other than ASCII being used.
ehird
  • I can guarantee there will be ASCII-only text, as my program processes very specific computer generated C files. I'll try both out in any case. – Thomas Eding Jan 18 '12 at 19:54
  • I would probably go for `Data.ByteString.Char8`, then, as you'll essentially be dealing with a binary format that only *resembles* text. (I'd also recommend checking out [attoparsec](http://hackage.haskell.org/package/attoparsec) for parsing the files.) – ehird Jan 18 '12 at 20:01
  • You also mention that Text encodes as UTF-16 and ByteString as an octet. Does this in general impact memory usage? My application is a code rewriter, and as it is, it uses enormous amounts of memory that I can trace to using String. I already intern my strings, so any improvement would be welcome. This is why I want to change data types. – Thomas Eding Jan 18 '12 at 20:03
  • @trinithis: Well, if your data is all ASCII, then `Text` will encode each character as two bytes, but `ByteString` will encode them as one. If you're currently using `String`, though, I wouldn't worry too much about it; `String` has *huge* overhead (5 words per character(!)), far more than the other two. See [this summary of memory footprints](http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html). – ehird Jan 18 '12 at 20:07
  • @trinithis: Though, of course, you should bear in mind that `String` benefits from sharing, while `ByteString` and `Text`, as unboxed arrays, don't; however, `ByteString` and `Text` both take substrings without copying, and they're just so much smaller to start with that you'd have to try pretty hard to make that disadvantage matter. – ehird Jan 18 '12 at 20:10