
Is there a way to get the first UTF-8 Char in a ByteString in O(1) time? I'm looking for something like

headUtf8 :: ByteString -> Char
tailUtf8 :: ByteString -> ByteString

I'm not yet constrained to use strict or lazy ByteString, but I'd prefer strict. For lazy ByteString, I can cobble something together via Text, but I'm not sure how efficient (especially space-complexity-wise) this is.

import Data.ByteString.Lazy (ByteString)
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (decodeUtf8With, encodeUtf8)
import Data.Text.Encoding.Error (lenientDecode)

headUtf8 :: ByteString -> Char
headUtf8 = T.head . decodeUtf8With lenientDecode

tailUtf8 :: ByteString -> ByteString
tailUtf8 = encodeUtf8 . T.tail . decodeUtf8With lenientDecode
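
For strict ByteString, I assume the analogous definitions would be the following sketch; note that the strict decodeUtf8With decodes the whole input up front, so this is O(n) rather than O(1):

import Data.ByteString (ByteString)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With, encodeUtf8)
import Data.Text.Encoding.Error (lenientDecode)

-- Strict variant: the entire ByteString is decoded before we can
-- take the head, so this does not meet the O(1) requirement.
headUtf8 :: ByteString -> Char
headUtf8 = T.head . decodeUtf8With lenientDecode

tailUtf8 :: ByteString -> ByteString
tailUtf8 = encodeUtf8 . T.tail . decodeUtf8With lenientDecode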

In case anyone is interested, this problem arises when using Alex to make a lexer that supports UTF-8 characters¹.


¹ I am aware that since Alex 3.0 you only need to provide alexGetByte (and that is great!), but I still need to be able to get characters in other code in the lexer.
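
Feeding Alex bytes from a strict ByteString is straightforward; a minimal sketch of the Alex 3 interface under that assumption (a real lexer would also track source position in AlexInput):

import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Here the lexer input is just the remaining bytes.
type AlexInput = BS.ByteString

alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte = BS.uncons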


2 Answers


You want the Data.ByteString.UTF8 module in the utf8-string package. It contains an uncons function with the following signature:

uncons :: ByteString -> Maybe (Char, ByteString)

You can then define:

import Data.ByteString.UTF8 (ByteString, uncons)
import Data.Maybe (fromJust)

headUtf8 :: ByteString -> Char
headUtf8 = fst . fromJust . uncons

tailUtf8 :: ByteString -> ByteString
tailUtf8 = snd . fromJust . uncons
  • I did not know this package existed, but this is exactly what I was looking for. This means I can completely eliminate any dependency on `text`. – Alec Nov 04 '16 at 07:23
  • Wow! This tiny library has precisely the functionality I need for my lexer. Thanks a ton. – Alec Nov 04 '16 at 07:33
  • Just keep in mind these functions are partial; they are undefined on `Data.ByteString.empty`. – chepner Nov 04 '16 at 12:54
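
Following up on the partiality point: total variants fall out of uncons directly (the May names here are just illustrative):

import Data.ByteString.UTF8 (ByteString, uncons)

headUtf8May :: ByteString -> Maybe Char
headUtf8May = fmap fst . uncons

tailUtf8May :: ByteString -> Maybe ByteString
tailUtf8May = fmap snd . uncons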

The longest UTF-8 encoding is 6 bytes, so if we try 1, 2, ... bytes, the loop finishes after at most 6 steps, making it O(1):

import qualified Data.Text as Text
import Data.Text.Encoding (Decoding(Some), streamDecodeUtf8)
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

splitUtf8 :: ByteString -> (Char, ByteString)
splitUtf8 bs = go 1
  where
    -- Decode an ever-longer prefix until the decoder consumes it
    -- completely, i.e. leaves no undecoded "slack" bytes behind.
    go n | BS.null slack = (Text.head t, bs')
         | otherwise = go (n + 1)
      where
        (bs1, bs') = BS.splitAt n bs
        Some t slack _ = streamDecodeUtf8 bs1

For example, here's splitting a 2+3-byte ByteString:

*SO_40414452> splitUtf8 $ BS.pack [197, 145, 226, 138, 162]
('\337',"\226\138\162")

and here a 3+2-byte one:

*SO_40414452> splitUtf8 $ BS.pack [226, 138, 162, 197, 145]
('\8866',"\197\145")
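
This version is partial, though: it throws on empty or invalid input and loops forever on input that ends mid-character. A bounded, total variant might look like this (a sketch; splitUtf8May is a made-up name, and lenient decoding maps invalid bytes to U+FFFD):

import qualified Data.Text as Text
import Data.Text.Encoding (Decoding(Some), streamDecodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

splitUtf8May :: ByteString -> Maybe (Char, ByteString)
splitUtf8May bs = go 1
  where
    go n
      -- A valid code point is at most 4 bytes, and we can never take
      -- more bytes than the input contains.
      | n > 4 || n > BS.length bs = Nothing
      | BS.null slack && not (Text.null t) = Just (Text.head t, bs')
      | otherwise = go (n + 1)
      where
        (bs1, bs') = BS.splitAt n bs
        Some t slack _ = streamDecodeUtf8With lenientDecode bs1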
  • The longest UTF-8 encoding is 4 bytes. 5- and 6-byte encodings are invalid and have been invalid for many years. No characters were ever assigned which would have had 5- or 6-byte encodings. – Dietrich Epp Nov 04 '16 at 06:45
  • @DietrichEpp: thanks. My argument only needs the longest UTF-8 encoding to be a finite number :) – Cactus Nov 04 '16 at 06:46