
The problem I have is that, given a sequence of bytes, I want to determine its longest prefix which forms a valid Unicode character (extended grapheme cluster), assuming UTF-8 encoding.

I am using Swift, so I would like to use Swift's built-in function(s) to do so. But these functions only decode a complete sequence of bytes. So I was thinking of converting prefixes of the byte sequence via Swift and taking the last prefix that didn't fail and consists of exactly one character. Obviously, this might mean trying out the entire sequence of bytes, which I want to avoid. A solution would be to stop trying prefixes after 4 prefixes in a row have failed. If the property asked about in my question holds, this would guarantee that all longer prefixes must also fail.
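To make the idea concrete, here is a rough sketch of that prefix-testing approach (the byte values are just an assumed example: U+1F44D "👍" followed by "A"):

```swift
// Sketch: decode incrementally longer prefixes, remember the last prefix
// that forms exactly one character, and give up after 4 failures in a row.
let bytes: [UInt8] = [0xF0, 0x9F, 0x91, 0x8D, 0x41] // "👍" then "A"
var best: (byteCount: Int, char: Character)? = nil
var failures = 0
for end in 1...bytes.count {
    guard failures < 4 else { break } // relies on the prefix property holding
    guard let s = String(bytes: bytes[0..<end], encoding: .utf8) else {
        failures += 1 // not (yet) valid UTF-8
        continue
    }
    if s.count == 1 {
        best = (end, s.first!) // longest single-character prefix so far
        failures = 0
    } else {
        break // a second character has started, so the first is complete
    }
}
// best is now (byteCount: 4, char: "👍")
```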

I find the Unicode Text Segmentation Standard unreadable, otherwise I would try to directly implement boundary detection of extended grapheme clusters...

Steven Obua
  • "implement boundary detection of extended grapheme clusters." haha you *don't* want to do that. That way lies madness. Honestly using [`init?(validatingUTF8:)`](https://developer.apple.com/documentation/swift/string/1641706-init) to check incrementally longer prefixes isn't a bad idea. While extended grapheme clusters *can* get pathologically big (try copying [this answer](https://stackoverflow.com/a/1732454/3141234) into a [unicode inspector](https://apps.timwhitlock.info/unicode/inspect) lol), the vast majority of them are tiny. – Alexander Apr 19 '20 at 22:14
  • I definitely don't want to do that :-) But that's why I am asking the question! I like my code to have predictable performance even in an error case. – Steven Obua Apr 19 '20 at 22:17
  • It looks like the perfect solution would be a function like `findLongestValidCharacter(in input: [UnicodeScalar]) -> Character`. Even if such a function existed, pathologically long characters would still be an issue. The "bad" cases would be less bad, given that the function could be `O(input.count)` rather than the `O(input.count^2)` of the brute-force approach, but that can still get bad. – Alexander Apr 19 '20 at 22:21
  • You could try spelunking into the standard library implementation, and see how it does its cluster breaking: https://github.com/apple/swift/blob/7039277e4d0af3afe8d793d3ee4ea7a72e03ff72/stdlib/public/core/CString.swift#L89 – Alexander Apr 19 '20 at 22:23
  • I am not so much worried about actual valid characters that have very long code point sequences. If that happens, oh well. But in my scenario it can very well be that the byte sequence is not a UTF-8 string at all. If that byte sequence is 80GB long, I don't want to check it all to find out that it isn't valid. – Steven Obua Apr 19 '20 at 22:25
  • Then your idea of capping the max length attempted is good! – Alexander Apr 19 '20 at 22:27
  • Yup, I think I will have to somehow do the cluster breaking myself, either from the Standard, or from looking at existing implementations. – Steven Obua Apr 19 '20 at 22:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212034/discussion-between-alexander-reinstate-monica-and-steven-obua). – Alexander Apr 19 '20 at 22:27
  • @StevenObua If you post a byte sequence and the expected result it would be much easier to understand what you are trying to accomplish – Leo Dabus Apr 19 '20 at 22:37
  • It's pretty hard to post a byte sequence of which I don't know if it exists, Leo ;-) – Steven Obua Apr 19 '20 at 22:46
  • Also important in my context: Can Unicode strings become shorter (in terms of number of extended grapheme clusters) when you add code points to them? It seems to me that unfortunately this might be true and it should be easy to construct an example for this. – Steven Obua Apr 19 '20 at 22:54
  • That pointer you posted is hilarious by the way, Alexander – Steven Obua Apr 19 '20 at 23:00
  • Hmmh. Not that easy to construct that example after all. – Steven Obua Apr 19 '20 at 23:16

1 Answer


After taking a long, hard look at the specification for computing the boundaries of extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules, it becomes clear that the rules for EGCs all take the form of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone, both of my questions are answered: 1) Yes, every non-empty prefix of the code points that form an EGC is itself an EGC. 2) No, adding a code point to a valid Unicode string will not decrease its length in terms of the number of EGCs it consists of.
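Point 2 can be sanity-checked in Swift, whose `String.count` counts extended grapheme clusters (the strings here are just illustrative examples):

```swift
// Appending code points never lowers the EGC count: U+0301 (COMBINING
// ACUTE ACCENT) merges into the previous cluster instead of removing one.
let a = "e"         // 1 cluster: "e"
let b = "e\u{301}"  // still 1 cluster: "é" (the accent joined the "e")
let c = "e\u{301}x" // 2 clusters: "é" and "x"
print(a.count, b.count, c.count) // prints "1 1 2"
```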

So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):

    func lex<S : Sequence>(_ input : S) -> (length : Int, out: Character)? where S.Element == UInt8 {
        // This code works under three assumptions, all of which hold:
        // 1) If a sequence of code points does not form a valid character, then
        //    appending code points to it does not yield a valid character either.
        // 2) Appending code points to a sequence of code points does not decrease
        //    its length in terms of extended grapheme clusters.
        // 3) A code point takes up at most 4 bytes in a UTF-8 encoding.
        var chars : [UInt8] = []
        var result : String = ""
        var resultLength = 0
        func value() -> (length : Int, out : Character)? {
            guard let character = result.first else { return nil }
            return (length: resultLength, out: character)
        }
        var length = 0
        var iterator = input.makeIterator()
        while length - resultLength <= 4 {
            guard let char = iterator.next() else { return value() }
            chars.append(char)
            length += 1
            guard let s = String(bytes: chars, encoding: .utf8) else { continue }
            guard s.count == 1 else { return value() }
            result = s
            resultLength = length
        }
        return value()
    }
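For example, with the function above (the byte values are illustrative; U+00E9 "é" encodes as `0xC3 0xA9` in UTF-8):

```swift
let bytes: [UInt8] = [0xC3, 0xA9, 0x78] // "é" followed by "x"
// lex stops as soon as appending a byte yields a second character:
if let (length, out) = lex(bytes) {
    print(length, out) // prints "2 é"
}
// Invalid input returns nil after inspecting only a bounded prefix,
// even for a very long sequence:
let junk = [UInt8](repeating: 0xFF, count: 1_000_000)
assert(lex(junk) == nil)
```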
Steven Obua