3

I have a UTF-8 string given as a null-terminated const char*. I would like to know if the first letter of this string is an a by itself. The following code

bool f(const char* s) {
  return s[0] == 'a';
}

is wrong, as the first letter (grapheme cluster) of the string might be à, made from two Unicode scalar values: a (U+0061) and a combining grave accent (U+0300). So this very simple question seems extremely difficult to answer unless you know how grapheme clusters are formed.
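
To make the failure concrete, here is a minimal sketch. The helper name first_byte_is_a and the two test strings are only for illustration; the byte values are the UTF-8 encodings of the decomposed and precomposed forms of à.

```c
#include <stdbool.h>

/* Same test as f above, under a hypothetical name for illustration. */
bool first_byte_is_a(const char *s) {
    return s[0] == 'a';
}

/* Decomposed "à": 'a' (U+0061) followed by U+0300, which is CC 80 in UTF-8. */
static const char decomposed_a_grave[] = "a\xCC\x80";

/* Precomposed "à": U+00E0, which is C3 A0 in UTF-8. */
static const char precomposed_a_grave[] = "\xC3\xA0";
```

first_byte_is_a(decomposed_a_grave) returns true even though a reader sees à, while first_byte_is_a(precomposed_a_grave) returns false for text that looks identical.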

Still, many libraries parse UTF-8 files (YAML files, for instance) and therefore should be able to answer this kind of question. But these libraries don't seem to depend upon a Unicode library.

So my questions are:

  • How to write code that checks if a string starts with the letter a?

  • Assuming that there is no simple answer to the first question, how do parsers (such as YAML parsers) manage to parse files without being able to answer this kind of question?

Remy Lebeau
InsideLoop
  • Why would those libraries need to know about grapheme clusters? Text serialization tends to be defined in terms of codepoints. Can’t imagine that YAML is an exception. – Ry- Jun 19 '17 at 19:34
  • There's a difference between parsing and validating utf-8, and interpreting it correctly. The former is easy, the latter not so much. – James Jun 19 '17 at 19:34
  • Decode the UTF-8 to Unicode, [normalize it](http://unicode.org/reports/tr15/) using NFC or NFKC to reduce combining marks, then test the normalized data as needed. – Remy Lebeau Jun 19 '17 at 23:06

4 Answers

5

It simply doesn't matter.

Consider: Is this string valid JSON?

"̀"

(That's the byte sequence 22 cc 80 22.)

You seem to be arguing that it is not, since a JSON string should start with " (QUOTATION MARK) but this one starts with the grapheme cluster (QUOTATION MARK + COMBINING GRAVE ACCENT).

The only reasonable response is that you're thinking at the wrong level: Text serialization is defined in terms of code points. Grapheme clusters are only considered for processing natural language and editing text.

And this certainly is considered valid JSON.

>>> import json
>>> json.loads(bytes.fromhex('22cc8022'))
'̀'
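
The same point can be seen at the code point level in C. The following is a minimal sketch, not how any particular JSON library works: utf8_decode and json_doc are names made up for this example, and the decoder skips some validation (overlong forms, surrogates) for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s (at most n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, or malformed.
 * Illustrative only: overlong forms and surrogates are not rejected.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp) {
    if (n == 0) return 0;
    size_t len;
    uint32_t c = s[0];
    if (c < 0x80)                { len = 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else return 0;               /* stray continuation or invalid lead byte */
    if (len > n) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* The JSON document from above: 22 cc 80 22. */
static const unsigned char json_doc[] = { 0x22, 0xCC, 0x80, 0x22 };
```

Decoded this way, json_doc is the code point sequence U+0022, U+0300, U+0022: an opening quote, a string of one code point, and a closing quote. A parser that works on code points never sees the quote and the accent as a single unit.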
Josh Lee
  • I thought that text serialization was defined in terms of characters (or grapheme clusters), not in terms of code points. Your example of the string is clear. I would have thought that it was not a valid string. That's why I was very puzzled. Now everything is clear. – InsideLoop Jun 19 '17 at 19:51
  • As a related question, how do regular expressions work? Does "^a.*" match "a + (COMBINING GRAVE ACCENT)"? – InsideLoop Jun 19 '17 at 19:57
  • @InsideLoop: Yes, it does match. – R.. GitHub STOP HELPING ICE Jun 19 '17 at 20:14
  • @InsideLoop: "characters (or grapheme clusters)" is incorrect; these are not remotely equivalent concepts. Unicode characters are roughly equivalent to codepoints (with some technical considerations; not all codepoints correspond to characters, etc.). – R.. GitHub STOP HELPING ICE Jun 19 '17 at 20:15
  • @R: Ok. Swift (from Apple) uses characters and grapheme clusters the same way. They might be wrong. – InsideLoop Jun 19 '17 at 20:21
  • @InsideLoop: If so, Apple is just wrong in their use of Unicode terminology. It's probably a consequence of their use of NFD, which is a highly unnatural representation relative to user expectation. See my answer for related info. – R.. GitHub STOP HELPING ICE Jun 20 '17 at 00:03
2

How to write code that checks if a string starts with the letter a?

There is no simple answer to this. To answer this question, you would need to test the Unicode CCC (Canonical_Combining_Class) property of the codepoint that follows the a. If it's non-zero, then it is a combining character.

Of course, the C standard library has no API for doing so.
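
Without a Unicode library, the lookup has to be approximated by hand. The sketch below is a stand-in for the CCC test, not an equivalent of it: it only recognizes code points in the five Combining Diacritical Marks blocks, and the names combining_blocks, is_combining_mark and starts_with_bare_a are invented for this example. A real implementation would consult the Unicode Character Database.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Blocks that consist of combining marks. This is a shortcut and an
 * assumption of this sketch; it is not the full set of combining characters. */
static const struct { uint32_t first, last; } combining_blocks[] = {
    { 0x0300, 0x036F },  /* Combining Diacritical Marks */
    { 0x1AB0, 0x1AFF },  /* Combining Diacritical Marks Extended */
    { 0x1DC0, 0x1DFF },  /* Combining Diacritical Marks Supplement */
    { 0x20D0, 0x20FF },  /* Combining Diacritical Marks for Symbols */
    { 0xFE20, 0xFE2F },  /* Combining Half Marks */
};

bool is_combining_mark(uint32_t cp) {
    size_t count = sizeof combining_blocks / sizeof combining_blocks[0];
    for (size_t i = 0; i < count; i++) {
        if (cp >= combining_blocks[i].first && cp <= combining_blocks[i].last)
            return true;
    }
    return false;
}

/* True if s starts with 'a' and the next code point is not in the table.
 * Only 2- and 3-byte sequences are decoded because every table entry lies
 * in that range; anything else is treated as "not combining". */
bool starts_with_bare_a(const char *s) {
    const unsigned char *u = (const unsigned char *)s;
    if (u[0] != 'a') return false;
    uint32_t cp = 0;
    if ((u[1] & 0xE0) == 0xC0 && (u[2] & 0xC0) == 0x80) {
        cp = ((uint32_t)(u[1] & 0x1F) << 6) | (u[2] & 0x3F);
    } else if ((u[1] & 0xF0) == 0xE0 && (u[2] & 0xC0) == 0x80
               && (u[3] & 0xC0) == 0x80) {
        cp = ((uint32_t)(u[1] & 0x0F) << 12)
           | ((uint32_t)(u[2] & 0x3F) << 6)
           | (u[3] & 0x3F);
    }
    return !is_combining_mark(cp);
}
```

With this, starts_with_bare_a("abc") is true and starts_with_bare_a("a\xCC\x80") is false. The short-circuit checks on each byte keep the reads inside the null-terminated string.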

How do parsers (such as YAML parsers) manage to parse files without being able to answer this kind of question?

This is not a question they need to answer. Why? Because they never ask it.

If YAML is reading a key, then it reads up until the name terminating character (like :). A Unicode combining character cannot combine through such a character, and the YAML specification doesn't care if there's a combining character on the other side of the :. If it sees a :, then it knows that it has reached the end of the name, and everything before that is a key.

If it's reading a text string, then it similarly keeps reading until it reads a terminating character or character sequence.

Parsing text with most text formats is based on regular expression matching (or something similar) against some terminating condition. That is, a string would be any of some set of characters (alternatively, all characters except for some set), up to the terminating character(s).
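
As a rough illustration of "read until the terminator", here is a minimal sketch. It is not how any real YAML parser is written (it ignores quoting, indentation and the rest of the grammar), and key_length is a name made up for this example. It relies on a property of UTF-8: every byte of a multi-byte sequence has its high bit set, so the byte 0x3A can only ever be the code point U+003A (:).

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the length in bytes of the key that precedes the first ':' in line,
 * or (size_t)-1 if the line contains no ':'.
 * Searching byte by byte is safe in UTF-8 because ASCII bytes never occur
 * inside a multi-byte sequence.
 */
size_t key_length(const char *line) {
    const char *colon = strchr(line, ':');
    return colon ? (size_t)(colon - line) : (size_t)-1;
}
```

For the line "a\xCC\x80: 1" (a decomposed à as the key), key_length returns 3: the key is the three bytes before the colon, combining mark included, and the scan never needed to know what a grapheme cluster is.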

Nicol Bolas
  • Thanks. I thought that it was looking for the character (or grapheme cluster) `:`, not the code point `:`. Now, I have a better understanding of those notions. Thanks. – InsideLoop Jun 19 '17 at 19:49
1

s[0] == 'a' is the correct test for whether the first character is a. If a string contains a decomposed version of à, that would be two characters, a and the combining grave. Up until Apple decided to enforce NFD everywhere, this was basically a non-issue, because people who wanted à to be treated as a character/letter by itself would enter it as one, and people who wanted it as an a with a mark attached would enter it as two. Yes, this goes against the Unicode intent of canonical equivalence, but the Unicode intent of canonical equivalence largely goes against user expectation and intent (not to mention existing text & text processing models).

If you really want to check that the first character is an a and is not followed by any combining marks, this should work:

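/* Needs <wchar.h> and <limits.h>; wcwidth is POSIX, not ISO C.
   Assumes s[0] == 'a' and that setlocale() has selected a UTF-8 LC_CTYPE. */
wchar_t tmp = WEOF;
mbrtowc(&tmp, s+1, MB_LEN_MAX, &(mbstate_t){0});
if (tmp && wcwidth(tmp)==0) {
    /* character following 'a' is a combining mark */
}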

This depends on the POSIX wcwidth function, but you can find portable versions of it or write your own based on the Unicode tables (really you could write a simpler function that only checks for combining status, not also the East Asian Width property).

To answer your second question about parsers, they don't have any reason to know or care about the issue you're concerned about. File formats like YAML, JSON, etc. are not subject to canonical equivalence (at least not at the parsing level; the content stored in the file, which applications will interpret, might be subject to it). A string that is a different sequence of Unicode characters, even if it would be canonically equivalent, is a different string that compares not-equal.
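
A small sketch of that last sentence, with names (same_key, nfc_a_grave, nfd_a_grave) invented for illustration. It compares the precomposed and decomposed encodings of à byte for byte, which is effectively what a format-level parser does when no normalization step is applied.

```c
#include <stdbool.h>
#include <string.h>

/* Byte-for-byte equality; no normalization is attempted. */
bool same_key(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}

static const char nfc_a_grave[] = "\xC3\xA0";   /* U+00E0 */
static const char nfd_a_grave[] = "a\xCC\x80";  /* U+0061 U+0300 */
```

same_key(nfc_a_grave, nfd_a_grave) is false: the two strings are canonically equivalent and render the same, yet they would be two distinct keys in a file.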

R.. GitHub STOP HELPING ICE
-1

Here is code that checks whether a UTF-8 string starts with the letter 'a', counting the precomposed accented variants (U+00C0 to U+00C5 and U+00E0 to U+00E5) as 'a':

#include <stdbool.h>
#include <string.h>

bool f(const char* s) {

        if (s[0] == 'a') return true;

        if (strlen(s) >= 2 && s[0] == '\xc3') {
                char s1 = s[1];
                if (s1 == '\x80') return true; // LATIN CAPITAL LETTER A WITH GRAVE
                if (s1 == '\x81') return true; // LATIN CAPITAL LETTER A WITH ACUTE
                if (s1 == '\x82') return true; // LATIN CAPITAL LETTER A WITH CIRCUMFLEX
                if (s1 == '\x83') return true; // LATIN CAPITAL LETTER A WITH TILDE
                if (s1 == '\x84') return true; // LATIN CAPITAL LETTER A WITH DIAERESIS
                if (s1 == '\x85') return true; // LATIN CAPITAL LETTER A WITH RING ABOVE

                if (s1 == '\xa0') return true; // LATIN SMALL LETTER A WITH GRAVE
                if (s1 == '\xa1') return true; // LATIN SMALL LETTER A WITH ACUTE
                if (s1 == '\xa2') return true; // LATIN SMALL LETTER A WITH CIRCUMFLEX
                if (s1 == '\xa3') return true; // LATIN SMALL LETTER A WITH TILDE
                if (s1 == '\xa4') return true; // LATIN SMALL LETTER A WITH DIAERESIS
                if (s1 == '\xa5') return true; // LATIN SMALL LETTER A WITH RING ABOVE
        }
        return false;
}
user803422