
Here is my problem: I have to read "binary" files, that is, files which have varying "record" sizes, and which may contain binary data, as well as UTF-8-encoded text fields.

Reading a given number of bytes from an input file is trivial, but I was wondering if there were functions to easily read a given number of characters (not bytes) from a file? For instance, if I know I need to read a 10-character field (encoded in UTF-8, it would be at least 10 bytes long, but could go up to 40 bytes, if we're talking "high" codepoints).

I emphasize that I'm reading a "mixed" file, that is, I cannot process it whole as UTF-8, because the binary fields have to be read without being interpreted as UTF-8 characters.

So, while doing it by hand is pretty straightforward (the naïve byte-by-byte approach isn't hard to implement, even though I'm dubious about its efficiency), I'm wondering if there are better alternatives out there. If possible, in the standard library, but I'm open to 3rd-party code too, if my organization validates its use.

Kzwix
  • Really need more detail of the format of the binary file. Are the text fields fixed width or varying length with either a length indicator or null termination? I'd consider it bad design to have a binary file indicate width in Unicode code points. – Mark Tolonen Jul 08 '21 at 15:35
  • Why don't you `mmap` the file? Then, you can easily interchange byte-at-a-time for UTF-8 and binary later. It is the _fastest_ way to read a file. See my answer: https://stackoverflow.com/questions/33616284/read-line-by-line-in-the-most-efficient-way-platform-specific/33620968#33620968 – Craig Estey Jul 08 '21 at 15:37
  • [This old question](https://stackoverflow.com/questions/15199675/reading-utf-8-strings-from-a-binary-file) (suggested under "Related" in the right sidebar) looks like it's exactly what you're trying to do. – Steve Summit Jul 08 '21 at 16:18
  • Note that getwc cannot return full Unicode code points when using MSVC. MSVC's wint_t is an unsigned short, so if your data actually contains non-BMP characters you'll get surrogates at the very best (I haven't confirmed whether that much actually works or if it will generate an error of some kind). – SoronelHaetir Jul 08 '21 at 18:06
  • What do you mean by "binary fields"? What delimits these fields? UTF-8 has a pretty fixed pattern: any "new" byte with upper 4 bits of C/D/E/F is the first byte of a 2/3/4-byte character sequence, etc. The unknown right now is your "binary" data. – Old Geezer Jul 12 '21 at 14:55
  • @OldGeezer: I mean exactly what I wrote. These are not UTF-8 encoded parts, those are raw binary values. Hence, I cannot decode everything at once as UTF-8, I have to know what I'm reading first, in order to know whether to read it as UTF-8, or "raw" binary instead. – Kzwix Jul 13 '21 at 15:08

3 Answers

1

Here are two possibilities:

(1) If (but typically only if) your locale is set to handle UTF-8, the getwc function should read exactly one UTF-8-encoded Unicode character, even if it's multiple bytes long. So you could do something like

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

setlocale(LC_CTYPE, "en_US.UTF-8");  /* or "" to take the locale from the environment */
wint_t c;

for (int i = 0; i < 10; i++) {
    c = getwc(ifp);
    if (c == WEOF)
        break;  /* EOF, read error, or encoding error */
    /* do something with c */
}

Now, c here will be a single integer containing a Unicode codepoint, not a UTF-8 multibyte sequence. If (as is likely) you want to store UTF-8 strings in your in-memory data structure(s), you'd have to convert back to UTF-8, likely using wctomb.
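
For instance, a minimal sketch of that conversion using wctomb (this assumes the UTF-8 locale set above, and a wchar_t wide enough to hold the codepoint):

#include <limits.h>  /* MB_LEN_MAX */
#include <stdlib.h>  /* wctomb */

char mb[MB_LEN_MAX];
int n = wctomb(mb, (wchar_t)c);  /* byte count of the UTF-8 sequence, or -1 on failure */
if (n > 0) {
    /* append mb[0..n-1] to your in-memory UTF-8 string */
}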

(2) You could read N bytes from the input, then convert them to a wide character stream using mbstowcs. This isn't perfect, either, because it's hard to know what N should be, and the wide character string that mbstowcs gives you is, again, probably not what you want.
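
A rough sketch of what that might look like (the buffer size here is arbitrary; note that a multibyte character split at the end of the buffer makes mbstowcs report an error):

#include <stdlib.h>  /* mbstowcs */

char raw[64];
wchar_t wide[64];

size_t nread = fread(raw, 1, sizeof raw - 1, ifp);
raw[nread] = '\0';
size_t nwide = mbstowcs(wide, raw, sizeof wide / sizeof wide[0]);
if (nwide == (size_t)-1) {
    /* invalid -- possibly truncated -- multibyte sequence */
}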

But before exploring either of these approaches, the question really is, what is the format of your input? Those UTF-8-encoded fragments of text, are they fixed-size, or does the file format contain an explicit count saying how big they are? And in either case, is their size specified in bytes, or in characters? Hopefully it's specified in bytes, in which case you don't need to do any conversion to/from UTF-8: you can just read N bytes using fread. If the count is specified in terms of characters (which would be kind of weird, in my experience), you would probably have to use something like my approach (1) above.

Other than a loop like in (1) above, I don't know of a simple, encapsulated way to do the equivalent of "read N UTF-8 characters, no matter how many bytes it takes".

Steve Summit
  • From what I've been told, there are varying types of "records", and each record has its own format. One could be 10 UTF-8 characters, then 30 bytes of binary data, then 25 UTF-8 characters, another one could have an entirely different layout... I'll only know after I read the "magic" part of a given record, allowing me to know which record format to read. And the "magic" part isn't even at the first byte of a given record... – Kzwix Jul 08 '21 at 17:11
  • I guess I'll try those answers for now (yours, and the other leads mentioned in the comments). If necessary, I'll come back for more. Thanks to the people who answered ! – Kzwix Jul 08 '21 at 17:12
  • @Kzwix If you have any input into the design of this format, and if there's any possibility for change, you should urge them to *please* define those counts like 10 and 30 as being *bytes*, not characters. It'll make most processing tasks much easier. (It will however place some burden on the code which constructs the records, in that it can't blindly truncate at N bytes, but may have to limit itself to, say, N-3 bytes, if the next character happens to be a 4-byte UTF-8 character that won't fit.) – Steve Summit Jul 08 '21 at 17:57
  • Re "*If you have any input into the design of this format*", Or add a length prefix in bytes. You can still limit the length to 10 characters or require 10 characters, but add the encoded length of those characters in the file. – ikegami Jul 08 '21 at 19:08
  • Nope, the format is already defined, and it's because it's a conversion from old EBCDIC-encoded files. So as not to lose anything, we get the same number of characters, but encoded in UTF-8 instead. (And they didn't bother adding a length prefix, either) – Kzwix Jul 09 '21 at 08:41
  • @Kzwix Well, you have my sympathies, because I've had to deal with formats that were manifestly *not* designed for ease of reading, also. (As a commenter in [that older thread](https://stackoverflow.com/questions/15199675) said, "wow, that'd be a really terrible format.") – Steve Summit Jul 09 '21 at 18:37
  • If I'm understanding you correctly, the text fields are a fixed number of UTF-8-encoded characters long, and are followed immediately by another binary field. So the text field is effectively variable-length, but the only way to know how long it is (or to read the following binary field correctly) is to parse the UTF-8 characters perfectly. What fun. – Steve Summit Jul 09 '21 at 18:37
  • Yup, you understood correctly. I'm having the time of my life ;) – Kzwix Jul 11 '21 at 15:32
1

You could also use something like this:

#include <stdint.h>  // int32_t, uint32_t
#include <stdio.h>   // FILE, getc, ungetc, EOF

static unsigned char num_most_significant_ones[] = {
    /* 80 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* 90 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* A0 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* B0 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* C0 */   2, 2, 2, 2, 2, 2, 2, 2,   2, 2, 2, 2, 2, 2, 2, 2,
    /* D0 */   2, 2, 2, 2, 2, 2, 2, 2,   2, 2, 2, 2, 2, 2, 2, 2,
    /* E0 */   3, 3, 3, 3, 3, 3, 3, 3,   3, 3, 3, 3, 3, 3, 3, 3,
    /* F0 */   4, 4, 4, 4, 4, 4, 4, 4,   5, 5, 5, 5, 6, 6, 7, 8
};

static unsigned char lead_byte_data_mask[] = {
   0x7F, 0, 0x1F, 0x0F, 0x07, 0x03, 0x01
};

static int32_t min_by_len[] = {
   -1, 0x00, 0x80, 0x800, 0x10000
};

// buf must be capable of accommodating at least 4 bytes.
// Returns 0 on EOF or read error.
size_t read_one_utf8_char(FILE* stream, char* buf) {
   int lead = getc(stream);
   if (lead == EOF)
      return 0;

   buf[0] = lead;
   if (lead < 0x80)
      return 1;

   unsigned len = num_most_significant_ones[ lead - 0x80 ];
   if (len == 1 || len > 6)
      goto ERROR;

   unsigned char mask = lead_byte_data_mask[len];
   uint32_t cp = lead & mask;
   for (int i=1; i<len; ++i) {
      int ch = getc(stream);
      if (ch == EOF)  // Premature EOF or read error.
         goto ERROR;
      if ((ch & 0xC0) != 0x80) {  // Premature end of character.
         ungetc(ch, stream);
         goto ERROR;
      }
      cp = (cp << 6) | (ch & 0x3F);
      if (i < 4)
         buf[i] = ch;
   }

   if (len > 4 || cp < min_by_len[len] || ( cp >= 0xD800 && cp < 0xE000 ) || cp >= 0x110000)
      goto ERROR;

   return len;

ERROR:
   // Return U+FFFD.
   buf[0] = 0xEF;
   buf[1] = 0xBF;
   buf[2] = 0xBD;
   return 3;
}

Unlike getwc, this returns UTF-8.

Also, it validates, replacing illegal sequences with U+FFFD. (It doesn't replace [noncharacters](http://www.unicode.org/faq/private_use.html#noncharacters); see [Corrigendum #9](https://www.unicode.org/versions/corrigendum9.html).) I don't know if getwc does that.

Untested.
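
For illustration, a hypothetical wrapper to read a fixed-width field of N characters with it might look like this (also untested; the wrapper name is made up):

// Read a field of nChars UTF-8 characters into dst. The caller must
// supply at least 4 * nChars + 1 bytes. Returns 0 on EOF or read error.
int read_utf8_field(FILE* stream, char* dst, size_t nChars) {
   size_t pos = 0;
   for (size_t i = 0; i < nChars; i++) {
      size_t n = read_one_utf8_char(stream, dst + pos);
      if (n == 0)
         return 0;
      pos += n;
   }
   dst[pos] = '\0';
   return 1;
}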

ikegami
  • Validation is a surprisingly complicated issue. You can't even say whether `getwc` does it, as it will depend on (a) the implementation, and perhaps also (b) the current locale. Overlong encodings are only one of the things you need to check for: there's also (1) flatly illegal 5- and 6-byte sequences (which are not part of the currently standard UTF-8 definition), (2) illegal 4-byte sequences (those at or above 0x110000), and (3) surrogates (0xD800-0xDFFF). – Steve Summit Jul 08 '21 at 18:57
  • @Steve Summit, Re "*flatly illegal 5- and 6- byte sequences*", That's handled. The check for >6 was accidentally removed, but re-added. – ikegami Jul 08 '21 at 18:58
  • @Steve Summit, Re "*illegal 4-byte sequences (those above 0x110000), and (3) surrogates (0xD800-0xF8FF).*", You are referring to [noncharacters](http://www.unicode.org/faq/private_use.html#noncharacters). You missed some, and they aren't actually illegal. See [Corrigendum #9](https://www.unicode.org/versions/corrigendum9.html). "*they are not illegal in interchange nor do they cause ill-formed Unicode text.*" This is now linked by the answer. – ikegami Jul 08 '21 at 19:00
  • The point is that this is all complicated enough that someone (like Kzwix) might reasonably decide that they needed to use a standard library implementation of some kind, rather than taking a chance and rolling their own. (But don't get me wrong -- I've rolled my own here, a couple of times.) – Steve Summit Jul 08 '21 at 19:02
  • @Steve Summit, Aye, if it's possible. – ikegami Jul 08 '21 at 19:03
  • It's possible that the UTF-8 spec is more strict about illegal characters than Corrigendum #9 is. I haven't researched this deeply, but [the Wikipedia article](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling) still says that 0xD800-0xDFFF and >=0x110000 are forbidden. – Steve Summit Jul 08 '21 at 19:07
  • @Steve Summit, It's called "corrigendum" cause it's supposed to correct such misunderstandings :) But it's right about 0x110000. They're outside of Unicode. And my code doesn't check that. I didn't realize this could be expressed in 4 bytes. – ikegami Jul 08 '21 at 19:09
  • The prohibition against 0xD800-0xDFFF and >=0x110000 comes, at least, from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629), which is still in force, and is an IETF rather than Unicode Consortium standard. – Steve Summit Jul 08 '21 at 19:14
  • @Steve Summit, Wait a sec, surrogates aren't noncharacters. argh. Well, I guess you've proved your point :) – ikegami Jul 08 '21 at 19:15
  • @Steve Summit Added checks for surrogates and outside of Unicode. – ikegami Jul 08 '21 at 19:40
  • Thanks for all the leads. I'll give it a try, and probably roll with calling getwc() repeatedly for now. If I need more performance, then maybe I'll try other ways, like reading ahead and buffering, decoding the bytestream myself, and so on. Also, I'll have to store the characters I read into a Perl variable (so I'll have to adapt to the formats described in perlguts - I still have to read that in depth ;) ). So don't worry for now, I'll give another shout if I need further help ^^ – Kzwix Jul 09 '21 at 08:46
  • Re "*I'll have to store the characters I read into Perl variable*", From C? Changing an existing scalar is a bit tricky, but creating one is easy. If `s` is NUL-terminated UTF-8, then you can use `newSVpvn_utf8(s, strlen(s), 1)`. – ikegami Jul 09 '21 at 18:16
0

Well, for now, I've settled on creating a function which allocates a buffer of size 4 * numberOfCharactersToRead + 1 (as a UTF-8 character is encoded in at most 4 bytes).

Then I fread() that many bytes (or as many as I can, if I hit EOF). Then I merely test the upper bits of each lead byte to know whether I hit a 1-byte, 2-byte, 3-byte, or 4-byte character, check the following continuation bytes as needed, and note where that puts me.

After I read the required number of characters, I take note of the number of bytes it really took, and I then adjust the file pointer back if I had read more than needed. I also realloc() the buffer to downsize it to the needed size.
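
A minimal sketch of that idea (the helper name is illustrative, continuation bytes aren't validated, and error handling is trimmed):

#include <stdio.h>
#include <stdlib.h>

// Read nChars UTF-8 characters from fp. Returns a NUL-terminated
// malloc'd buffer, or NULL on error.
char *read_utf8_chars(FILE *fp, size_t nChars) {
    size_t cap = 4 * nChars + 1;  // at most 4 bytes per UTF-8 character, plus NUL
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    size_t got = fread(buf, 1, cap - 1, fp);
    size_t pos = 0, chars = 0;
    while (chars < nChars && pos < got) {
        unsigned char lead = buf[pos];
        size_t len = lead < 0x80           ? 1   // 0xxxxxxx
                   : (lead & 0xE0) == 0xC0 ? 2   // 110xxxxx
                   : (lead & 0xF0) == 0xE0 ? 3   // 1110xxxx
                   : (lead & 0xF8) == 0xF0 ? 4   // 11110xxx
                   : 0;                          // invalid lead byte
        if (len == 0 || pos + len > got) {
            free(buf);
            return NULL;
        }
        pos += len;
        chars++;
    }
    if (chars < nChars) {  // hit EOF before reading every character
        free(buf);
        return NULL;
    }

    fseek(fp, (long)pos - (long)got, SEEK_CUR);  // give back the excess bytes
    buf[pos] = '\0';
    char *shrunk = realloc(buf, pos + 1);        // downsize to what was used
    return shrunk ? shrunk : buf;
}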

I'm pretty sure it's more efficient than calling getwc() repeatedly before converting the wchar_t back to UTF-8 (because, in the end, I need to keep it as a UTF-8 sequence, as I'm storing that data in a Perl scalar, and that's the way Perl does it internally).

I end the UTF-8 "string" I read with a 0 (hence the extra byte), in order to be able to print it with standard C functions, and that's that.

Also, to store the "raw binary" along with UTF-8-encoded text, when I concatenate them, I merely encode each raw byte value as the UTF-8 sequence for the corresponding codepoint (U+0000 through U+00FF). This way, under Perl, I get to treat a character or a "raw byte" the same way, as UTF-8 characters. I'll just have to get the "codepoint" value back when I need to work on a raw byte disguised as a character.
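
A minimal sketch of that byte-to-codepoint mapping (the helper name is made up): a byte below 0x80 encodes as itself, and anything else becomes a two-byte sequence.

#include <stddef.h>  /* size_t */

// Encode one raw byte (0x00-0xFF) as the UTF-8 sequence for the
// codepoint of the same value. Writes 1 or 2 bytes to out and
// returns the number written.
size_t byte_to_utf8(unsigned char b, char *out) {
    if (b < 0x80) {  // U+0000..U+007F: the byte encodes as itself
        out[0] = (char)b;
        return 1;
    }
    out[0] = (char)(0xC0 | (b >> 6));    // 110xxxxx lead byte (0xC2 or 0xC3)
    out[1] = (char)(0x80 | (b & 0x3F));  // 10xxxxxx continuation byte
    return 2;
}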

I know I hadn't mentioned Perl in the tags, but it didn't matter for the question, so I'm only mentioning it in order to provide some context as to why I went that way.

Thanks to all the people having posted helpful suggestions :)

Kzwix