
I want to check whether a file is (likely) encoded in UTF-8. I don't want to use any external libraries (otherwise I would probably use Boost.Locale), just 'plain' C++17. I need this to be cross-platform compatible, at least on MS Windows and Linux, building with Clang, GCC and MSVC.

I am aware that such a check can only be a heuristic, since you can craft e.g. an ISO-8859-encoded file containing a weird combination of special characters that yields a valid UTF-8 sequence (corresponding to probably equally weird, but different, Unicode characters). For instance, the ISO-8859-1 text "Ã©" is the byte pair 0xC3 0xA9, which is also the valid UTF-8 encoding of "é".

My best attempt so far is to use std::wstring_convert and std::codecvt<char16_t, char, std::mbstate_t> to attempt a conversion from the input data (assumed to be UTF-8) into something else (UTF-16 in this case) and handle a thrown std::range_error as "the file was not UTF-8". Something like this:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

// deletable_facet wrapper from cppreference: std::codecvt has a protected
// destructor, so it cannot be instantiated directly.
template <class Facet>
struct deletable_facet : Facet
{
    using Facet::Facet;
    ~deletable_facet() {}
};

bool check(const std::filesystem::path& path)
{
    std::ifstream ifs(path, std::ios::binary);

    if (!ifs)
    {
        return false;
    }

    std::string data = std::string(std::istreambuf_iterator<char>(ifs), std::istreambuf_iterator<char>());

    std::wstring_convert<deletable_facet<std::codecvt<char16_t, char, std::mbstate_t>>, char16_t>
        conv16;
    try
    {
        std::u16string str16 = conv16.from_bytes(data);
        std::cout << "Probably UTF-8\n";
        return true;
    }
    catch (const std::range_error&)
    {
        std::cout << "Not UTF-8!\n";
        return false;
    }
}

(Note that the conversion code, as well as the deletable_facet helper, is taken more or less verbatim from cppreference.)

Is that a sensible approach? Are there better ways that do not rely on external libraries?

Lukas Barth
  • I'd use the approach of reading the file (or a portion thereof) and check if it's pure ASCII, check if it's valid UTF-8, check if it could be an 8-bit encoding (e.g., Win1252, MacRoman, ISO-8859, et cetera — which **exact** particular encoding I don't think is possible to discern reliably). And if important for your use case: UTF-16/BE-or-LE, UCS-4/BE-or-LE. Maybe the file is just non-textual **binary** data. – Eljay Jan 10 '23 at 13:59
  • related/dupe: https://stackoverflow.com/questions/8654857/how-to-check-whether-text-file-is-encoded-in-utf-8 – NathanOliver Jan 10 '23 at 14:01
  • It is pretty easy to write your own function which checks the allowed ranges of byte sequences, if you need speed and portability. Or you can find a ready OS implementation. – sklott Jan 10 '23 at 14:01
  • One thing you won't be able to tell is an ASCII file from a UTF-8 file. – NathanOliver Jan 10 '23 at 14:01
  • The Wikipedia page on UTF-8 explains, very nicely, how it works. It shouldn't be very complicated to simply take the UTF-8 specification, directly off that, and write a simple byte validator that verifies that the byte stream is a valid UTF-8 byte stream. Have you familiarized yourself with how UTF-8 encoding works, and its rules? – Sam Varshavchik Jan 10 '23 at 14:13
  • You should probably additionally check for zero bytes in the string if you want to detect multibyte encodings like UTF-16 - a zero byte is technically valid in UTF-8 but should not appear in text files. – dewaffled Jan 10 '23 at 14:16
  • @NathanOliver ASCII is fine - an ASCII-encoded file *is* UTF-8 encoded (at least in my concern). – Lukas Barth Jan 10 '23 at 15:01
  • Be aware that invalid UTF-8 can be [CESU-8](https://stackoverflow.com/a/63583222/4299358) which could be easily healed into valid UTF-8 in case you want a less strict approach and get as much data as possible (instead of an all-or-nothing approach). – AmigoJack Jan 10 '23 at 15:04
  • @SamVarshavchik I know how UTF-8 works and would probably be able to write a parser/verifier, however I was hoping not to have to do that myself. If you want to handle all cases correctly (BOM, zero bytes, …), this is not trivial. – Lukas Barth Jan 10 '23 at 15:05
  • There are very, very few magic buttons in C++ that only need to be located and pushed to make everything happen. This isn't one of them. – Sam Varshavchik Jan 10 '23 at 15:42
  • You will find the precise BNF definition of UTF-8 here: https://datatracker.ietf.org/doc/html/rfc3629 , page 4. – Michaël Roy Jan 13 '23 at 07:40

2 Answers


The rules for UTF-8 are much more stringent than for UTF-16, and are quite easy to follow. The code below basically does BNF parsing to check the validity of a string. If you plan to check on streams, remember that the longest UTF-8 sequence is 6 bytes long, so if an error appears less than 6 bytes before the end of a buffer, you may have a truncated symbol.

NOTE: the code below is backwards-compatible with RFC 2279, the precursor to the current standard (defined in RFC 3629). If any of the text you plan to check could have been generated by software made before 2004, then use this; if you need more stringent testing for RFC 3629 compliance, the rules can be modified quite easily (the sketch below shows the stricter byte ranges).
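
For reference, this is what the stricter RFC 3629 rules look like on their own. This sketch is mine, not part of the parser below (the name is_strict_utf8 is made up); it is a direct transcription of the RFC's ABNF, restricting lead bytes to C2..F4 and tightening the range of the first continuation byte after E0, ED, F0 and F4:

#include <cstddef>
#include <cstdint>
#include <string_view>

// Sketch: standalone RFC 3629 validation, independent of the parser below.
inline bool is_strict_utf8(std::string_view s) noexcept {
    size_t i = 0;
    while (i < s.size()) {
        uint8_t c = uint8_t(s[i]);
        size_t n = 0;                    // continuation bytes expected
        uint8_t lo = 0x80, hi = 0xBF;    // allowed range of the first continuation byte
        if      (c <= 0x7F) { ++i; continue; }        // ASCII
        else if (0xC2 <= c && c <= 0xDF) n = 1;
        else if (c == 0xE0) { n = 2; lo = 0xA0; }     // forbid overlong 3-byte forms
        else if (0xE1 <= c && c <= 0xEC) n = 2;
        else if (c == 0xED) { n = 2; hi = 0x9F; }     // forbid UTF-16 surrogates
        else if (0xEE <= c && c <= 0xEF) n = 2;
        else if (c == 0xF0) { n = 3; lo = 0x90; }     // forbid overlong 4-byte forms
        else if (0xF1 <= c && c <= 0xF3) n = 3;
        else if (c == 0xF4) { n = 3; hi = 0x8F; }     // forbid code points above U+10FFFF
        else return false;           // C0, C1 and F5..FF are never valid lead bytes
        if (i + n >= s.size()) return false;          // truncated sequence
        uint8_t c1 = uint8_t(s[i + 1]);
        if (c1 < lo || hi < c1) return false;
        for (size_t k = 2; k <= n; ++k) {
            uint8_t ck = uint8_t(s[i + k]);
            if (ck < 0x80 || 0xBF < ck) return false;
        }
        i += n + 1;
    }
    return true;
}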

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>

size_t find_first_not_utf8(std::string_view s) {
    // ----------------------------------------------------
    // returns true if fn(c) returns true for the first n characters c of
    // string src. The string_view is updated to exclude the first n characters
    // if a match is found, left untouched otherwise.
    const auto match_n = [](std::string_view& src, size_t n, auto&& fn) noexcept {
        if (src.length() < n) return false;

        if (!std::all_of(src.begin(), src.begin() + n, fn))
            return false;

        src.remove_prefix(n);
        return true;
    };

    // ----------------------------------------------------
    // returns true if fn(c) returns true for the first character c of
    // string src. The string_view is updated to exclude the first character
    // if a match is found, left untouched otherwise.
    const auto match_1 = [](std::string_view& src, auto&& fn) noexcept {
        if (src.empty()) return false;

        if (!fn(src.front()))
            return false;

        src.remove_prefix(1);
        return true;
    };

    // ----------------------------------------------------
    // returns true if the first character sequence of src is a valid non-ASCII
    // UTF-8 sequence.
    // The string_view is updated to exclude the first UTF-8 sequence if a
    // non-ASCII sequence is found, left untouched otherwise.

    const auto utf8_non_ascii = [&](std::string_view& src) noexcept {
        const auto SRC = src;

        auto UTF8_CONT = [](uint8_t c) noexcept {
            return 0x80 <= c && c <= 0xBF;
        };

        if (match_1(src, [](uint8_t c) { return 0xC0 <= c && c <= 0xDF; }) &&
            match_1(src, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xE0 <= c && c <= 0xEF; }) &&
            match_n(src, 2, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xF0 <= c && c <= 0xF7; }) &&
            match_n(src, 3, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xF8 <= c && c <= 0xFB; }) &&
            match_n(src, 4, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xFC <= c && c <= 0xFD; }) &&
            match_n(src, 5, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        return false;
    };

    // ----------------------------------------------------
    // returns true if the first symbol of string src is a valid UTF-8
    // character: printable ASCII, whitespace, or a non-ASCII sequence
    // (other control characters are rejected).
    // The string_view is updated to exclude that sequence if a valid
    // symbol is found, left untouched otherwise.

    const auto utf8_char = [&](std::string_view& src) noexcept {
        auto rule = [](uint8_t c) noexcept -> bool {
            return (0x21 <= c && c <= 0x7E) || std::isspace(c);
        };

        const auto SRC = src;

        if (match_1(src, rule)) return true;
        src = SRC;
        return utf8_non_ascii(src);
    };

    // ----------------------------------------------------

    const auto S = s;

    while (!s.empty() && utf8_char(s)) {
    }

    if (s.empty()) return std::string_view::npos;

    return size_t(s.data() - S.data());
}

void test(const std::string& s) {
    std::cout << "testing \"" << s << "\": ";

    auto pos = find_first_not_utf8(s);

    if (pos < s.length())
        std::cout << "failed at offset " << pos << "\n";
    else
        std::cout << "OK\n";
}

auto greek = "Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι\n ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς ";
auto ethiopian = "ሰማይ አይታረስ ንጉሥ አይከሰስ።";

const char* errors[] = {
    "2-byte sequence with last byte missing (U+0000):   \xC0xyz",
    "3-byte sequence with last byte missing (U+0000):   \xe0\x81xyz",
    "4-byte sequence with last byte missing (U+0000):   \xF0\x83\x80xyz",
    "5-byte sequence with last byte missing (U+0000):   \xF8\x81\x82\x83xyz",
    "6-byte sequence with last byte missing (U+0000):   \xFD\x81\x82\x83\x84xyz"
};

int main() {
    test("hello world");
    test(greek);
    test(ethiopian);

    for (auto& e : errors) test(e);
    return 0;
}

You'll be able to play with the code here: https://godbolt.org/z/q6rbveEeY
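
If you check a stream chunk by chunk, the truncation caveat from the top of this answer applies at every buffer boundary. A possible wrapper, as a sketch only (it assumes the find_first_not_utf8 above; the name stream_is_utf8 and the 64 KiB buffer are my own choices):

#include <istream>
#include <string>
#include <string_view>

// Sketch: chunked validation built on find_first_not_utf8() above.
bool stream_is_utf8(std::istream& in)
{
    std::string carry;                // possibly-split sequence from the previous chunk
    std::string buf(1 << 16, '\0');
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        std::string chunk = carry + buf.substr(0, size_t(in.gcount()));
        size_t pos = find_first_not_utf8(chunk);
        if (pos == std::string_view::npos) { carry.clear(); continue; }
        // an error more than 6 bytes before the end cannot be a truncated symbol
        if (chunk.size() - pos >= 6) return false;
        carry = chunk.substr(pos);    // retry once more data has arrived
    }
    return carry.empty();             // leftover bytes at EOF mean a truncated sequence
}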

Michaël Roy
  • Er, the longest valid UTF-8 is _four_ octets, not six. You only get six from encoding things like surrogate pairs. – Dúthomhas Jan 14 '23 at 17:14
  • It is 6, according to all the documentation I've read so far. The test for a 6-byte sequence error comes from this popular UTF-8 stress-test file: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt – Michaël Roy Jan 14 '23 at 17:23
  • [The UTF-8 Standard](https://www.ietf.org/rfc/rfc3629.txt) says four. (edit: I agree with accepting CESU-8. But your assertion that valid UTF-8 sequences may be six bytes is not correct.) – Dúthomhas Jan 14 '23 at 17:37
  • I understand what you mean; the code above is backwards compatible with the original UTF-8 definition in RFC 2279 (the precursor to RFC 3629). – Michaël Roy Jan 14 '23 at 19:01
  • The original had problems with things like UTF-8-encoded surrogate pairs (which gives us the 6-byte sequences). That has always been an abomination, and is one of the main reasons the _current_ standard was updated. I think your answer should reflect the current standard, not something from the past, or at least not without making it clear that it is not currently standard. – Dúthomhas Jan 14 '23 at 19:38
  • It depends on whether the text you are checking could be from before 2004, which is not that long ago. I'll add a note. – Michaël Roy Jan 14 '23 at 19:41

Recommendation: Just use ICU

It exists (that is, it is already installed and in use) on every major modern OS you care about[citation needed]. It’s there. Use it.

The good news is, for what you want to do, you don’t even have to link with ICU ⟶ No extra magic compilation flags necessary!

This should compile with anything (modern) you’ve got:

#include <string>

#ifdef _WIN32
  #include <icu.h>
#else
  #include <unicode/utf8.h>
#endif

bool is_utf8( const char * s, size_t n )
{
  if (!n) return true;  // empty files are UTF-8 encoded
  UChar32 c = 0;
  int32_t i = 0;
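  // scan one code point at a time; on a malformed sequence the macro stores
  // the substitute value (0 here), which ends the loop. Note that a genuine
  // NUL byte in the data also ends the loop and fails the check.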
  do { U8_INTERNAL_NEXT_OR_SUB( s, i, (int32_t)n, c, 0 ); }
  while (c and U_IS_UNICODE_CHAR( c ) and (i < (int32_t)n));
  return !!c;
}

bool is_utf8( const std::string & s )
{
  return is_utf8( s.c_str(), s.size() );
}

If you are using MSVC’s C++17 or earlier, you’ll want to add an #include <ciso646> above that.

Example program:

#include <fstream>
#include <iostream>
#include <sstream>

auto file_to_string( const std::string & filename )
{
  std::ifstream f( filename, std::ios::binary );
  std::ostringstream ss;
  ss << f.rdbuf();
  return ss.str();
}

auto ask( const std::string & prompt )
{
  std::cout << prompt;
  std::string s;
  getline( std::cin, s );
  return s;
}

int main( int, char ** argv )
{
  std::string filename = argv[1] ? argv[1] : ask( "filename? " );
  std::cout << (is_utf8( file_to_string( filename ) )
    ? "UTF-8 encoded\n"
    : "Unknown encoding\n");
}

Tested with (Windows) MSVC, Clang/LLVM, MinGW-w64, TDM and (Linux) GCC, Clang over a whole bunch of random UTF-8 test files (valid and invalid) that I won’t offer you here.

  • cl /EHsc /W4 /Ox /std:c++17 isutf8.cpp
  • clang++ -Wall -Wextra -Werror -pedantic-errors -O3 -std=c++17 isutf8.cpp

(My copy of TDM is a little out of date. I also had to tell it where to find the ICU headers.)


Update

So, there is some interesting commentary about my claim to ICU’s ubiquity.

That's not how answers work. You're the one making the claim; you are therefore the one who must provide evidence for it.

Ah, but I am not making an extraordinary claim. But lest I get caught in a Shifting Burden of Proof circle, here’s my end of the easily-discovered stick. Clicky-clicky!

What this boils down to is that if you have a shiny window manager or <insert internet browser here> or basically any modern i18n text processing software program on your OS, there is a very high probability that it uses ICU (and things like HarfBuzz-icu).

My pleasure.

I find no pleasure here. Online compilers aren’t meant to compile anything beyond basic, text-I/O, single-file, simple programs. The fact that Godbolt’s online compiler can actually pull an include file from the web is, AFAIK, unique.

But while indeed cool, its limitations are acknowledged here — the ultimate consequence being that it would be absolutely impossible to compile something against ICU using godbolt.org or any other online compiler.

Which leads to a final note relevant to the code sample I gave above:

You need to properly configure your tools
if you expect them to work for you

For the above code snippet you must have ICU headers installed on your development machine. That is a given and should not surprise anyone. Just because your system has ICU libraries installed, and the software on it uses them, does not mean your compiler can automagically compile against the library.

  • For Windows you automatically get <icu.h> with the most recent Windows SDKs (for some years now, and <icucommon.h> and <icui18n.h> before that).
  • For *nixen you will have to do something like sudo apt-get install libicu-dev or whatever is appropriate for your OS package manager.

I am glad I had to look into this, at least, because I just remembered that I have my development environments a little better initialized than the basic defaults, and was inadvertently using my local copy of ICU’s headers instead of Windows’. So I fixed it in the code above with that wonky #ifdef.

Must roll your own?

This is not difficult, and many people have different solutions to this, but there is a tricky consideration: a valid UTF-8 file should have valid Unicode code-points — which is more than just basic UTF-8/CESU-8/Modified-UTF-8/etc form validation.

If all you care about is that the data is encoded using the UTF-8 scheme, then Michaël Roy’s solution above looks fine to my eyeballs.

Personally, I think you should be a bit more strict, which properly requires you to actually decode the UTF-8 data to Unicode code points and verify them as well.

This requires very little more effort, and as it is something that your reader needs to do to access the data at some point anyway, why not just do it once and get it over with?

Still, here is just the check:

#include <algorithm>
#include <ciso646>
#include <string>

namespace utf8
{
  bool is_unicode( char32_t c )
  {
    return ((c & 0xFFFE) != 0xFFFE)
      and   (c < 0x10FFFF);
  }

  // NB: the masks must cover the full char32_t, or code points above U+FFFF
  // (e.g. U+1D800) would be mistaken for surrogates
  bool is_surrogate     ( char32_t c ) { return (c & 0xFFFFF800) == 0xD800; }
  bool is_high_surrogate( char32_t c ) { return (c & 0xFFFFFC00) == 0xD800; }
  bool is_low_surrogate ( char32_t c ) { return (c & 0xFFFFFC00) == 0xDC00; }

  char32_t decode( const char * & first, const char * last, char32_t invalid = 0xFFFD )
  {
    // Empty sequence
    if (first == last) return invalid;

    // decode byte length of encoded code point (1..4) from first octet
    static const unsigned char nbytes[] =
    {
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };
    unsigned char k, n = k = nbytes[(unsigned char)*first >> 3];
    if (!n) { ++first; return invalid; }

    // extract bits from lead octet
    static const unsigned char masks[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    char32_t c = (unsigned char)*first++ & masks[n];

    // extract bits from remaining octets
    while (--n and (first != last) and ((signed char)*first < -0x40))
      c = (c << 6) | ((unsigned char)*first++ & 0x3F);

    // the possibility of an incomplete sequence (continuing with future
    // input at a later invocation) is ignored here.
    if (n != 0) return invalid;

    // overlong-encoded sequences are not valid
    if (k != 1 + (c > 0x7F) + (c > 0x7FF) + (c > 0xFFFF)) return invalid;

    // the end
    return is_unicode( c ) and !is_surrogate( c ) ? c : invalid;
  }

  bool is_utf8( const std::string & s )
  {
    return []( const char * first, const char * last )
    {
      if (first != last)
      {
        // ignore UTF-8 BOM
        if ((last-first) > 2)
          if (    ((unsigned char)first[0] == 0xEF)
              and ((unsigned char)first[1] == 0xBB)
              and ((unsigned char)first[2] == 0xBF) )
            first += 3;

        while (first != last)
          if (decode( first, last, 0x10FFFF ) == 0x10FFFF)
            return false;
      }
      return true;
    }
    ( s.c_str(), s.c_str()+s.size() );
  }

} // namespace utf8

using utf8::is_utf8;

The very same example program as above can be used to play with the new code. It behaves exactly the same as the ICU code.

Variants

I have ignored some common UTF-8 variants. In particular:

  • CESU-8 is a variation that happens when software working over UTF-16 forgets that surrogate pairs exist and encodes them as two adjacent UTF-8 code sequences.

  • Modified UTF-8 is a special encoding where '\0' is expressly encoded with the overlong sequence C0 80, which makes nul-terminated strings continue to work. Strict UTF-8 requires encoders to use as few octets as possible, but we could accept this one specific overlong sequence anyway.

    We will not, however, accept 5- or 6-octet sequences. The current Unicode UTF-8 standard, which is twenty years old now (2003), emphatically forbids them.

    Modified UTF-8 implies CESU-8.

  • WTF-8 happens, too. WTF-8 implies Modified UTF-8.

  • PEP 383 can go die in a lonely corner.

You may wish to consider these as valid. While the Unicode people think that those things shouldn’t appear in files you may have access to, they do recognize that it is possible and not necessarily wrong. It wouldn’t take much to modify the code to enable checks for each of those; a sketch for the Modified UTF-8 case follows. Let us know if that is what you are looking to do.
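
For example, accepting Modified UTF-8’s overlong NUL only takes a small pre-check in front of decode(). A sketch (the helper and its name are mine, not part of the code above):

#include <ciso646>

// Sketch: recognize Modified UTF-8’s one allowed overlong form, U+0000
// encoded as C0 80. A caller that wants to accept Modified UTF-8 would test
// this before decode(), consume the two bytes as U+0000, and leave every
// other overlong sequence to be rejected by decode() itself.
inline bool is_modified_utf8_nul( const char * first, const char * last )
{
  return (last - first) >= 2
    and ((unsigned char)first[0] == 0xC0)
    and ((unsigned char)first[1] == 0x80);
}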


Simple, quick-n-dirty solutions look cute on the internet, but messing with text has corner cases and considerations that people on discussion forums like to forget, which is the main reason I do not recommend doing this yourself. Use ICU. It is a highly-optimized, industry-proven library designed by people who eat and breathe this stuff. Everyone else is just hoping they get it right while the software that actually needs it just uses ICU.

Even the C++ Standard Library got it wrong, which is why the whole thing was deprecated. (std::codecvt_utf8 may or may not accept any of CESU-8, Modified UTF-8, and WTF-8, and its behavior is not consistent across platforms in that regard. That and its design mean you must make more than a couple of passes over your data to verify it, in contrast to the single-pass-cum-verify that ICU [and my code] does. Maybe not much of an issue in today’s highly-optimized memory pipelines, but still, I’m an old fart about this.)

Dúthomhas
  • "on every major modern OS you care about" Citation needed. – n. m. could be an AI Jan 14 '23 at 19:57
  • Prove me wrong. – Dúthomhas Jan 14 '23 at 20:00
  • @Dúthomhas: "*Prove me wrong.*" That's not how answers work. You're the one making the claim; you are therefore the one who must provide evidence for it. – Nicol Bolas Jan 14 '23 at 20:19
  • [My pleasure](https://godbolt.org/z/Pfz7sGY4W). – n. m. could be an AI Jan 14 '23 at 20:45
  • "It will probably just work on the system" is not a feasible approach for a software that is deployed to a multitude of customers' systems, which are not under my control. Thanks for the ICU hint (I know ICU of course), that's why I stated in my original question that I want a solution without any external libraries. With external libraries, this becomes easy indeed. – Lukas Barth Jan 15 '23 at 17:47
  • ICU perfectly targets the OS & compiler systems you list. Also, what I said was that you can probably **compile** on your system; I did _not_ say "it will probably just work on [every] system". It _will_ work on major systems (unless the user has gone out of their way to prevent it — it is possible to install some Linux distros with bare-bones nothing-but-a-shell-prompt-and-basic-OS-services. Your customers would have to be in some pretty specialized corners to need handling a UTF-8 file on a system without ICU, though). And, frankly, managing package dependencies is basic to distribution. – Dúthomhas Jan 15 '23 at 21:38
  • And, friendly reminder: forums like this are for everyone. You are not the only recipient for this answer. You are totally free to accept or reject it for any reason you like. I do hope you learn something useful, though, which is: don’t try to reinvent the wheel for a core function involving text processing. Again, I do not know your needs or requirements outside of the question. Do what’s best for you and your customers. ;-) – Dúthomhas Jan 15 '23 at 21:42