Recommendation: Just use ICU
It exists (that is, it is already installed and in use) on every major modern OS you care about[citation needed]. It’s there. Use it.
The good news is that, for what you want to do, you don’t even have to link with ICU, so no extra magic compilation flags are necessary!
This should compile with anything (modern) you’ve got:
#include <string>

#ifdef _WIN32
#include <icu.h>
#else
#include <unicode/utf8.h>
#endif

bool is_utf8( const char * s, size_t n )
{
  if (!n) return true;  // empty files are UTF-8 encoded

  UChar32 c = 0;
  int32_t i = 0;
  // the macro substitutes 0 for any ill-formed sequence, which stops the loop
  do { U8_INTERNAL_NEXT_OR_SUB( s, i, (int32_t)n, c, 0 ); }
  while (c and U_IS_UNICODE_CHAR( c ) and (i < (int32_t)n));
  return !!c;
}

bool is_utf8( const std::string & s )
{
  return is_utf8( s.c_str(), s.size() );
}
If you are using MSVC’s C++17 or earlier, you’ll want to add an #include <ciso646> above that.
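By the way, if relying on an internal ICU macro makes you itchy, the documented U8_NEXT macro does the same job. Here is a rough equivalent, as a sketch only (is_utf8_public is just my name for it; this is not the exact code I tested above):

bool is_utf8_public( const char * s, size_t n )
{
  UChar32 c = 0;
  int32_t i = 0;
  while (i < (int32_t)n)
  {
    U8_NEXT( s, i, (int32_t)n, c );  // c goes negative on an ill-formed sequence
    if ((c < 0) or !U_IS_UNICODE_CHAR( c )) return false;
  }
  return true;
}

One behavioral difference worth knowing: this version accepts embedded NUL octets, which are perfectly well-formed UTF-8, while the version above treats a NUL the same as an error (its substitution sentinel is 0).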
Example program:
#include <fstream>
#include <iostream>
#include <sstream>

auto file_to_string( const std::string & filename )
{
  std::ifstream f( filename, std::ios::binary );
  std::ostringstream ss;
  ss << f.rdbuf();
  return ss.str();
}

auto ask( const std::string & prompt )
{
  std::cout << prompt;
  std::string s;
  getline( std::cin, s );
  return s;
}

int main( int, char ** argv )
{
  // argv[argc] is guaranteed to be null, so testing argv[1] is safe
  std::string filename = argv[1] ? argv[1] : ask( "filename? " );
  std::cout << (is_utf8( file_to_string( filename ) )
    ? "UTF-8 encoded\n"
    : "Unknown encoding\n");
}
Tested with (Windows) MSVC, Clang/LLVM, MinGW-w64, and TDM, and with (Linux) GCC and Clang, over a whole bunch of random UTF-8 test files (valid and invalid) that I won’t offer you here.
cl /EHsc /W4 /Ox /std:c++17 isutf8.cpp
clang++ -Wall -Wextra -Werror -pedantic-errors -O3 -std=c++17 isutf8.cpp
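g++ spells those flags the same way Clang does, so the Linux build is the unsurprising:

g++ -Wall -Wextra -Werror -pedantic-errors -O3 -std=c++17 isutf8.cpp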
(My copy of TDM is a little out of date. I also had to tell it where to find the ICU headers.)
Update
So, there is some interesting commentary about my claim to ICU’s ubiquity.
> That's not how answers work. You're the one making the claim; you are therefore the one who must provide evidence for it.
Ah, but I am not making an extraordinary claim. But lest I get caught in a Shifting Burden of Proof circle, here’s my end of the easily-discovered stick. Clicky-clicky!
What this boils down to is that if you have a shiny window manager or <insert internet browser here> or basically any modern i18n text processing software program on your OS, there is a very high probability that it uses ICU (and things like HarfBuzz-icu).
> My pleasure.
I find no pleasure here. Online compilers aren’t meant to compile anything beyond basic, text-I/O, single-file, simple programs. The fact that Godbolt’s online compiler can actually pull an include file from the web is, AFAIK, unique.
But while indeed cool, its limitations are acknowledged here — the ultimate consequence being that it would be absolutely impossible to compile something against ICU using godbolt.org or any other online compiler.
Which leads to a final note relevant to the code sample I gave above:
You need to properly configure your tools if you expect them to work for you
For the above code snippet you must have ICU headers installed on your development machine. That is a given and should not surprise anyone. Just because your system has ICU libraries installed, and the software on it uses them, does not mean your compiler can automagically compile against the library.
- For Windows you do automatically get the <icu.h> with the most recent WDKs (for some years now, and <icucommon.h> and <icui18n.h> before that).
- For *nixen you will have to do something like sudo apt-get install libicu-dev, or whatever is appropriate for your OS package manager.
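If you are not sure the headers landed somewhere your compiler can see, pkg-config can tell you; most distributions ship ICU’s icu-uc module with the dev package (check yours if not):

pkg-config --modversion icu-uc    # which ICU you actually have
pkg-config --cflags icu-uc        # include flags (often empty; /usr/include is the default)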
I am glad I had to look into this, at least, because I just remembered that I have my development environments a little better initialized than the basic defaults, and was inadvertently using my local copy of ICU’s headers instead of Windows’. So I fixed it in the code above with that wonky #ifdef.
Must roll your own?
This is not difficult, and many people have different solutions to this, but there is a tricky consideration: a valid UTF-8 file should have valid Unicode code-points — which is more than just basic UTF-8/CESU-8/Modified-UTF-8/etc form validation.
If all you care about is that the data is encoded using the UTF-8 scheme, then Michaël Roy’s solution above looks fine to my eyeballs.
Personally, I think you should be a bit more strict, which properly requires you to actually decode the UTF-8 data to Unicode code points and verify them as well.
This requires very little more effort, and as it is something that your reader needs to do to access the data at some point anyway, why not just do it once and get it over with?
Still, here is just the check:
#include <algorithm>
#include <ciso646>
#include <string>

namespace utf8
{
  bool is_unicode( char32_t c )
  {
    return ((c & 0xFFFE) != 0xFFFE)   // no *FFFE/*FFFF noncharacters
       and (c < 0x10FFFF);
  }

  bool is_surrogate     ( char32_t c ) { return (c & 0xF800) == 0xD800; }
  bool is_high_surrogate( char32_t c ) { return (c & 0xFC00) == 0xD800; }
  bool is_low_surrogate ( char32_t c ) { return (c & 0xFC00) == 0xDC00; }

  char32_t decode( const char * & first, const char * last, char32_t invalid = 0xFFFD )
  {
    // Empty sequence
    if (first == last) return invalid;

    // decode byte length of encoded code point (1..4) from first octet
    static const unsigned char nbytes[] =
    {
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };
    unsigned char k, n = k = nbytes[(unsigned char)*first >> 3];
    if (!n) { ++first; return invalid; }

    // extract bits from lead octet
    static const unsigned char masks[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    char32_t c = (unsigned char)*first++ & masks[n];

    // extract bits from remaining octets
    while (--n and (first != last) and ((signed char)*first < -0x40))
      c = (c << 6) | ((unsigned char)*first++ & 0x3F);

    // the possibility of an incomplete sequence (continuing with future
    // input at a later invocation) is ignored here
    if (n != 0) return invalid;

    // overlong-encoded sequences are not valid
    if (k != 1 + (c > 0x7F) + (c > 0x7FF) + (c > 0xFFFF)) return invalid;

    // the end
    return is_unicode( c ) and !is_surrogate( c ) ? c : invalid;
  }

  bool is_utf8( const std::string & s )
  {
    return []( const char * first, const char * last )
    {
      if (first != last)
      {
        // ignore UTF-8 BOM (the octets EF BB BF)
        if ((last-first) > 2)
        if ( ((unsigned char)first[0] == 0xEF)
        and  ((unsigned char)first[1] == 0xBB)
        and  ((unsigned char)first[2] == 0xBF) )
          first += 3;

        // 0x10FFFF never comes back from a successful decode,
        // making it a safe sentinel for “invalid”
        while (first != last)
          if (decode( first, last, 0x10FFFF ) == 0x10FFFF)
            return false;
      }
      return true;
    }
    ( s.c_str(), s.c_str()+s.size() );
  }

} // namespace utf8
using utf8::is_utf8;
The very same example program as above can be used to play with the new code. It behaves exactly the same as the ICU code.
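If you would rather not hunt down test files, a handful of hand-rolled corner cases exercises the interesting paths. This is just a smoke test of my own devising (replace the interactive main with it, in the same file as the checker above):

#include <cassert>

int main()
{
  assert(  is_utf8( "" ) );                         // empty input
  assert(  is_utf8( "plain ASCII" ) );
  assert(  is_utf8( "\xC3\xA9" ) );                 // U+00E9, two octets
  assert(  is_utf8( "\xF0\x9F\x98\x80" ) );         // U+1F600, four octets
  assert( !is_utf8( "\xC0\x80" ) );                 // overlong NUL (Modified UTF-8)
  assert( !is_utf8( "\xED\xA0\xBD\xED\xB8\x80" ) ); // CESU-8 surrogate pair
  assert( !is_utf8( "\xC3" ) );                     // truncated sequence
  assert( !is_utf8( "\xFF" ) );                     // invalid lead octet
}

(Compile without -DNDEBUG or the asserts evaporate.)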
Variants
I have ignored some common UTF-8 variants. In particular:
- CESU-8 is a variation that happens when software working over UTF-16 forgets that surrogate pairs exist and encodes them as two adjacent UTF-8 code sequences.
- Modified UTF-8 is a special encoding where '\0' is expressly encoded with the overlong sequence C0 80, which makes nul-terminated strings continue to work. Strict UTF-8 requires encoders to use as few octets as possible, but we could accept this one specific overlong sequence anyway.
- We will not, however, accept 5- or 6-octet sequences. The current Unicode UTF-8 standard, which is twenty years old now (2003), emphatically forbids them.
- Modified UTF-8 implies CESU-8.
- WTF-8 happens, too. WTF-8 implies Modified UTF-8.
- PEP 383 can go die in a lonely corner.
You may wish to consider these as valid. While the Unicode people think that those things shouldn’t appear in files you may have access to, they do recognize that it is possible and not necessarily wrong. It wouldn’t take much to modify the code to enable checks for each of those; a sketch follows below. Let us know if that is what you are looking to do.
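For example, here is roughly what accepting Modified UTF-8’s NUL would look like. decode_modified is my own sketch, meant to sit inside the utf8 namespace next to decode():

char32_t decode_modified( const char * & first, const char * last,
                          char32_t invalid = 0xFFFD )
{
  // accept the one specific overlong sequence C0 80 --> U+0000
  if ( ((last - first) >= 2)
  and  ((unsigned char)first[0] == 0xC0)
  and  ((unsigned char)first[1] == 0x80) )
  {
    first += 2;
    return 0;
  }
  return decode( first, last, invalid );  // everything else stays strict
}

Full Modified UTF-8 would also have to accept CESU-8’s surrogate pairs (remember, Modified UTF-8 implies CESU-8), which takes a little more bookkeeping: a high surrogate must pair with the low surrogate immediately following it.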
Simple, quick-n-dirty solutions look cute on the internet, but messing with text has corner cases and considerations that people on discussion forums like to forget, which is the main reason I do not recommend doing this yourself. Use ICU. It is a highly-optimized, industry-proven library designed by people who eat and breathe this stuff. Everyone else is just hoping they get it right, while the software that actually needs it just uses ICU.
Even the C++ Standard Library got it wrong, which is why the whole thing was deprecated. (std::codecvt_utf8 may or may not accept any of CESU-8, Modified UTF-8, and WTF-8, and its behavior is not consistent across platforms in that regard. That and its design mean you must make more than a couple of passes over your data to verify it, in contrast to the single-pass-cum-verify that ICU [and my code] does. Maybe not much of an issue in today’s highly-optimized memory pipelines, but still, I’m an old fart about this.)