How to check for invalid UTF-8 characters?

Question

There are lots of supported hexadecimal (UTF-8) entities out there, with decimal values from 0 to 10175. Is there a fast way to check whether a value contained in a variable is one of these supported hexadecimal (UTF-8) entities?

For example:

var something = "some string value";
char[] validCharacter = new char[] { /* all 10175 UTF-8 hexadecimal characters */ };
if (validCharacter.Contains(something))
{
    // do something
}

How can I do this check the fastest way possible?

Unclear what you are asking. `a` is utf-16, not utf-8. What do you mean with "invalid"? Unpaired high/low surrogates? Unassigned unicode codepoints? — xanatos, Jun 08 '18 at 12:59
@xanatos check the question now, `something` is just a random value and I want to check whether that value is one of the valid utf-8 codes or not.. — CD DelRio, Jun 08 '18 at 13:03
You are repeating the same words, but your words don't have a unique meaning. 😵 (U+1F635) == `'\uD83D'+'\uDE35'` (so it is 2x `char` together), but alone both `'\uD83D'` and `'\uDE35'` (that are called high and low surrogates) are illegal. `'\uFFF0'` is **at this time** undefined in the Unicode standard (there is no character defined for that codepoint). We don't know if in a year it will still be undefined. Two different "illegal". — xanatos, Jun 08 '18 at 13:07
Problem 1 (unpaired surrogates) can be mechanically detected (it is based on the value of a char). Problem 2 (which characters are defined in Unicode) requires big tables of Unicode characters. The ones in .NET are old and don't contain newer emojis (and other rare scripts) — xanatos, Jun 08 '18 at 13:08
Do you want to know whether some integer value represents a valid unicode codepoint, or whether some byte could be used in a UTF-8 encoding, or (maybe) whether the current font will show something useful on-screen? — Hans Keﬆing, Jun 08 '18 at 13:08
@xanatos I want to check if the characters that are defined at this time is in the match, I'm not concerned about what will be added in the Unicode standard in future — CD DelRio, Jun 08 '18 at 13:11
@HansKeﬆing I want to know whether some string value which I extract from a document, represent a valid Unicode codepoint or not — CD DelRio, Jun 08 '18 at 13:12
Would this help? https://www.nuget.org/packages/UnicodeInformation/ — Hans Keﬆing, Jun 08 '18 at 13:15
And background info https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ — Hans Keﬆing, Jun 08 '18 at 13:16
@xanatos I think you can fix your answer by just testing the next character as well. Oh, Then you also have to test whether it's a low or high surrogate,... — ispiro, Jun 08 '18 at 13:36

xanatos · Answer 1 · 2018-06-08T13:33:27.227

This should return what you asked. It checks for both the absence of unpaired high/low surrogates and the absence of undefined codepoints (where "defined" depends on the Unicode tables present in the version of .NET you are using and on the version of the operating system).

// Requires: using System.Globalization; (for UnicodeCategory)
static bool IsLegalUnicode(string str)
{
    for (int i = 0; i < str.Length; i++)
    {
        var uc = char.GetUnicodeCategory(str, i);

        if (uc == UnicodeCategory.Surrogate)
        {
            // Unpaired surrogate, like "\uD83D\uDE35"[0] + "A" or "\uD83D\uDE35"[1] + "A"
            return false;
        }
        else if (uc == UnicodeCategory.OtherNotAssigned)
        {
            // Unassigned codepoint, e.g. \uFF00 (\U00030000 was also unassigned before Unicode 13.0)
            return false;
        }

        // Correct high-low surrogate, we must skip the low surrogate
        // (it is correct because otherwise it would have been a 
        // UnicodeCategory.Surrogate)
        if (char.IsHighSurrogate(str, i))
        {
            i++;
        }
    }

    return true;
}

Note that Unicode is in continuous expansion. UTF-8 is able to encode every Unicode codepoint except the surrogates (U+D800 to U+DFFF), including the ones that are not assigned at this time.

Some examples:

var test1 = IsLegalUnicode("abcdeàèéìòù"); // true
var test2 = IsLegalUnicode("⭐ White Medium Star"); // true, Unicode 5.1
var test3 = IsLegalUnicode("😁 Beaming Face With Smiling Eyes"); // true, Unicode 6.0
var test4 = IsLegalUnicode("🙂 Slightly Smiling Face"); // true, Unicode 7.0
var test5 = IsLegalUnicode("🤗 Hugging Face"); // true, Unicode 8.0
var test6 = IsLegalUnicode("🤣 Rolling on the Floor Laughing"); // false, Unicode 9.0 (2016)

var test7 = IsLegalUnicode("🤩 Star-Struck"); // false, Unicode 10.0 (2017)

var test8 = IsLegalUnicode("\uFF00"); // false, unassigned BMP codepoint

var test9 = IsLegalUnicode("\uD83D\uDE35"[0] + "X"); // false, unpaired high surrogate
var test10 = IsLegalUnicode("\uD83D\uDE35"[1] + "X"); // false, unpaired low surrogate

Note that you can encode in UTF-8 even well-formed "unknown" Unicode codepoints, like the Star-Struck.

Results taken with .NET 4.7.2 under Windows 10.
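
To illustrate the note about "unknown" codepoints, here is a minimal sketch (the class and method names are hypothetical, not part of the answer above). It encodes Star-Struck, U+1F929, with a strict `UTF8Encoding`. The encoder only requires well-formed UTF-16 input, so it should succeed whether or not the runtime's Unicode tables know the character:

```csharp
using System;
using System.Text;

static class UnknownCodepointDemo
{
    // Hypothetical helper: encodes one codepoint with a UTF8Encoding that
    // throws on malformed input instead of silently substituting U+FFFD.
    public static byte[] EncodeStrict(int codePoint)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        // Above U+FFFF this yields two chars (a well-formed surrogate pair).
        string text = char.ConvertFromUtf32(codePoint);
        return strictUtf8.GetBytes(text);
    }

    static void Main()
    {
        byte[] utf8 = EncodeStrict(0x1F929); // Star-Struck, Unicode 10.0
        Console.WriteLine(BitConverter.ToString(utf8)); // F0-9F-A4-A9
    }
}
```

Whether `IsLegalUnicode` accepts the same character is a separate question, because that depends on the category tables rather than on the encoding.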

What I actually want is this: if the value passed to your `IsLegalUnicode` method contains more than one character, it should be false automatically, and if the value is a single character then it should first check whether it is a number [0 to 9], an alphabetical character [a to z] or punctuation [.,;: etc.], and only if it is none of them should the check really run. I hope I made it a bit clearer: normal alphabetical characters, numbers and punctuation are excluded from the checking — CD DelRio, Jun 08 '18 at 13:55
@xanatos +1. I didn't know that `char.GetUnicodeCategory` works differently than `myString[i]` and takes both parts in case of a surrogate pair. — ispiro, Jun 08 '18 at 15:19

ispiro · Answer 2 · 2021-09-09T20:25:12.500

UTF8Encoding.GetString(byteArray) will throw an ArgumentException if error detection is enabled.

Source: https://msdn.microsoft.com/en-us/library/kzb9f993(v=vs.110).aspx
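
As a minimal sketch of that byte-level check (the class and method names are hypothetical): it assumes a `UTF8Encoding` built with the `UTF8Encoding(Boolean, Boolean)` constructor and `throwOnInvalidBytes` set to `true`. With that setting the decoder throws a `DecoderFallbackException`, which derives from `ArgumentException`:

```csharp
using System;
using System.Text;

static class Utf8ByteCheck
{
    // No BOM on output, and throw on invalid byte sequences.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: true when the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // "é" encoded as UTF-8: lead byte plus one continuation byte.
        Console.WriteLine(IsValidUtf8(new byte[] { 0xC3, 0xA9 })); // True

        // Lead byte followed by '(' which is not a continuation byte.
        Console.WriteLine(IsValidUtf8(new byte[] { 0xC3, 0x28 })); // False
    }
}
```

Decoding the whole array just to validate it allocates a string. If that matters, `GetCharCount` on the same strict encoding should be usable in the same way without materializing the result.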

But if you're testing something that is already a string, then as far as I know it will almost always be valid UTF-8 (see below). As far as I know, all C# strings are encoded in UTF-16, which is an encoding for all Unicode characters. UTF-8 is just a different encoding for the same set, i.e. for all of the Unicode characters.

(This might exclude some Unicode characters which are new etc., but those will also not be in UTF-16, so that won't matter here.)

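As someone has commented, there might be "halves" of UTF-16 characters (unpaired surrogates) that are valid strings but can't be converted to valid UTF-8. To verify, you can call GetBytes() on the string using a UTF8Encoding that has error detection enabled; it throws if the string contains an unpaired surrogate. (Passing the output of Encoding.Unicode.GetBytes() to Encoding.UTF8.GetString() would not work, because those bytes are UTF-16, not UTF-8.) But such strings will probably be quite rare.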
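
A minimal sketch of one way to detect such "halves" (the class and method names are hypothetical): it assumes a strict `UTF8Encoding` and calls `GetBytes` directly on the string, so an unpaired surrogate makes the encoder throw an `EncoderFallbackException`:

```csharp
using System;
using System.Text;

static class Utf16StringCheck
{
    // The second argument (throwOnInvalidBytes) enables error detection.
    private static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Hypothetical helper: true when the string contains no unpaired
    // surrogates and can therefore be encoded as UTF-8.
    public static bool CanEncodeAsUtf8(string text)
    {
        try
        {
            StrictUtf8.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(CanEncodeAsUtf8("abc"));           // True
        Console.WriteLine(CanEncodeAsUtf8("\uD83D\uDE35"));  // True, complete surrogate pair
        Console.WriteLine(CanEncodeAsUtf8("\uD83D" + "X"));  // False, high surrogate without its low half
        Console.WriteLine(CanEncodeAsUtf8("\uDE35"));        // False, low surrogate on its own
    }
}
```

This only checks that the string is well formed. It says nothing about whether each codepoint is assigned in the Unicode tables.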

EDIT

Enabling error detection: create the UTF8Encoding with the UTF8Encoding(Boolean, Boolean) constructor and pass true as the second argument (throwOnInvalidBytes).

edited Sep 09 '21 at 20:25

answered Jun 08 '18 at 13:08


@xanatos Thanks. I actually think it's a little strange that C# allows those. Though I can understand the reasoning behind it... – ispiro Jun 08 '18 at 13:12
You can, however, losslessly convert all C# strings to [WTF-8](http://simonsapin.github.io/wtf-8/). – dan04 Jun 08 '18 at 13:21
@dan04 What about a half of a surrogate pair? I think it's a 'valid' string, though it's not really a Unicode character. – ispiro Jun 08 '18 at 13:26
