This should return what you asked for. It checks both that there are no unpaired high/low surrogates and that there are no unassigned code points (where "assigned" depends on the Unicode tables shipped with the version of .NET you are using and on the version of the operating system).
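// Requires: using System.Globalization; (for UnicodeCategory)
static bool IsLegalUnicode(string str)
{
    for (int i = 0; i < str.Length; i++)
    {
        // For a valid high-low pair starting at i, this returns the
        // category of the whole code point, not UnicodeCategory.Surrogate
        var uc = char.GetUnicodeCategory(str, i);
        if (uc == UnicodeCategory.Surrogate)
        {
            // Unpaired surrogate, like "\U0001F600"[0] + "A" (lone high)
            // or "\U0001F600"[1] + "A" (lone low)
            return false;
        }
        else if (uc == UnicodeCategory.OtherNotAssigned)
        {
            // Unassigned code point, e.g. \uFF00, or \U00030000 on
            // runtimes whose tables predate Unicode 13.0
            return false;
        }
        // Correct high-low surrogate pair: skip the low surrogate.
        // (The pair must be valid here, because otherwise the category
        // would have been UnicodeCategory.Surrogate.)
        if (char.IsHighSurrogate(str, i))
        {
            i++;
        }
    }
    return true;
}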
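Because the OtherNotAssigned test depends on the Unicode tables of the runtime, it can be handy to check surrogate pairing on its own, since that part gives the same answer on every version. A minimal sketch of that idea follows; the class and method names (SurrogateCheck, IsWellFormedUtf16) are made up for this example and are not part of .NET:

```csharp
using System;

static class SurrogateCheck
{
    // Hypothetical helper: returns true when every surrogate in str belongs
    // to a valid high-low pair. It never consults the Unicode category tables.
    public static bool IsWellFormedUtf16(string str)
    {
        for (int i = 0; i < str.Length; i++)
        {
            if (!char.IsSurrogate(str[i]))
            {
                continue; // ordinary BMP char, nothing to check
            }
            // A surrogate is only legal as the first half of a high-low pair
            if (i + 1 < str.Length && char.IsSurrogatePair(str[i], str[i + 1]))
            {
                i++; // skip the low surrogate of the pair
                continue;
            }
            return false; // lone high or lone low surrogate
        }
        return true;
    }
}
```

A string that passes this check but fails IsLegalUnicode is well-formed UTF-16 that contains a code point the current tables do not know about.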
Note that Unicode is continuously expanding. UTF-8 can encode every Unicode code point, including the ones that have not been assigned yet.
Some examples:
var test1 = IsLegalUnicode("abcdeàèéìòù"); // true
var test2 = IsLegalUnicode("⭐ White Medium Star"); // true, Unicode 5.1
var test3 = IsLegalUnicode("\U0001F601 Beaming Face With Smiling Eyes"); // true, Unicode 6.0
var test4 = IsLegalUnicode("\U0001F642 Slightly Smiling Face"); // true, Unicode 7.0
var test5 = IsLegalUnicode("\U0001F917 Hugging Face"); // true, Unicode 8.0
var test6 = IsLegalUnicode("\U0001F923 Rolling on the Floor Laughing"); // false, Unicode 9.0 (2016)
var test7 = IsLegalUnicode("\U0001F929 Star-Struck"); // false, Unicode 10.0 (2017)
var test8 = IsLegalUnicode("\uFF00"); // false, unassigned BMP code point
var test9 = IsLegalUnicode("\U0001F600"[0] + "X"); // false, unpaired high surrogate
var test10 = IsLegalUnicode("\U0001F600"[1] + "X"); // false, unpaired low surrogate
Note that you can still encode well-formed but "unknown" Unicode code points in UTF-8, such as Star-Struck (U+1F929).
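To illustrate the point about UTF-8, here is a small sketch that encodes a string with a strict UTF8Encoding, that is, one built with throwOnInvalidBytes set to true. The class and method names (Utf8Demo, TryEncode) are invented for this example. The expectation is that a well-formed surrogate pair encodes regardless of what the category tables say, while an unpaired surrogate makes the encoder throw:

```csharp
using System;
using System.Text;

static class Utf8Demo
{
    // encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Hypothetical helper: returns false instead of throwing when str
    // contains an unpaired surrogate.
    public static bool TryEncode(string str, out byte[] bytes)
    {
        try
        {
            bytes = Strict.GetBytes(str);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = null;
            return false;
        }
    }
}
```

With this helper, "\U0001F929" (Star-Struck) encodes to the four bytes F0 9F A4 A9, whereas "\U0001F929"[0] + "X" is rejected. The default Encoding.UTF8 instance behaves differently: it substitutes a replacement character for an unpaired surrogate instead of throwing.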
Results taken with .NET 4.7.2 under Windows 10.