Check if a large number is a valid Unicode character

Question

I'm looking to check if a large number is a valid Unicode character. I looked into the Char.IsSymbol(char) function, but it requires a char as input. What I need is the equivalent of Char.IsSymbol(int). For example: Char.IsSymbol(340813);

`Char.IsSymbol(Convert.ToChar(340813))` however 340813 is not a valid unicode character. — Longoon12000, Nov 22 '19 at 13:41
I tried it. I get an overflow error. Value too big (the value I used is 340736). — user2729463, Nov 22 '19 at 13:47

phuclv · Answer 1 · 2019-11-22T14:11:42.540

5

char is a 16-bit type in C#, representing a single UTF-16 code unit, so the maximum value it can store is 65535 (0xFFFF). A value such as 340813 does not fit in a char, which is why Char.IsSymbol(340813) doesn't work.
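As a small illustration of the 16-bit limit, here is a minimal sketch (the class and method names are made up for this example) that counts how many char values a code point occupies once it is encoded as UTF-16:

```csharp
using System;

// Hypothetical helper, for illustration only.
static class CharWidthDemo
{
    // Number of UTF-16 code units (char values) needed to store a valid code point.
    // Char.ConvertFromUtf32 throws ArgumentOutOfRangeException for invalid input.
    public static int Utf16Length(int codePoint)
    {
        return Char.ConvertFromUtf32(codePoint).Length;
    }
}
```

With this sketch, CharWidthDemo.Utf16Length(0x52A0) is 1, while CharWidthDemo.Utf16Length(340813) is 2, because code points above 0xFFFF are stored as a surrogate pair.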

To check if a code point is a symbol or not you must convert the code point to a string and call the IsSymbol(String, Int32) overload. To get the string use Char.ConvertFromUtf32(Int32) which "Converts the specified Unicode code point into a UTF-16 encoded string."

int codepoint = 340813;
string character = Char.ConvertFromUtf32(codepoint);
return Char.IsSymbol(character, 0);

Checking whether a code point is valid is even easier, because the largest Unicode code point is 0x10FFFF. For the reason, see "Why Unicode is restricted to 0x10FFFF?"

That means you just need a simple if (codepoint >= 0 && codepoint <= 0x10FFFF) (the lower bound matters because an int can be negative), although you may also need to exclude the surrogate range 0xD800–0xDFFF, because those values are not valid as single characters. That results in

bool isValidUnicodeCharacter = codepoint >= 0 && codepoint <= 0x10FFFF &&
                               (codepoint < 0xD800 || codepoint > 0xDFFF);

You may want to check that the code point is valid before passing it to Char.ConvertFromUtf32(), which throws an ArgumentOutOfRangeException for invalid code points. This avoids the cost of exceptions if your input contains a lot of invalid values.
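Putting the two parts of this answer together, a minimal sketch could look like the following. The class and method names are hypothetical, and returning false for invalid input (rather than throwing) is a design choice made for this example, not something the framework prescribes:

```csharp
using System;

// Hypothetical helper, for illustration only.
static class CodePointChecks
{
    // True for any code point in 0..0x10FFFF that is not a surrogate.
    public static bool IsValidCodePoint(int codePoint)
    {
        return codePoint >= 0 && codePoint <= 0x10FFFF
            && (codePoint < 0xD800 || codePoint > 0xDFFF);
    }

    // Equivalent of Char.IsSymbol for an int code point.
    // Invalid code points yield false instead of an exception.
    public static bool IsSymbol(int codePoint)
    {
        if (!IsValidCodePoint(codePoint))
        {
            return false;
        }

        // Safe now: ConvertFromUtf32 only throws for invalid code points.
        string text = Char.ConvertFromUtf32(codePoint);
        return Char.IsSymbol(text, 0);
    }
}
```

For example, CodePointChecks.IsSymbol('+') is true, while CodePointChecks.IsSymbol(0xDC40) and CodePointChecks.IsSymbol(0x110000) are both false without any exception being thrown.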

edited Nov 22 '19 at 14:11

answered Nov 22 '19 at 14:02

phuclv


2

I don't think the OP is trying to check if the character code is a symbol (in the Unicode meaning of the term), but rather is trying to determine if the character code is a valid Unicode character - although I could of course be wrong. – Matthew Watson Nov 22 '19 at 14:06
1

@MatthewWatson I thought so too and wrote a part about checking for valid code points, then deleted it before posting after I re-read the question – phuclv Nov 22 '19 at 14:09
2

The `Char.ConvertFromUtf32(Int32)` can also be used to check for the validity if that is the case, and so would still be better than what I suggested (but will still suffer from the same problem of the overhead of an exception being thrown). – Matthew Watson Nov 22 '19 at 14:11
2

Anyway this is clearly a better answer, so I've deleted mine. :) – Matthew Watson Nov 22 '19 at 14:14
1

Hi and thank you all for your help. I've tried the suggested solution, but I'm getting a strange result. I'm trying the code with a value of 21152 and string ch = Char.ConvertFromUtf32(num); returns a Chinese character, which would denote that it is a valid UNICODE symbol, yet the function Char.IsSymbol(ch, 0) returns false. – user2729463 Nov 22 '19 at 14:25
1

@user2729463 [U+52A0 belongs to and represents a Han character](https://www.compart.com/en/unicode/U+52A0) so obviously it's not a **symbol** and the function works correctly. `Char.IsSymbol` *Indicates whether a Unicode character is categorized as a symbol character*. Each Unicode character has multiple properties like `isDigit`, `isLetter`, `isUnicodeIdentifierPart`... and `isSymbol` is one of them. If you want to check whether it's a valid code point then I already showed you above – phuclv Nov 22 '19 at 14:34
Ok, so I eliminate the part where I use IsSymbol, because what I'm looking for is a valid UNICODE character and not just a symbol. When I try to convert values like \u0ef0 and \udc40, no character appears, so my string looks something like "㌀\0\u0ef0耀Ѐ\0\0\0\0\0ࠀઠ嗐藀檀圀帠哰桐鯠\0䔀婐\udc40"...and so forth. When I try saving this string to file, I get an exception telling me that I cannot convert \udc40. How do I go about identifying these invalid values before inserting them in the string? – user2729463 Nov 22 '19 at 14:47
@user2729463 `\udc40` is in the surrogate range and isn't a valid character as I said above. My suggested way above will work correctly for it. However it looks like you've read the string incorrectly since there are a lot of bogus characters and zero bytes. Without code I can't tell you more, but I think it's for another question – phuclv Nov 22 '19 at 14:57
All I'm doing is reading keyboard input, concatenating it in a string and outputting it to file. The 0's are meant to be there. I'm simply wondering why some characters won't get decoded and need to somehow intercept them before they get concatenated like that. – user2729463 Nov 22 '19 at 15:05
[U+0EF0](https://www.compart.com/en/unicode/U+0EF0) is also undefined and is not a valid character – phuclv Nov 22 '19 at 15:17
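To see which category a code point falls into, and whether it is assigned at all, something like the sketch below may help. The class and method names are hypothetical; CharUnicodeInfo.GetUnicodeCategory(String, Int32) is the framework call. Note that the result for unassigned code points depends on the Unicode data shipped with the runtime.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
static class CodePointCategory
{
    // General category of a valid code point.
    // Char.ConvertFromUtf32 throws ArgumentOutOfRangeException for invalid input.
    public static UnicodeCategory CategoryOf(int codePoint)
    {
        string text = Char.ConvertFromUtf32(codePoint);
        return CharUnicodeInfo.GetUnicodeCategory(text, 0);
    }

    // False when the runtime's Unicode data has no character at this code point.
    public static bool IsAssigned(int codePoint)
    {
        return CategoryOf(codePoint) != UnicodeCategory.OtherNotAssigned;
    }
}
```

CodePointCategory.CategoryOf(0x52A0) is UnicodeCategory.OtherLetter, which is why Char.IsSymbol returns false for it, and CodePointCategory.IsAssigned(0x0EF0) should be false.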
Is there a way (other than the one indicated above) of checking for these invalid characters? – user2729463 Nov 22 '19 at 15:20
@user2729463 you're thinking about this the wrong way. Looping over the string, checking whether each character is valid, and stripping the invalid ones doesn't work, since surrogate pairs can't be checked one char at a time, and it's extremely inefficient. Use `Encoding.Unicode` with the appropriate fallbacks to remove the invalid characters. Even then, I suspect the error lies in the input function, because there's no way you can enter such invalid characters from the keyboard unless you escape special characters somehow – phuclv Nov 22 '19 at 15:47
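One possible reading of the fallback suggestion is sketched below. The class and method names are hypothetical, and using empty replacement strings is an assumption about what "remove" should mean here. The string is round-tripped through a UTF-16 encoding whose fallbacks replace invalid data with nothing:

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class SurrogateCleaner
{
    // Drops unpaired surrogates such as a lone \udc40; valid pairs are kept.
    public static string RemoveLoneSurrogates(string input)
    {
        // UTF-16 encoding whose fallbacks substitute an empty string
        // for anything that cannot be encoded or decoded.
        Encoding utf16 = Encoding.GetEncoding(
            "utf-16",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));

        byte[] bytes = utf16.GetBytes(input);
        return utf16.GetString(bytes);
    }
}
```

This only deals with unpaired surrogates. Unassigned code points such as U+0EF0 and embedded \0 characters encode without error, so they pass through unchanged.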
