How to recognize if a string contains unicode chars?

Question

I have a string and I want to know whether it contains any Unicode characters (that is, whether it consists entirely of ASCII characters or not).

How can I achieve that?

Thanks!

I think you need to tell us more, since all strings in .NET are unicode. Are you afraid you're going to lose some characters in an encoding process? If so, please tell us what you intend to use the knowledge for. — Lasse V. Karlsen, Dec 16 '10 at 10:16
I want to know if something complies with ASCII or not... (fully comply) — Himberjack, Dec 16 '10 at 10:28
Use a regex. This is a related question; a regex can be used to replace or to match. The following answer is about replacing, but you can use a regex for matching too: http://stackoverflow.com/questions/7411438/remove-characters-from-c-sharp-string – barlop, Apr 20 '17 at 17:57
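
The regex approach suggested in the last comment could look roughly like the sketch below. The class and method names (`AsciiRegexCheck`, `ContainsNonAscii`) are made up for illustration and are not from the thread; the pattern simply matches any character outside the range U+0000 to U+007F.

```csharp
using System.Text.RegularExpressions;

public static class AsciiRegexCheck
{
    // Matches any single UTF-16 code unit outside the ASCII range U+0000..U+007F.
    // (Hypothetical helper; the names are illustrative only.)
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]");

    // Returns true when at least one character falls outside ASCII.
    public static bool ContainsNonAscii(string input)
    {
        return NonAscii.IsMatch(input);
    }
}
```

With this sketch, `AsciiRegexCheck.ContainsNonAscii("plain text")` is false and `AsciiRegexCheck.ContainsNonAscii("caf\u00E9")` is true.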

Tim Lloyd · Accepted Answer · 2013-05-16T17:39:22.323

If my assumptions are correct, you wish to know whether your string contains any "non-ANSI" characters. You can determine this as follows.

    using System;
    using System.Linq; // needed for Any()

    public void test()
    {
        const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
        const string WithoutUnicodeCharacter = "an ANSI character:Æ";

        bool hasUnicode;

        // true: U+FB2F is above 255
        hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
        Console.WriteLine(hasUnicode);

        // false: Æ is U+00C6 (198), which is within 0-255
        hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
        Console.WriteLine(hasUnicode);
    }

    public bool ContainsUnicodeCharacter(string input)
    {
        const int MaxAnsiCode = 255;

        // True if any character's code is above 255
        return input.Any(c => c > MaxAnsiCode);
    }

Update

This check treats extended ASCII (codes up to 255) as non-Unicode. If you test only against the true ASCII character range (up to 127), you could get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.

This is incorrect. A C# char is a unicode UTF-16 character. Only up to 127 are the characters the same as in ASCII. The ASCII extended range will be different depending on the locale used, i.e. ANSI not Extended ASCII. So for English ISO-8859-1 the characters will match UTF-16 but they won't be the same characters in other locales. See the comparison table here: https://en.wikipedia.org/wiki/ISO/IEC_8859. — kjbartel, Jun 18 '19 at 00:33

score 15 · Answer 2 · edited Mar 12 '18 at 19:45

If a string contains only ASCII characters, encoding it to bytes and decoding it again with the ASCII encoding should give back the same string, so a one-line check in C# could look like this:

    string s1 = "testभारत";
    // Use Encoding.ASCII here: Encoding.GetEncoding(0) returns the platform
    // default encoding, which is not necessarily ASCII.
    bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;

answered Aug 22 '17 at 20:17 by zingh · edited Mar 12 '18 at 19:45 by freedomn-m

It does not work for, say, a Russian test string: `System.Text.ASCIIEncoding.GetEncoding(0).GetString(System.Text.ASCIIEncoding.GetEncoding(0).GetBytes("фы")) != "фы"` returns False. – Anton Krouglov Dec 05 '18 at 10:36
I tested your exact statement in a console application and it returns True for me. – zingh Dec 07 '18 at 18:39
I have tested this in LINQPad and it returns False. – Anton Krouglov Dec 07 '18 at 18:46
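
The differing results in these comments are consistent with `GetEncoding(0)` resolving to the platform default encoding rather than to ASCII, so the outcome can depend on the machine. A minimal sketch that names the encoding explicitly is shown below; `RoundTripCheck`, `SurvivesRoundTrip` and `IsAscii` are hypothetical names chosen for illustration.

```csharp
using System.Text;

public static class RoundTripCheck
{
    // True when encoding the text and decoding it again reproduces it exactly,
    // i.e. every character is representable in the given encoding.
    public static bool SurvivesRoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes) == text;
    }

    // Encoding.ASCII substitutes '?' for characters it cannot represent,
    // so any non-ASCII input fails the round trip regardless of locale.
    public static bool IsAscii(string text)
    {
        return SurvivesRoundTrip(text, Encoding.ASCII);
    }
}
```

Under these assumptions, `RoundTripCheck.IsAscii("фы")` is false and `RoundTripCheck.IsAscii("test")` is true, independent of the system code page.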

score 6 · Answer 3 · answered Dec 16 '10 at 10:58

ASCII defines only character codes in the range 0-127. Unicode is explicitly defined to coincide with ASCII in that same range. Thus, if you look at the character codes in your string and find anything higher than 127, the string contains Unicode characters that are not ASCII characters.

Note that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.

The ANSI character set [*] extends ASCII with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not coincide with ANSI throughout that range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).

As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.

[*] Also known as Windows Latin 1 (Windows-1252)
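
The strict ASCII variant described in this answer might be sketched as follows. It is the same `Any` check as in the accepted answer with the cutoff lowered to 127; `StrictAsciiCheck` and `ContainsNonAscii` are illustrative names, not taken from the thread.

```csharp
using System.Linq;

public static class StrictAsciiCheck
{
    // Highest code defined by ASCII.
    private const int MaxAsciiCode = 127;

    // True if any character in the string has a code above 127.
    public static bool ContainsNonAscii(string input)
    {
        return input.Any(c => c > MaxAsciiCode);
    }
}
```

With this cutoff, a string containing Æ (U+00C6) is reported as non-ASCII, whereas the 255 cutoff would let it pass.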

score 1 · Answer 4 · answered Oct 26 '16 at 03:01

This is another solution that does not use lambda expressions. It is in VB.NET, but you can easily convert it to C#:

    Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
        Dim inputCharArray() As Char = inputstr.ToCharArray()

        For i As Integer = 0 To inputCharArray.Length - 1
            If AscW(inputCharArray(i)) > 255 Then Return True
        Next
        Return False
    End Function

answered Oct 26 '16 at 03:01 by Yiannis Mpourkelis

There are only 128 characters in ASCII, so the `> 255` does not appear to be correct. – Zero3 Oct 10 '18 at 22:36
There are 256 characters including the extended ascii character codes based on this table https://www.ascii-code.com – Yiannis Mpourkelis Nov 28 '18 at 19:48

score 1 · Answer 5 · edited May 23 '17 at 12:10

As long as it contains characters, it contains Unicode characters.

From System.String:

Represents text as a series of Unicode characters.

    public static bool ContainsUnicodeChars(string text)
    {
        return !string.IsNullOrEmpty(text);
    }

You normally have to worry about different Unicode encodings when you have to:

Encode a string into a stream of bytes with a particular encoding.
Decode a string from a stream of bytes with a particular encoding.

Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.
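
To illustrate the point about encoding and decoding, here is a small sketch showing that the same string produces different byte sequences under different encodings while the decoded string stays the same. `EncodingDemo`, `ByteCount` and `RoundTrip` are hypothetical helper names used only for this example.

```csharp
using System.Text;

public static class EncodingDemo
{
    // Number of bytes the text occupies when encoded (no byte order mark is added).
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetBytes(text).Length;
    }

    // Encodes the text to bytes and decodes it back with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

For the two-character string `"h\u00E9"`, `ByteCount` gives 3 with `Encoding.UTF8`, 4 with `Encoding.Unicode` (UTF-16) and 8 with `Encoding.UTF32`, and `RoundTrip` returns the original string in all three cases.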

Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
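
One consequence of the UTF-16 representation described above is that a code point outside the Basic Multilingual Plane occupies two `Char` values (a surrogate pair). The sketch below counts both; `CodeUnitDemo` and its methods are illustrative names and not part of the quoted documentation.

```csharp
public static class CodeUnitDemo
{
    // Number of UTF-16 code units (Char values) in the string.
    public static int CodeUnitCount(string text)
    {
        return text.Length;
    }

    // Number of Unicode code points, counting each well-formed surrogate pair once.
    public static int CodePointCount(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(text[i])
                          && i + 1 < text.Length
                          && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                i++; // skip the low surrogate of this pair
            }
            count++;
        }
        return count;
    }
}
```

For `"a\U0001F600"`, `CodeUnitCount` returns 3 while `CodePointCount` returns 2. Both surrogate halves have codes above 127, so a per-`Char` ASCII check still flags such a string.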

Perhaps you might also find these questions relevant:

How can you strip non-ASCII characters from a string? (in C#)

C# Ensure string contains only ASCII

And this article by Jon Skeet: Unicode and .NET

Unicode is a superset of ASCII. The question is clearly about how to determine if the string only uses ASCII characters. So this answer seems unnecessarily pedantic to me... — Zero3, Oct 10 '18 at 22:27
