I'll be getting text from a user, and I need to validate that it contains Chinese characters.
Is there any way I can check this?
You can use a regular expression that matches one of the supported named blocks:
using System.Text.RegularExpressions;

// Extension method, so it has to be declared inside a static class.
private static readonly Regex cjkCharRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}");

public static bool IsChinese(this char c)
{
    return cjkCharRegex.IsMatch(c.ToString());
}
Then, you can use:
if (sometext.Any(z => z.IsChinese()))
DoSomething();
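If the base CJK Unified Ideographs block isn't enough, a sketch along the same lines could OR several of the supported named blocks together; which blocks to include (Extension A, the compatibility ideographs, etc.) is up to your application:

using System.Text.RegularExpressions;

// Covers the base block plus Extension A and the Compatibility Ideographs block;
// add further \p{Is...} alternatives from the "Supported Named Blocks" list as needed.
private static readonly Regex wideCjkRegex = new Regex(
    @"\p{IsCJKUnifiedIdeographs}|\p{IsCJKUnifiedIdeographsExtensionA}|\p{IsCJKCompatibilityIdeographs}");

public static bool IsCjk(this char c)
{
    return wideCjkRegex.IsMatch(c.ToString());
}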
According to the information provided here on the Unicode website, you can find the block for Chinese (or any other language) and then implement a check to see whether a character is in that range, like this:
public bool IsChinese(string text)
{
    // char is a single UTF-16 code unit, so this only looks at the Basic Multilingual Plane.
    return text.Any(c => c >= 0x4E00 && c <= 0xFA2D);
}
Note that, as a handy reference, the Unicode Consortium provides a search interface to the Unicode Hàn (漢) Database (Unihan) here.
The database link I provided above shows you the characters themselves.
As several people have mentioned here, Unicode encodes Chinese, Japanese, and Korean characters together, and they are spread across several ranges. https://en.wikipedia.org/wiki/CJK_Compatibility
For simplicity, here's a code sample that covers the whole CJK range:
public bool IsChinese(string text)
{
    // Because char is a 16-bit UTF-16 code unit, this effectively matches everything
    // from U+4E00 upward in the Basic Multilingual Plane; it is a deliberately coarse check.
    return text.Any(c => (uint)c >= 0x4E00 && (uint)c <= 0x2FA1F);
}
Just check the characters to see if the code points are in the desired range(s). For example, see this question:
What's the complete range for Chinese characters in Unicode?
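A minimal C# sketch of that idea; the ranges below are the common CJK blocks discussed in that question, so trim or extend the list to match what you consider "Chinese" (the code assumes well-formed UTF-16 input):

using System.Linq;

public static bool ContainsChinese(string text)
{
    var ranges = new (int Start, int End)[]
    {
        (0x4E00, 0x9FFF),   // CJK Unified Ideographs
        (0x3400, 0x4DBF),   // Extension A
        (0xF900, 0xFAFF),   // Compatibility Ideographs
        (0x20000, 0x2A6DF), // Extension B (outside the BMP)
        (0x2F800, 0x2FA1F), // Compatibility Ideographs Supplement
    };

    // Step by code point (not char) so characters above U+FFFF, which occupy two chars,
    // are checked as a whole.
    for (var i = 0; i < text.Length; i += char.IsSurrogatePair(text, i) ? 2 : 1)
    {
        var codePoint = char.ConvertToUtf32(text, i);
        if (ranges.Any(r => codePoint >= r.Start && codePoint <= r.End))
            return true;
    }
    return false;
}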
According to Wikipedia (https://en.wikipedia.org/wiki/CJK_Compatibility), there are several character code ranges. Here is my approach to detecting Chinese characters based on the link above (the code is in F#, but it can easily be converted):
let isChinese(text: string) =
text |> Seq.exists (fun c ->
let code = int c
(code >= 0x4E00 && code <= 0x9FFF) ||
(code >= 0x3400 && code <= 0x4DBF) ||
(code >= 0x20000 && code <= 0x2CEAF) ||
(code >= 0x2E80 && code <= 0x31EF) ||
(code >= 0xF900 && code <= 0xFAFF) ||
(code >= 0xFE30 && code <= 0xFE4F) ||
(code >= 0x2F800 && code <= 0x2FA1F)
)
I found another way, using UnicodeRanges (more precisely UnicodeRanges.CjkUnifiedIdeographs), in case someone is looking for it:
public bool IsChinese(char character)
{
var minValue = UnicodeRanges.CjkUnifiedIdeographs.FirstCodePoint;
var maxValue = minValue + UnicodeRanges.CjkUnifiedIdeographs.Length;
return (character >= minValue && character < maxValue);
}
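A usage sketch, assuming the method above is accessible from where you call it and sometext is whatever string the user supplied; UnicodeRanges lives in the System.Text.Unicode namespace:

// requires: using System.Linq; and using System.Text.Unicode; (for UnicodeRanges)
if (sometext.Any(IsChinese))
    DoSomething();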
Added this for a project; it is incomplete and could be optimized further (by checking the code blocks in the right order), but it gets the job done well enough.
const CHINESE_UNICODE_BLOCKS = [
[0x3400, 0x4DB5],
[0x4E00, 0x62FF],
[0x6300, 0x77FF],
[0x7800, 0x8CFF],
[0x8D00, 0x9FCC],
[0x2e80, 0x2fd5],
[0x3190, 0x319f],
[0x3400, 0x4DBF],
[0x4E00, 0x9FCC],
[0xF900, 0xFAAD],
[0x20000, 0x215FF],
[0x21600, 0x230FF],
[0x23100, 0x245FF],
[0x24600, 0x260FF],
[0x26100, 0x275FF],
[0x27600, 0x290FF],
[0x29100, 0x2A6DF],
[0x2A700, 0x2B734],
[0x2B740, 0x2B81D]
]
const JAPANESE_UNICODE_BLOCKS = [
[0x3041, 0x3096],
[0x30A0, 0x30FF],
[0x3400, 0x4DB5],
[0x4E00, 0x9FCB],
[0xF900, 0xFA6A],
[0x2E80, 0x2FD5],
[0xFF5F, 0xFF9F],
[0x3000, 0x303F],
[0x31F0, 0x31FF],
[0x3220, 0x3243],
[0x3280, 0x337F],
[0xFF01, 0xFF5E],
]
const LATIN_UNICODE_BLOCKS = [
[0x0000, 0x007F],
[0x0080, 0x00FF],
[0x0100, 0x017F],
[0x0180, 0x024F],
[0x0250, 0x02AF],
[0x02B0, 0x02FF],
[0x1D00, 0x1D7F],
[0x1D80, 0x1DBF],
[0x1E00, 0x1EFF],
[0x2070, 0x209F],
[0x2100, 0x214F],
[0x2150, 0x218F],
[0x2C60, 0x2C7F],
[0xA720, 0xA7FF],
[0xAB30, 0xAB6F],
[0xFB00, 0xFB4F],
[0xFF00, 0xFFEF],
[0x10780, 0x107BF],
[0x1DF00, 0x1DFFF],
]
const DEVANAGARI_UNICODE_BLOCKS = [
[0x0900, 0x097F]
]
const ARABIC_UNICODE_BLOCKS = [
[0x0600, 0x06FF],
[0x0750, 0x077F],
[0x0870, 0x089F],
[0x08A0, 0x08FF],
[0xFB50, 0xFDFF],
[0xFE70, 0xFEFF],
[0x10E60, 0x10E7F],
[0x1EC70, 0x1ECBF],
[0x1ED00, 0x1ED4F],
[0x1EE00, 0x1EEFF],
]
const TIBETAN_UNICODE_BLOCKS = [
[0x0F00, 0x0FFF],
]
const GREEK_UNICODE_BLOCKS = [
[0x0370, 0x03FF],
[0x1D00, 0x1D7F],
[0x1D80, 0x1DBF],
[0x1F00, 0x1FFF],
[0x2100, 0x214F],
[0xAB30, 0xAB6F],
[0x10140, 0x1018F],
[0x10190, 0x101CF],
[0x1D200, 0x1D24F],
]
const TAMIL_UNICODE_BLOCKS = [
[0x0B80, 0x0BFF],
]
const CYRILLIC_UNICODE_BLOCKS = [
[0x0400, 0x04FF],
[0x0500, 0x052F],
[0x2DE0, 0x2DFF],
[0xA640, 0xA69F],
[0x1C80, 0x1C8F],
[0x1D2B, 0x1D78],
[0xFE2E, 0xFE2F],
]
const HEBREW_UNICODE_BLOCKS = [
[0x0590, 0x05FF],
]
function detectMostProminentLanguage(characters) {
const possibilities = detectLanguageProbabilities(characters)
let maxPair = [null, 0]
let sum = 0
Object.keys(possibilities).forEach(system => {
const value = possibilities[system]
// count every system toward the total, so accuracy ends up as max / total
sum += value
if (maxPair[1] < value && system !== 'other') {
maxPair[0] = system
maxPair[1] = value
}
})
return { system: maxPair[0], accuracy: maxPair[1] / sum }
}
function detectLanguageProbabilities(characters) {
const possibilities = {}
for (const character of characters) {
if (isLatin(character)) {
add(possibilities, 'latin')
} else if (isChinese(character)) {
add(possibilities, 'chinese')
} else if (isJapanese(character)) {
add(possibilities, 'japanese')
} else if (isDevanagari(character)) {
add(possibilities, 'devanagari')
} else if (isHebrew(character)) {
add(possibilities, 'hebrew')
} else if (isTamil(character)) {
add(possibilities, 'tamil')
} else if (isGreek(character)) {
add(possibilities, 'greek')
} else if (isTibetan(character)) {
add(possibilities, 'tibetan')
} else if (isArabic(character)) {
add(possibilities, 'arabic')
} else if (isCyrillic(character)) {
add(possibilities, 'cyrillic')
} else {
add(possibilities, 'other')
}
}
return possibilities
}
function isHebrew(character) {
return isWithinRange(HEBREW_UNICODE_BLOCKS, character)
}
function isCyrillic(character) {
return isWithinRange(CYRILLIC_UNICODE_BLOCKS, character)
}
function isArabic(character) {
return isWithinRange(ARABIC_UNICODE_BLOCKS, character)
}
function isTibetan(character) {
return isWithinRange(TIBETAN_UNICODE_BLOCKS, character)
}
function isGreek(character) {
return isWithinRange(GREEK_UNICODE_BLOCKS, character)
}
function isTamil(character) {
return isWithinRange(TAMIL_UNICODE_BLOCKS, character)
}
function isDevanagari(character) {
return isWithinRange(DEVANAGARI_UNICODE_BLOCKS, character)
}
function isJapanese(character) {
return isWithinRange(JAPANESE_UNICODE_BLOCKS, character)
}
function isLatin(character) {
return isWithinRange(LATIN_UNICODE_BLOCKS, character)
}
function isChinese(character) {
return isWithinRange(CHINESE_UNICODE_BLOCKS, character)
}
function isWithinRange(blocks, character) {
return blocks.some(([ start, end ]) => {
const code = character.codePointAt(0)
return code >= start && code <= end
})
}
function add(possibilities, type) {
possibilities[type] = possibilities[type] ?? 0
possibilities[type]++
}
log('abc')
log('שָׁלוֹם')
log('美丽的')
log('ひらがな')
log('कल्पना')
log('قمر')
log('மின்னல்')
log('αποκάλυψη')
log('дружба')
log('རྣམ་ཤེས་')
function log(text) {
const { system, accuracy } = detectMostProminentLanguage([...text])
console.log(`${text} => ${system} (${accuracy})`)
}
In Unicode, Chinese, Japanese, and Korean characters are encoded together.
Visit this FAQ: http://www.unicode.org/faq/han_cjk.html
Chinese characters are distributed across several blocks.
Visit this wiki: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
You will find several CJK character charts on the Unicode website.
For simplicity, you can just check against the minimum and maximum of the Chinese character range:
0x4E00 and 0x2FA1F.
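As a sketch of that min/max check in C# (deliberately coarse: the range also contains Hangul, kana, and other non-Chinese blocks; EnumerateRunes needs .NET Core 3.0 or later):

using System.Linq;

public static bool ContainsCjk(string text)
{
    // Iterate Unicode code points (runes) so characters above U+FFFF are handled,
    // and flag anything between U+4E00 and U+2FA1F.
    return text.EnumerateRunes().Any(r => r.Value >= 0x4E00 && r.Value <= 0x2FA1F);
}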
This worked for me:
var charArray = text.ToCharArray();
var isChineseTextPresent = false;
foreach (var character in charArray)
{
var cat = char.GetUnicodeCategory(character);
if (cat != UnicodeCategory.OtherLetter)
{
continue;
}
isChineseTextPresent = true;
break;
}
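Note that UnicodeCategory.OtherLetter also covers letters from many non-Chinese scripts (kana, Hebrew, Arabic, and so on), so the loop above reports any such text as Chinese. A sketch that narrows it down by additionally requiring a CJK code point (the 0x4E00–0x9FFF bound is an assumption; widen it if you need the extension blocks):

// requires: using System.Globalization; and using System.Linq;
// true only for characters that are both "other letters" and inside the
// CJK Unified Ideographs block.
var isChineseTextPresent = text.Any(c =>
    char.GetUnicodeCategory(c) == UnicodeCategory.OtherLetter &&
    c >= 0x4E00 && c <= 0x9FFF);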
According to the information provided here on the Microsoft website, you can find the block for Chinese (or any other language) and then implement a check to see whether a character is in that range or not, like this:
public bool IsChinese(char character)
{
return new[]
{
UnicodeRanges.CjkCompatibility,
UnicodeRanges.CjkCompatibilityForms,
UnicodeRanges.CjkCompatibilityIdeographs,
UnicodeRanges.CjkRadicalsSupplement,
UnicodeRanges.CjkStrokes,
UnicodeRanges.CjkSymbolsandPunctuation,
UnicodeRanges.CjkUnifiedIdeographs,
UnicodeRanges.CjkUnifiedIdeographsExtensionA,
UnicodeRanges.EnclosedCjkLettersandMonths
}
.Any(x => character >= x.FirstCodePoint && character < x.FirstCodePoint + x.Length);
}
You need to query the Unicode Character Database, which contains information on every Unicode character. There is probably a utility function in C# that can do this for you; otherwise you can download the database from the Unicode website.