17

I need a way to check whether a string contains Japanese or Chinese text.

Currently I'm using this:

string.match(/[\u3400-\u9FBF]/);

but it does not work for strings like ディアボリックラヴァーズ or バッテリー, for example.

Could you help me with that?

Thanks

Penny Liu
Frank
  • If Japanese can be matched with `[一-龯]` and Chinese with `[\u4E00-\u9FFF\u3400-\u4DFF]`, try using `if (/[一-龯\u4E00-\u9FFF\u3400-\u4DFF]/.test(s)) { alert("Contains Japanese or Chinese chars!"); }` – Wiktor Stribiżew Apr 14 '17 at 20:36
  • @WiktorStribiżew No, that's incorrect. Japanese includes characters outside the CJK range. –  Apr 14 '17 at 20:38
  • Ok, replace the JA one with [`[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\uFF00-\uFFEF\u4E00-\u9FAF\u2605-\u2606\u2190-\u2195\u203B]`](https://regex101.com/r/a5z6kc/1). – Wiktor Stribiżew Apr 14 '17 at 20:40
  • That's even weirder… some of the characters you're including, like U+2605 and U+2606, have nothing to do with Chinese or Japanese at all. (They're ★ and ☆.) –  Apr 14 '17 at 20:42
  • @duskwuff: See [this resource](https://gist.github.com/ryanmcgrath/982242): *Non-Japanese punctuation/formatting characters commonly used in Japanese text*. Yeah, [`/[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\uFF00-\uFFEF\u4E00-\u9FAF\u203B\u4E00-\u9FFF\u3400-\u4DFF]/`](https://regex101.com/r/a5z6kc/2) might be enough. – Wiktor Stribiżew Apr 14 '17 at 20:45
  • Or a bit [more complex regex with all possible Chinese chars](https://regex101.com/r/a5z6kc/4). – Wiktor Stribiżew Apr 14 '17 at 20:50

3 Answers

31

The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:

  • U+3040 - U+30FF: hiragana and katakana (Japanese only)
  • U+3400 - U+4DBF: CJK unified ideographs extension A (Chinese, Japanese, and Korean)
  • U+4E00 - U+9FFF: CJK unified ideographs (Chinese, Japanese, and Korean)
  • U+F900 - U+FAFF: CJK compatibility ideographs (Chinese, Japanese, and Korean)
  • U+FF66 - U+FF9F: half-width katakana (Japanese only)

As a regular expression, this would be expressed as:

/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/

This does not include every character which will appear in Chinese and Japanese text, but any significant piece of typical Chinese or Japanese text will be mostly made up of characters from these ranges.
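A minimal sketch of how the expression above might be wrapped for reuse. The helper name `containsCJK` is hypothetical and not part of the original answer; the ranges are copied unchanged from the list.

```javascript
// Ranges from the list above: kana, extension A, unified ideographs,
// compatibility ideographs, half-width katakana.
const CJK_REGEX = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/;

// Hypothetical helper: true if the string contains at least one
// character from any of those ranges.
function containsCJK(text) {
  return CJK_REGEX.test(text);
}

console.log(containsCJK("バッテリー"));               // true
console.log(containsCJK("ディアボリックラヴァーズ")); // true
console.log(containsCJK("battery"));                  // false
```

Both strings from the question are pure katakana (U+30A0 - U+30FF), which is why the original `[\u3400-\u9FBF]` range missed them.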

Note that this regular expression will also match on Korean text that contains hanja. This is an unavoidable result of Han unification.
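One way to work with that limitation, sketched under the assumption that the presence of kana is a good enough signal for Japanese: split the ranges into the "Japanese only" ones and the shared Han ones. The function name `classifyCJK` and its return labels are made up for illustration.

```javascript
// Japanese-only ranges from the list: hiragana, katakana, half-width katakana.
const KANA_REGEX = /[\u3040-\u30ff\uff66-\uff9f]/;
// Shared Han ranges: extension A, unified ideographs, compatibility ideographs.
const HAN_REGEX = /[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]/;

// Hypothetical classifier: "japanese" if any kana is present,
// "han" if only shared ideographs are found, otherwise "none".
function classifyCJK(text) {
  if (KANA_REGEX.test(text)) return "japanese";
  if (HAN_REGEX.test(text)) return "han";
  return "none";
}

console.log(classifyCJK("ABC検診センター")); // "japanese"
console.log(classifyCJK("渣打銀行"));         // "han"
console.log(classifyCJK("ABC"));              // "none"
```

A result of "han" still cannot tell Chinese from kanji-only Japanese or Korean hanja; that ambiguity is exactly the effect of Han unification.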

  • To add Korean characters to the regex use the following: `\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f\u3131-\uD79D` – Paddy May 04 '20 at 08:43
4

For Swift 4, I adapted the pattern below and used NSRegularExpression to do the replacement. It might help someone!

[\u{3040}-\u{30ff}\u{3400}-\u{4dbf}\u{4e00}-\u{9fff}\u{f900}-\u{faff}\u{ff66}-\u{ff9f}]

As an extension on String:

import Foundation

extension String {
    /// Replaces every match of `pattern` with `replaceWith` (default: removes it).
    mutating func removeRegexMatches(pattern: String, replaceWith: String = "") {
        do {
            let regex = try NSRegularExpression(pattern: pattern, options: [])
            // NSRange counts UTF-16 code units, so build it from the string itself
            // rather than from `count`, which counts Characters.
            let range = NSRange(startIndex..., in: self)
            self = regex.stringByReplacingMatches(in: self, options: [], range: range, withTemplate: replaceWith)
        } catch {
            // Invalid pattern: leave the string unchanged.
            return
        }
    }

    mutating func removeEastAsianChars() {
        let regexPatternEastAsianCharacters = "[\u{3040}-\u{30ff}\u{3400}-\u{4dbf}\u{4e00}-\u{9fff}\u{f900}-\u{faff}\u{ff66}-\u{ff9f}]"
        removeRegexMatches(pattern: regexPatternEastAsianCharacters)
    }
}

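Example (the resulting string is ABC). The method is mutating, so it has to be called on a variable rather than on a literal:

var name = "ABC検診センター"
name.removeEastAsianChars()  // name is now "ABC"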
daviddna
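For comparison with the Swift extension above, the same removal might be sketched in JavaScript, the language of the question. The function name `removeEastAsianChars` simply mirrors the Swift one and is an illustration, not an existing API.

```javascript
// Same ranges as the Swift pattern; the g flag makes replace() act on every match.
const EAST_ASIAN_CHARS = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/g;

// Returns a new string with all characters from those ranges removed.
function removeEastAsianChars(text) {
  return text.replace(EAST_ASIAN_CHARS, "");
}

console.log(removeEastAsianChars("ABC検診センター")); // "ABC"
```

Unlike the Swift version, this returns a new string instead of mutating in place, since JavaScript strings are immutable.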
4

You can use this code; it works for me.

const str = "渣打銀行提供一系列迎合你生活需要嘅信用卡";
// const str = "SGGRAND DING HOUSE 4GRAND DING HOUSE";

// Despite the name, this pattern also matches Japanese kana.
const REGEX_CHINESE = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/;
// test() returns a boolean; match() would return an array or null.
const hasChinese = REGEX_CHINESE.test(str);
if (hasChinese) {
  alert("Found");
} else {
  alert("Not Found");
}
wpmarts