I am learning about regular expressions (regex) for English, and although some of the concepts seem like they would apply to other languages such as Japanese, I feel that many others would not. For example, a common use of regex is to check whether a word contains non-alphanumeric characters. I don't see how this technique, or others like it, would work for Japanese, since there are not only three writing systems, but kanji are also very complex and span a much greater range than alphanumeric characters do. I would appreciate any information on this topic, as well as areas to look into further, since I have very little knowledge of the subject even though I have taken many Japanese courses. If at all possible, I would like your answers to use Python and Java, as those are the languages I am comfortable with. Thank you for your help.

djechlin
Something Jones
  • Most regex implementations support Unicode. What kind of regexes to write is a separate question. – SLaks May 30 '12 at 02:19
  • @Something Jones: You can match Japanese characters by using the hex value of their Unicode code points, e.g. \uXXXX, where XXXX is the code point of the character. – jaselg May 30 '12 at 02:23
  • Something that may help: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml . Note that it doesn't distinguish the Chinese glyph from the Japanese glyph of the same character. – nhahtdh May 30 '12 at 02:28

4 Answers

Python regexes offer limited support for Unicode features. Java is better, particularly Java 7.

Java supports Unicode categories. E.g., \p{L} (and its shorthand, \pL) matches any letter in any language. This includes Japanese ideographic characters.
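
For Python readers: the built-in re module has no \p{L} syntax, so a rough equivalent has to go through the unicodedata module instead. A minimal sketch, assuming Python 3; the helper names is_letter and letters_only are made up for illustration and are not part of any library:

```python
import unicodedata


def is_letter(ch):
    """Return True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith('L')


def letters_only(text):
    """Keep only the characters that a \\p{L}-style test would accept."""
    return ''.join(ch for ch in text if is_letter(ch))


# Kanji and kana are category Lo, so they survive; punctuation, spaces
# and digits are dropped.
print(letters_only('漢字、ひらがな。ABC 123'))  # 漢字ひらがなABC
```

This filters character by character rather than matching a pattern, so it is only a stand-in for the category test, not a replacement for a regex engine that supports Unicode properties.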

Java 7 supports Unicode scripts, including the Hiragana, Katakana, Han, and Latin scripts that Japanese text is typically composed of. You can match any character in one of these scripts using \p{Han}, \p{Hiragana}, \p{Katakana}, and \p{Latin}. You can combine them in a character class such as [\p{Han}\p{Hiragana}\p{Katakana}]. You can use an uppercase P (as in, \P{Han}) to match any character except those in the Han script.

Java 7 supports Unicode blocks. Unless running your code in Android (where scripts are not available), you should generally avoid blocks, since they are less useful and accurate than Unicode scripts. There are a variety of blocks related to Japanese text, including \p{InHiragana}, \p{InKatakana}, \p{InCJK_Unified_Ideographs}, \p{InCJK_Symbols_and_Punctuation}, etc.

Both Java and Python can refer to individual code points using \uFFFF, where FFFF is any four-digit hexadecimal number. Java 7 can refer to any Unicode code point, including those beyond the Basic Multilingual Plane, using e.g. \x{10FFFF}. Python regexes don't support 21-bit Unicode, but Python strings do, so you can embed a code point in a regex using e.g. \U0010FFFF (uppercase U followed by eight hex digits).
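
The Python part of that paragraph reflects the situation in 2012. As a minimal sketch, assuming Python 3.3 or later (where strings are full Unicode and the re module also accepts \u and \U escapes inside the pattern), a character beyond the Basic Multilingual Plane can be matched like this. The sample text and the choice of range are only for illustration:

```python
import re

# CJK Unified Ideographs Extension B occupies U+20000..U+2A6DF, beyond the BMP.
# In a raw string the \U escapes are interpreted by the re module itself.
ext_b = re.compile(r'[\U00020000-\U0002A6DF]')

# Here the \U escape is interpreted by the string literal: U+20BB7 plus two BMP kanji.
text = '\U00020BB7野家'

match = ext_b.search(text)
print(match.group() == '\U00020BB7')  # True
print(len(text))                      # 3, each code point counts once
```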

The Java 7 (?U) or UNICODE_CHARACTER_CLASS flag makes character class shorthands like \w and \d Unicode aware, so they will match Japanese ideographic characters, etc. (but note that \d will still not match kanji for numbers like 一二三四). Python 3 makes shorthand classes Unicode aware by default. In Python 2, shorthand classes are Unicode aware when you use the re.UNICODE or re.U flag.
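
To make the Python 3 behaviour concrete, here is a small sketch, assuming Python 3.4 or later for re.fullmatch. The variable names are arbitrary:

```python
import re

# In Python 3, str patterns are Unicode-aware by default.
word_unicode = bool(re.fullmatch(r'\w+', '漢字ひらがなカタカナ'))
# re.ASCII restricts \w to [a-zA-Z0-9_].
word_ascii = bool(re.fullmatch(r'\w+', '漢字', re.ASCII))
# Fullwidth digits are decimal digits (category Nd), so \d matches them.
digits_fullwidth = bool(re.fullmatch(r'\d+', '１２３'))
# Kanji numerals are letters (category Lo), so \d does not match them.
digits_kanji = bool(re.fullmatch(r'\d+', '一二三'))

print(word_unicode, word_ascii, digits_fullwidth, digits_kanji)
# True False True False
```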

You're right that not all regex ideas carry over equally well to all scripts. Some things (such as letter casing) just don't make sense with Japanese text.

JM Lord
slevithan
  • The second paragraph is very likely to be incorrect according to the Java 7 documentation. **Have you tested?** The rest I cannot test now. – nhahtdh May 30 '12 at 05:07
  • Python version 3 regex supports Unicode more or less the same way Java 7 does for Pattern class: http://docs.python.org/py3k/library/re.html – nhahtdh May 30 '12 at 05:12
  • @nhahtdh, your comments are not helpful. Yes, Java supports Unicode scripts (described in my second paragraph), and doesn't require the "Is" prefix to use them (I recommend not using "Is", since that makes it less portable). And no, Python 3 does not have nearly the same support for Unicode features in its regex syntax as does Java (no Unicode scripts, categories, blocks, etc.). – slevithan May 30 '12 at 05:25
  • I am not sure how "less portable" it is; scripts are only supported in Java 7, not Java 6. And the documentation says that a script can be specified by `\p{IsHiragana}`, `\p{script=Hiragana}`, or `\p{sc=Hiragana}`, but it never says anything about `\p{Hiragana}`. I'm just commenting to make sure. For Python, although the set of features is not as rich as Java's, it can match all word characters in any language using `\w`, according to the documentation at least. – nhahtdh May 30 '12 at 05:34
  • It is less portable with other regular expression flavors. All regular expression flavors that support Unicode scripts allow specifying them without a prefix, but not all flavors support the various prefixes. There is even some confusion about what the "Is" prefix does. E.g., .NET uses it for Unicode blocks, not scripts (unlike Java). – slevithan May 30 '12 at 06:59
  • Regular expression portability is the last thing I'm worried about, especially since there is no standard for regex, apart from a few basic similarities that we can observe across languages. – nhahtdh May 30 '12 at 07:07
  • Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character property name {Katakana} near index 16 – 2Big2BeSmall Jul 10 '16 at 14:16

For Python

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

kanji = '漢字'
hiragana = 'ひらがな'
katakana = 'カタカナ'
text = kanji + hiragana + katakana

# Match kanji (the CJK Unified Ideographs block)
regex = '[\u4E00-\u9FFF]+'  # roughly '[一-龠]+'; note 々 (U+3005) is outside this range
match = re.search(regex, text)  # re.U is the default for str patterns in Python 3
print(match.group())  #=> 漢字

# Match hiragana (the Hiragana block plus the long-vowel mark ー)
regex = '[\u3040-\u309Fー]+'  # roughly '[ぁ-んー]+'
match = re.search(regex, text)
print(match.group())  #=> ひらがな

# Match katakana (the Katakana block)
regex = '[\u30A0-\u30FF]+'  # roughly '[ァ-ヾ]+'
match = re.search(regex, text)
print(match.group())  #=> カタカナ
oxfn
akazah

The Java character classes do something like what you are looking for. They are the ones that start with \p here.

John Watts
  • @nhahtdh: Please ask a question about something in particular to try. – Asherah May 30 '12 at 07:04
  • I just found out why I get false when I run this code: http://ideone.com/p3P9b on my computer. Compilation command should be `javac -encoding utf8 ` for it to work correctly. – nhahtdh May 30 '12 at 07:22

In Unicode there are two ways to classify characters from different writing systems. They are

  • Unicode Script (all characters used in a script, regardless of Unicode code points - may come from different blocks)
  • Unicode Block (code point ranges used for a specific purpose/script - may span across scripts and scripts may span across blocks)

The differences between these are explained rather more clearly on this web page from the official Unicode website.
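
To make the block/script distinction a bit more tangible, here is a rough Python sketch. Python's standard library exposes neither property directly, so the block test uses the Katakana block's code point range and the script test is only a heuristic based on the character's Unicode name. Both helper names are invented for this example:

```python
import unicodedata

KATAKANA_BLOCK = range(0x30A0, 0x3100)  # the Katakana block, U+30A0..U+30FF


def in_katakana_block(ch):
    """Block test: is the code point inside the Katakana block?"""
    return ord(ch) in KATAKANA_BLOCK


def looks_like_katakana_script(ch):
    """Heuristic script test: the Unicode name contains 'KATAKANA LETTER'."""
    return 'KATAKANA LETTER' in unicodedata.name(ch, '')


samples = [
    'カ',       # ordinary katakana: in the block and in the script
    '\u30FC',  # ー prolonged sound mark: in the block, but shared with hiragana
    '\uFF76',  # halfwidth ｶ: katakana script, but in a different block
]
for ch in samples:
    print(ch, in_katakana_block(ch), looks_like_katakana_script(ch))
# カ True True
# ー True False
# ｶ False True
```

The name-based check is an approximation chosen for illustration; a regex engine with real script support consults the Unicode Script property instead.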

In terms of matching characters in regular expressions in Java, you can use either classification mechanism since Java 7.

This is the syntax, as indicated in this tutorial from the Oracle website:

Script:

either \p{IsHiragana} or \p{script=Hiragana}

Block:

either \p{InHiragana} or \p{block=Hiragana}

Note that in one case it's "Is", in the other it's "In".

The syntax \p{Hiragana} indicated in the accepted answer does not seem to be a valid option. I tried it just in case but can confirm that it did not work for me.

fedmest