
I have a text that contains Latin letters and Japanese characters (hiragana, katakana, and kanji).

I want to filter out all Latin characters, hiragana, and katakana, but I am not sure how to do this elegantly. My naive approach would be to remove every letter of the Latin alphabet and every hiragana and katakana character one by one, but I am sure there is a better way.

I am guessing that I have to use a regex, but I am not sure how to go about it. Are characters somehow classified as Roman, Japanese, Chinese, etc.? If so, could I use that classification?

Here is some sample text:

"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"

The program should return only the kanji (Chinese characters), like this:

`私,人,方,皆`
alpenmilch411
    How about using the Unicode ranges of the non-kanji characters? http://stackoverflow.com/questions/19899554/unicode-range-for-japanese – Yasuyuki Uno Oct 26 '15 at 14:27

1 Answer


I found the answer thanks to Olsgaarddk on Reddit:

https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py

# -*- coding: utf-8 -*-
import re

''' This is a library of functions and variables that are helpful to have handy 
    when manipulating Japanese text in python.
    This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
    Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
    All rights reserved.
    Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''




## UNICODE BLOCKS ##

# Regular expression unicode blocks collected from 
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/

hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
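# Written as escapes so the ranges survive copy/paste and Unicode normalization:
# CJK Extension A, CJK Unified Ideographs, CJK Compatibility Ideographs.
kanji = r'[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]'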
radicals = r'[⺀-⿕]'
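# Half-width katakana (U+FF5F-U+FF9F) and full-width alphanumerics (U+FF01-U+FF5E),
# written as escapes; the literal characters are easily folded to other code points.
katakana_half_width = r'[\uFF5F-\uFF9F]'
alphanum_full = r'[\uFF01-\uFF5E]'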
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'

## FUNCTIONS ##

def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
        Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall( unicode_block, string)

def remove_unicode_block(unicode_block, string):
    ''' removes all characters from a unicode block and returns all remaining texts from string argument.
        Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub( unicode_block, '', string)

## EXAMPLES ## 

text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'

print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))
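
Applied to the sample text from the question, the same idea can be sketched as follows. This is a minimal sketch and not part of the linked library: the names `KANJI` and `unique_kanji` are made up for illustration, and the ranges are assumed to be the same three kanji ranges as above, written as escapes.

```python
import re

# Assumed kanji ranges (hypothetical constant, not from jp_regex.py):
# CJK Extension A, CJK Unified Ideographs, CJK Compatibility Ideographs.
KANJI = re.compile(r'[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]')

def unique_kanji(text):
    """Return the kanji in text, in order of first appearance, without duplicates."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so this
    # removes repeats while keeping the first occurrence of each kanji.
    return list(dict.fromkeys(KANJI.findall(text)))

sample = ('"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" '
          '"あ なた","あなた","You" "あの人","あのひと","That person" '
          '"あの方","あのかた","That person (polite)" "皆さん","みなさん"')

print(','.join(unique_kanji(sample)))  # 私,人,方,皆
```

Matching the kanji ranges directly avoids having to list every Latin letter, hiragana, and katakana character to remove; anything outside the ranges is simply ignored.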
alpenmilch411
    I found this really useful. I'm working in Kotlin, so I created a gist of this in Kotlin at https://gist.github.com/bebop-001/c37ae92b8dec0328508047484af6fa47. – steven smith Jun 03 '20 at 00:01