I'm not familiar with all of the different sets of Japanese characters, but you can identify Japanese characters by their Unicode code points, which fall within one of the following ranges:
- Hiragana: 3040-309f
- Katakana: 30a0-30ff
- Kanji: 4e00-9faf
Note that different sources may also include one or two other ranges. The ones I listed should definitely be covered, but you should decide which additional ranges you want to handle and extend the is_japanese_char function shown below accordingly.
import re

def is_japanese_char(ch):
    assert len(ch) == 1  # only use this for single-character strings
    if re.search(r"[\u3040-\u309f]", ch):
        return True  # hiragana
    if re.search(r"[\u30a0-\u30ff]", ch):
        return True  # katakana
    if re.search(r"[\u4e00-\u9faf]", ch):
        return True  # kanji
    return False
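A quick sanity check on a few representative characters (chosen here just for illustration):

print(is_japanese_char("あ"))  # True  (hiragana)
print(is_japanese_char("カ"))  # True  (katakana)
print(is_japanese_char("思"))  # True  (kanji)
print(is_japanese_char("A"))   # False
print(is_japanese_char("_"))   # False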
Now that you can identify Japanese characters, you can iterate over the string and drop the unwanted underscores, like this:
def is_bad_underscore(ch, prev_ch, next_ch):
    if ch != "_":
        return False
    if not is_japanese_char(prev_ch):
        return False
    if not is_japanese_char(next_ch):
        return False
    return True

def remove_bad_underscores(s):
    if len(s) < 3:
        return s  # too short to have an underscore with a neighbor on both sides
    new_string = s[0]
    for i, ch in enumerate(s[1:-1], start=1):  # skip first and last characters
        if not is_bad_underscore(ch, s[i-1], s[i+1]):
            new_string += ch
    return new_string + s[-1]
It's not the cleanest code and could be optimized (see the regex sketch after the examples), but it works.
print(remove_bad_underscores("3F_う_が_LOW_まい_が") == "3F_うが_LOW_まいが") # True
print(remove_bad_underscores("A5_BB_合_ら") == "A5_BB_合ら") # True
print(remove_bad_underscores("C1_だ_と_思") == "C1_だと思") # True
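If you'd rather avoid the manual loop, the same removal can be done in one pass with re.sub and lookaround assertions. This is just a sketch covering only the three ranges listed above (the function name remove_bad_underscores_re is my own); extend the character class if you decide to cover more ranges:

import re

# Hiragana + katakana (contiguous blocks) + kanji, same ranges as above.
JAPANESE = r"\u3040-\u30ff\u4e00-\u9faf"

def remove_bad_underscores_re(s):
    # Drop any underscore that sits directly between two Japanese characters.
    return re.sub(rf"(?<=[{JAPANESE}])_(?=[{JAPANESE}])", "", s)

print(remove_bad_underscores_re("3F_う_が_LOW_まい_が") == "3F_うが_LOW_まいが")  # True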