
I need help removing underscores from certain strings. That's not difficult by itself; the difficulty comes from the fact that the strings contain Japanese characters.

E.g. I have these strings (among hundreds of thousands of other strings):

str1 = "3F_う_が_LOW_まい_が"
str2 = "A5_BB_合_ら"
str3 = "C1_だ_と_思"

What i want to get as a final result is this:

strFinal1 = "3F_うが_LOW_まいが"
strFinal2 = "A5_BB_合ら"
strFinal3 = "C1_だと思"

So essentially I want to delete an underscore only when it sits between two Japanese characters. How can I do this in Python?

Laz22434
  • I don't know how to remove an underscore between two characters, but you should know that in Python you can get a number from a character with the ord() built-in function. Japanese characters will obviously have much higher numbers. – FLAK-ZOSO Feb 02 '22 at 15:06

3 Answers


I'm not familiar with the different sets of Japanese characters, but you should be able to identify Japanese characters based on their Unicode code points, which should lie within one of the following ranges:

  • Hiragana: 3040-309f
  • Katakana: 30a0-30ff
  • Kanji: 4e00-9faf

Note that different sources may also include other ranges. The ones that I listed should definitely be included, but you should figure out which other ranges you want to cover as well, and then extend the is_japanese_char function shown below.

import re

def is_japanese_char(ch):
    assert(len(ch) == 1)  # only use this for single character strings
    if re.search("[\u3040-\u309f]", ch):
        return True  # is hiragana
    if re.search("[\u30a0-\u30ff]", ch):
        return True  # is katakana
    if re.search("[\u4e00-\u9faf]", ch):
        return True  # is kanji
    return False

Now that you can identify Japanese characters, you can iterate over each character in the string, and remove all unwanted characters, like this:

def is_bad_underscore(ch, prev_ch, next_ch):
    if ch != "_":
        return False
    if not is_japanese_char(prev_ch):
        return False
    if not is_japanese_char(next_ch):
        return False
    return True


def remove_bad_underscores(s):
    if len(s) < 3:  # too short to have an underscore between two characters
        return s
    new_string = s[0]
    for i, ch in enumerate(s[1:-1], start=1):  # skip first and last
        if not is_bad_underscore(ch, s[i-1], s[i+1]):
            new_string += ch
    return new_string + s[-1]

It's not the cleanest code, and can be optimized, but it works.

print(remove_bad_underscores("3F_う_が_LOW_まい_が") == "3F_うが_LOW_まいが") # True
print(remove_bad_underscores("A5_BB_合_ら") == "A5_BB_合ら") # True
print(remove_bad_underscores("C1_だ_と_思") == "C1_だと思") # True
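The loop above can also be collapsed into a single re.sub call with lookarounds, using the same code-point ranges (a sketch, assuming only the three ranges listed above matter; note that the hiragana and katakana blocks are contiguous, so they merge into one span):

```python
import re

# One character class covering the hiragana, katakana and kanji ranges above
JAPANESE = "[\u3040-\u30ff\u4e00-\u9faf]"

def remove_bad_underscores_sub(s):
    # Delete "_" only when a Japanese character sits on both sides of it
    return re.sub(f"(?<={JAPANESE})_(?={JAPANESE})", "", s)
```

The lookbehind and lookahead match a single character each, so the underscore itself is the only text removed.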

To supplement Alan Verresen’s answer a bit: for slightly more human-readable code, you can use Unicode script properties with the third-party regex module:

import regex

def is_japanese_char(ch):
    assert(len(ch) == 1)  # only use this for single character strings
    if regex.search(r"\p{Hiragana}", ch):
        return True  # is hiragana
    if regex.search(r"\p{Katakana}", ch):
        return True  # is katakana
    if regex.search(r"\p{Han}", ch):
        return True  # is kanji
    return False

The third-party regex module supports that \p{} syntax, but the standard-library re module doesn’t yet, as far as I know.
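Combining those script properties with a substitution gives a compact version (a sketch; the third-party regex module must be installed, e.g. with pip install regex):

```python
import regex

def remove_bad_underscores(s):
    # Delete "_" only when both neighbours belong to a Japanese script
    jp = r"[\p{Hiragana}\p{Katakana}\p{Han}]"
    return regex.sub(f"(?<={jp})_(?={jp})", "", s)
```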

sideshowbarker

You should check the built-in function ord:

>>> ord('a')
97
>>> ord('が')
12364

As you can see, a Japanese character returns a much higher number when passed to ord, so you can exploit this difference like so:

# Where i is the index of an _ in the string
if (ord(string[i+1]) > 500 and ord(string[i-1]) > 500):
    # The _ is between two non-European characters

This should work:

string = list(input())

for index, element in enumerate(string):
    # Skip the first and last positions: an underscore there
    # can't have a character on both sides
    if index == 0 or index == len(string) - 1:
        continue
    if element == '_':
        # The _ is between two non-European characters
        if ord(string[index+1]) > 500 and ord(string[index-1]) > 500:
            string[index] = ''  # dropped by the final join

string = ''.join(string)
FLAK-ZOSO
  • OP, remember to do bounds check so that you don't get UB – Samathingamajig Feb 02 '22 at 15:10
  • @Samathingamajig UB regarding array bounds doesn't exist in Python. – AKX Feb 02 '22 at 15:17
  • 500 sounds very arbitrary, I would like to see some reasoning behind the cut-off. – Jan Jaap Meijerink Feb 02 '22 at 15:17
  • (However, there is a subtle bug in here when `index` is 0: then `string[index - 1]` refers to the last character in the string.) – AKX Feb 02 '22 at 15:18
  • @JanJaapMeijerink, yes, it's an arbitrary cut-off, but the opposite of ord (at least in terms of returned value) is chr(), and I couldn't find any European letters over index 500. – FLAK-ZOSO Feb 02 '22 at 15:22
  • @FLAK-ZOSO UB as unexpected/undefined behavior, perhaps not the right word, but an edgecase, because this would remove `_` if string was "_がが" or "がが_" – Samathingamajig Feb 02 '22 at 15:36
  • You are right, I had not thought about the -1 index, I'm going to edit my answer. – FLAK-ZOSO Feb 02 '22 at 15:39
  • You may be able to use unicodedata.category() to discriminate characters, though this may be somewhat slower: https://www.unicode.org/reports/tr44/#General_Category_Values I think Asian letters should appear in category Lo – Max Feb 02 '22 at 16:57
  • A better and faster solution would be to put the condition as: the character lies between x and y, where x is the code point of the first Japanese character and y is the last (chars are sorted by language, I guess) – FLAK-ZOSO Feb 02 '22 at 17:09
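Following the unicodedata.category() suggestion in the comments, here is a standard-library-only sketch. Note that the "Lo" category ("Letter, other") also matches many non-Japanese scripts, so this check is broader than the explicit code-point ranges:

```python
import unicodedata

def is_letter_other(ch):
    # "Lo" ("Letter, other") covers hiragana, katakana and kanji,
    # but also many other non-Latin scripts
    return unicodedata.category(ch) == "Lo"

def remove_enclosed_underscores(s):
    chars = list(s)
    for i in range(1, len(chars) - 1):
        if chars[i] == "_" and is_letter_other(chars[i-1]) and is_letter_other(chars[i+1]):
            chars[i] = ""  # removed by the final join
    return "".join(chars)
```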