Find and Replace Japanese Text Between Brackets?

Question

I'm trying to find and remove all the bracketed text in several lines of Japanese text similar to:

雨[あめ]も 降[ふ]るし 強[つよ]い 風[かぜ]が 吹[ふ]くし、ひどい 天気[てんき]ですね。
It rains and strong wind blows, the weather is rough.

今日[きょう] 降[ふ]らなさそうですね。
It seems like it will not rain today. (Probably said while looking at the sky.)

分[わ]からない 単語[たんご]がいっぱいなので、 難[むずか]しそうです。
It seems difficult because there are many words I do not know.

明日[あした] 新[あたら]しいレストランに 行[い]ってみますか？
Will you try and go to the new restaurant tomorrow?

and produce the same text without the bracketed parts:

雨も降るし強い風が吹くし、ひどい天気ですね。 It rains and strong wind blows, the weather is rough ...

Normally I would call `String.replaceAll("\\[\\w*\\]", "")`, but this does not work here. Additionally, the text mixes Kanji and hiragana, so I'm really lost as to how to match it with Unicode character ranges, given the multiple character sets.

According to this post (https://stackoverflow.com/a/10810002/2058221), `\\w` should match word characters in any language, so I'm not sure why it doesn't work in this application.

Do you want to match any word chars inside the brackets? Add `(?U)` then - `replaceAll("(?U)\\[\\w*]", "")` – Wiktor Stribiżew, Mar 13 '23 at 09:18
How about using `replaceAll("\\[.+?\\]", "");`? Please let me know if it works. – Tushar, Mar 13 '23 at 12:51
@WiktorStribiżew This approach worked well, but why is the `(?U)` flag required? Doesn't the matcher assume Unicode by default? – Sarah Szabo, Mar 13 '23 at 18:29
@Tushar This also worked very well ^^ I didn't think to just invert it similar to typical logical thinking. Thanks! – Sarah Szabo, Mar 13 '23 at 18:29
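
On why `(?U)` is needed: in `java.util.regex`, `\w` is ASCII-only by default (`[a-zA-Z_0-9]`), and `(?U)` is the embedded form of `Pattern.UNICODE_CHARACTER_CLASS`. A minimal sketch of the difference follows; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class WordCharDemo {
    // Default \w is ASCII-only, so kana between the brackets never match.
    static String stripAsciiWord(String s) {
        return s.replaceAll("\\[\\w*\\]", "");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS inline, so \w also covers kana and kanji.
    static String stripUnicodeWord(String s) {
        return s.replaceAll("(?U)\\[\\w*\\]", "");
    }

    // Same behavior, with the flag passed to Pattern.compile instead of written inline.
    static String stripWithFlag(String s) {
        return Pattern.compile("\\[\\w*\\]", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s)
                .replaceAll("");
    }

    public static void main(String[] args) {
        String s = "今日[きょう] 降[ふ]らなさそうですね。";
        System.out.println(stripAsciiWord(s));   // 今日[きょう] 降[ふ]らなさそうですね。 (unchanged)
        System.out.println(stripUnicodeWord(s)); // 今日 降らなさそうですね。
        System.out.println(stripWithFlag(s));    // 今日 降らなさそうですね。
    }
}
```

Note that the spaces that separated the annotated words in the input are still there after the replacement.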

phil · Accepted Answer · 2023-03-13T14:37:02.320


Your attempt was very close. Use `replaceAll("\\[[^\\]]+\\]", "")`. This looks for an opening square bracket, then one or more characters that are not a closing square bracket, followed by the closing square bracket.

Or, if you want to limit what's between the square brackets to Hiragana, Katakana and punctuation, then: `replaceAll("\\[[\u3000-\u30ff]+\\]", "")`.
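
As a rough sketch of how the two patterns differ, the snippet below runs both on one of the question's sentences with a hypothetical ASCII annotation `[restaurant]` appended. The class name, method names, and the extra annotation are assumptions for illustration only.

```java
public class FuriganaStripper {
    // Removes any bracketed run of characters other than ']'.
    static String stripAnyBracketed(String s) {
        return s.replaceAll("\\[[^\\]]+\\]", "");
    }

    // Removes only bracketed runs made of U+3000..U+30FF
    // (CJK punctuation, hiragana, katakana).
    static String stripKanaBracketed(String s) {
        return s.replaceAll("\\[[\u3000-\u30ff]+\\]", "");
    }

    public static void main(String[] args) {
        // "[restaurant]" is a made-up annotation, added to show the difference.
        String s = "明日[あした] 新[あたら]しいレストラン[restaurant]";
        System.out.println(stripAnyBracketed(s));  // 明日 新しいレストラン
        System.out.println(stripKanaBracketed(s)); // 明日 新しいレストラン[restaurant]
    }
}
```

Neither pattern touches the spaces between words, so those would need a separate step if the output should have none.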

edited Mar 13 '23 at 14:37

answered Mar 13 '23 at 14:30

phil


Would `replaceAll("\\[[\u3000-\u30ff]+\\]", "")` also match Kanji characters? Japanese uses hiragana for marking conjugations, Katakana for borrowed words, and Kanji for the main word itself: 食べる (to eat something) = 食 (Kanji) + べる (Hiragana). – Sarah Szabo Mar 13 '23 at 18:31
@SarahSzabo CJK unified ideographs are at range 4e00-9faf, so `replaceAll("\\[[\u3000-\u30ff\u4e00-\u9faf]+\\]", "")`. See [Unicode Character Table](https://unicode.org/charts/PDF/U4E00.pdf). – phil Mar 13 '23 at 21:09
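
A small sketch comparing the kana-only range with the extended range from the comment above. The input is a made-up string with one kanji-only bracket, and the names are hypothetical. The CJK Unified Ideographs block itself runs to U+9FFF, so the upper bound here simply follows the comment.

```java
public class KanjiRangeDemo {
    // Kana and CJK punctuation only: U+3000..U+30FF.
    static String stripKanaOnly(String s) {
        return s.replaceAll("\\[[\u3000-\u30ff]+\\]", "");
    }

    // Adds the ideograph range U+4E00..U+9FAF quoted in the comment.
    static String stripKanaAndKanji(String s) {
        return s.replaceAll("\\[[\u3000-\u30ff\u4e00-\u9faf]+\\]", "");
    }

    public static void main(String[] args) {
        // "[漢字]" is a hypothetical kanji-only annotation.
        String s = "食[た]べる [漢字]";
        System.out.println(stripKanaOnly(s));     // 食べる [漢字]
        System.out.println(stripKanaAndKanji(s)); // 食べる, followed by the leftover space
    }
}
```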
