I'm trying to find and remove all the bracketed text in several lines of Japanese text similar to:

雨[あめ]も 降[ふ]るし 強[つよ]い 風[かぜ]が 吹[ふ]くし、ひどい 天気[てんき]ですね。
It rains and strong wind blows, the weather is rough.

今日[きょう] 降[ふ]らなさそうですね。
It seems like it will not rain today. (Probably said while looking at the sky.)

分[わ]からない 単語[たんご]がいっぱいなので、 難[むずか]しそうです。
It seems difficult because there are many words I do not know.

明日[あした] 新[あたら]しいレストランに 行[い]ってみますか?
Will you try and go to the new restaurant tomorrow?

I want the same text without the brackets and their contents:

雨も 降るし 強い 風が 吹くし、ひどい 天気ですね。 It rains and strong wind blows, the weather is rough ...

Normally I would use String's replaceAll() method, as in replaceAll("\\[\\w*\\]", ""), but this does not work here. The text also mixes Kanji and hiragana, so I'm really lost as to how to write a pattern, for example with Unicode character ranges, that covers multiple character sets.

According to this post (https://stackoverflow.com/a/10810002/2058221), \\w should be aware of text characters in any language, so I'm not sure why it won't work in this application.

Sarah Szabo

1 Answer

Your attempt was very close. Use replaceAll("\\[[^\\]]+\\]", ""). This matches an opening square bracket, then one or more characters that are not a closing square bracket, then the closing square bracket.

Or, if you want to limit what's between the square brackets to Hiragana, Katakana and CJK punctuation, use: replaceAll("\\[[\u3000-\u30ff]+\\]", "").
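
As for why the original pattern fails: as far as I know, Java's \w only matches ASCII word characters ([a-zA-Z_0-9]) unless the UNICODE_CHARACTER_CLASS flag is set, for example with the embedded (?U) flag. The sketch below compares the three variants. The class and method names are made up for illustration, and the sample string is 雨[あめ]も 降[ふ]る written with Unicode escapes so that the source compiles whatever the file encoding is.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class StripReadings {
    // "[", then one or more characters that are not "]", then "]".
    private static final Pattern BRACKETED = Pattern.compile("\\[[^\\]]+\\]");

    static String stripReadings(String text) {
        return BRACKETED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample line with two bracketed hiragana readings.
        String sample = "\u96e8[\u3042\u3081]\u3082 \u964d[\u3075]\u308b";
        // The same line with the bracketed readings removed.
        String expected = "\u96e8\u3082 \u964d\u308b";

        // Negated character class: removes both bracketed readings.
        System.out.println(stripReadings(sample).equals(expected)); // prints true

        // Default \w is ASCII-only, so nothing matches and the line is unchanged.
        System.out.println(sample.replaceAll("\\[\\w*\\]", "").equals(sample)); // prints true

        // With (?U), \w follows Unicode rules and also matches hiragana.
        System.out.println(sample.replaceAll("(?U)\\[\\w*\\]", "").equals(expected)); // prints true
    }
}
```

The negated character class is the least fussy of the three, since it does not care which script appears between the brackets.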

phil
  • Would `replaceAll("\\[[\u3000-\u30ff]+\\]", "")` also match Kanji characters? Japanese uses hiragana for marking conjugations, katakana for borrowed words, and Kanji for the main word itself: 食べる (to eat something) = 食 (Kanji) + べる (hiragana). – Sarah Szabo Mar 13 '23 at 18:31
  • @SarahSzabo CJK unified ideographs are at range 4e00-9faf, so `replaceAll("\\[[\u3000-\u30ff\u4e00-\u9faf]+\\]", "")`. See [Unicode Character Table](https://unicode.org/charts/PDF/U4E00.pdf). – phil Mar 13 '23 at 21:09
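
An alternative to hard-coded code point ranges, offered only as a sketch: Java's regex engine also accepts Unicode script properties such as \p{IsHiragana}, \p{IsKatakana} and \p{IsHan}. This is a different technique from the ranges in the comment above, and the class and method names are hypothetical. The test string in main is 食[た]べる, written with Unicode escapes so the source does not depend on file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class ScriptStrip {
    // "[", then one or more hiragana, katakana or Han characters, then "]".
    private static final Pattern KANA_OR_KANJI_IN_BRACKETS =
            Pattern.compile("\\[[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+\\]");

    static String strip(String text) {
        return KANA_OR_KANJI_IN_BRACKETS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // A kanji followed by its bracketed hiragana reading and two more hiragana.
        String word = "\u98df[\u305f]\u3079\u308b";
        System.out.println(strip(word).equals("\u98df\u3079\u308b")); // prints true

        // Brackets holding ASCII text are left alone.
        System.out.println(strip("[abc]").equals("[abc]")); // prints true
    }
}
```

Note that these script classes do not include CJK punctuation, and I believe the long-vowel mark ー (U+30FC) is classified as Common script, so add such characters to the class explicitly if they can appear inside the brackets.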