I'm trying to create a regex that removes furigana (ruby annotations) from Japanese words:

<ruby><rb>二度</rb><rp>(</rp><rt>にど</rt><rp>)</rp>と</ruby> //old string
二度と // new string

I came up with newString = oldString.replace(/<rt>.*<\/rt>/,'').replace(/<rp>.*<\/rp>/,'').replace('<ruby><rb>','').replace('</rb></ruby>','') and it almost works.

When there are multiple ruby tags, it doesn't work as desired:

<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして
息らして //new string, using function above (wrong)
息を切らして //should be this

I'm very new to regular expressions, so I'm not sure how to handle this case.

Panagiotis Kanavos
Lazar Ljubenović
  • I see you are trying to replace everything with the empty string so how are you getting `二度と` ? – Ibrahim Najjar Aug 22 '13 at 09:38
  • @Sniffer Not really, it should leave only content of ``rb`` in ``ruby`` and anything outside ``ruby``. – Lazar Ljubenović Aug 22 '13 at 09:39
  • @BenjaminGruenbaum Yes, non-regex is fine as long as it works. It's JavaScript. – Lazar Ljubenović Aug 22 '13 at 09:43
  • Rather than trying to handle this using regexp, it's cleaner and easier to use the DOM APIs. –  Oct 14 '15 at 12:17
  • Your clarification *it should leave only content of rb in ruby and anything outside ruby* is not consistent with your stated desired output, which includes the と character. Actually, are you sure the と is valid markup? The spec says the content model for `<ruby>` is `(rb, (rt | (rp, rt, rp)))`. –  Oct 14 '15 at 12:23

1 Answer

Try this:

var newstring = oldstring.replace(/<rb>([^<]*)<\/rb>|<rp>[^<]*<\/rp>|<rt>[^<]*<\/rt>|<\/?ruby>/g, "$1");

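The idea is to capture the content of each rb tag and put it back through $1 in the replacement pattern. The rp and rt tags are removed together with their content, and the ruby tags themselves are removed too. The g flag applies the replacement to every match instead of only the first one.

The content between tags is described with [^<]* (any run of characters that are not <), since these tags (rb, rp, rt) can't be nested.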
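As a quick check, here is a minimal sketch that runs the pattern above against both strings from the question. The function name `stripFurigana` and the variable names are made up for illustration; only the regex itself comes from this answer.

```javascript
// Hypothetical helper wrapping the regex from the answer.
function stripFurigana(html) {
  return html.replace(
    // 1st alternative captures the rb content; the others capture nothing.
    /<rb>([^<]*)<\/rb>|<rp>[^<]*<\/rp>|<rt>[^<]*<\/rt>|<\/?ruby>/g,
    "$1" // empty string when the rb alternative did not match
  );
}

// The two sample strings from the question.
const single =
  "<ruby><rb>二度</rb><rp>(</rp><rt>にど</rt><rp>)</rp>と</ruby>";
const multiple =
  "<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を" +
  "<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして";

console.log(stripFurigana(single));   // 二度と
console.log(stripFurigana(multiple)); // 息を切らして
```

The original attempt most likely fails on the second string because `.*` is greedy: `/<rt>.*<\/rt>/` matches from the first `<rt>` to the last `</rt>`, which swallows everything in between, including を and 切. Using `[^<]*` keeps each match inside a single tag pair.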

Casimir et Hippolyte