There must be a way to simplify this regex into a terser form, but I can't work out how to define a character class once and reuse it. Any other recommendations for a better way to accomplish this match would be appreciated.

Intended match:

<Formatting Type="B">any text</Formatting>

This could be nested within other Formatting tags like so

<Formatting Type="B"><Formatting Type="I">any text</Formatting>any text</Formatting>

The following regex does the trick, but it seems more complicated than it should be, because I repeat this section three times:

[\040\w!\?\:\.]*

The end goal is to replace all instances of <Formatting with standard HTML tags <B> <I> <U> etc.

The overall regex is the following:

<Formatting Type="[BIU]{1}">([\040\w!\?\:\.]*(<[BIU]>)*[\040\w!\?\:\.]*(</[BIU]>)*[\040\w!\?\:\.]*)*</Formatting>
Charles
Chris Ballance

2 Answers

I think this is what you were trying for:

<Formatting Type="([BIU])">([ \w!?:.]*(?:</?[BIU]>[ \w!?:.]*)*)</Formatting>

There's no need to have separate productions for opening and closing HTML tags, any more than you need to distinguish between <B>, <I> and <U> tags. All that matters is that, after you match an opening <Formatting> tag, you don't consume any more opening tags before the closing </Formatting> tag. If the original tags are correctly nested, the HTML tags will be, too.
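
On the reuse part of the question: as far as I know, .NET regexes have no syntax for naming a character class and referring to it again later, so one option is to build the pattern from ordinary C# strings. A minimal sketch, where the variable names `chars`, `tag` and `pattern` are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical names: 'chars', 'tag' and 'pattern' are illustrative only.
string chars = @"[ \w!?:.]*";   // the run of plain text that was repeated in the question
string tag = @"</?[BIU]>";      // an opening or closing <B>, <I> or <U> tag

// Interpolated verbatim string: "" is a literal quote, {name} inserts a fragment.
string pattern = $@"<Formatting Type=""([BIU])"">({chars}(?:{tag}{chars})*)</Formatting>";

Console.WriteLine(pattern);
Console.WriteLine(Regex.IsMatch(@"<Formatting Type=""B"">any text</Formatting>", pattern));
```

The assembled pattern is the same as the regex above; the character class is just written once in the source.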

I'm assuming there are only the three types of formatting, and there won't be any other angle brackets or tag-like things in the text. That being the case, you don't need to be so restrictive with the regex.

text = Regex.Replace(text,
    @"<Formatting Type=""([BIU])"">([^<]*(?:</?[BIU]>[^<]*)*)</Formatting>",
    @"<$1>$2</$1>");

Of course, you'll need to make multiple passes over the text to be sure you've replaced all the tags. Given your sample text:

<Formatting Type="B"><Formatting Type="I">any text</Formatting>any text</Formatting>

...after the first pass it would be changed to:

<Formatting Type="B"><I>any text</I>any text</Formatting>

...and after the second pass:

<B><I>any text</I>any text</B>
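
The repeated passes can be sketched as a loop that stops once a pass changes nothing. This is only an illustration; the variable names `formatting`, `text` and `previous` are my own:

```csharp
using System;
using System.Text.RegularExpressions;

var formatting = new Regex(
    @"<Formatting Type=""([BIU])"">([^<]*(?:</?[BIU]>[^<]*)*)</Formatting>");

string text = @"<Formatting Type=""B""><Formatting Type=""I"">any text</Formatting>any text</Formatting>";

// Each pass converts the innermost remaining <Formatting> elements,
// so keep going until a pass leaves the text unchanged.
string previous;
do
{
    previous = text;
    text = formatting.Replace(text, "<$1>$2</$1>");
} while (text != previous);

Console.WriteLine(text);
```

Every successful replacement removes one <Formatting> tag pair, so the loop always terminates; it needs one pass per nesting level plus a final pass that finds nothing left to replace.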
Alan Moore

I think you will find this very difficult, especially because Formatting tags can be nested within each other.

You may want to avoid being driven to madness, as apparently this fellow Stack Overflow user was.

This answer suggests that it can be done with the use of "balanced matching".
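
For reference, here is a rough sketch of what balanced matching can look like with .NET's balancing group constructs. It is not taken from the linked answer; the group name `depth` and the variable names are made up for illustration, and it assumes the text contains no angle brackets other than the Formatting tags.

```csharp
using System;
using System.Text.RegularExpressions;

string open = @"<Formatting Type=""[BIU]"">";
string close = "</Formatting>";

string pattern =
    open +
    "(?>" +
        "(?<depth>" + open + ")" +     // push one capture per nested opening tag
        "|(?<-depth>" + close + ")" +  // pop one capture per nested closing tag
        "|[^<]+" +                     // plain text between tags
    ")*" +
    "(?(depth)(?!))" +                 // fail if a nested tag is still open
    close;                             // the closing tag of the outermost element

string text = @"<Formatting Type=""B""><Formatting Type=""I"">any text</Formatting>any text</Formatting>";

Match m = Regex.Match(text, pattern);
Console.WriteLine(m.Success);
Console.WriteLine(m.Value);
```

This only finds an outermost element together with its correctly nested children; turning the inner tags into HTML would still need more work on top of it.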

You might be better off trying to use an XML technology to accomplish this (maybe XSLT) instead of regex.
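
As an illustration of the XML route, here is a minimal sketch using LINQ to XML rather than XSLT. It assumes the input is well-formed XML; the dummy `root` wrapper and the variable names are my own additions.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

string text = @"<Formatting Type=""B""><Formatting Type=""I"">any text</Formatting>any text</Formatting>";

// Assumption: the input is well-formed XML. Wrap it in a dummy root element
// in case there are several top-level elements or loose text.
XElement root = XElement.Parse("<root>" + text + "</root>", LoadOptions.PreserveWhitespace);

// ToList() takes a snapshot so elements can be renamed while iterating.
foreach (XElement el in root.Descendants("Formatting").ToList())
{
    var type = el.Attribute("Type");
    if (type == null) continue;   // leave unexpected elements alone
    el.Name = type.Value;         // <Formatting Type="B"> becomes <B Type="B">
    type.Remove();                // drop the now-redundant attribute
}

// Serialize the children of the dummy root without re-indenting them.
string html = string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
Console.WriteLine(html);
```

Because the parser tracks nesting itself, this handles any depth in a single pass, at the cost of requiring well-formed input.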

Dr. Wily's Apprentice