
I am working on a project that requires the parsing of "formatting tags." By using a tag like this: <b>text</b>, it modifies the way the text will look (that tag makes the text bold). You can have up to 4 identifiers in one tag (b for bold, i for italics, u for underline, and s for strikeout).

For example:

<bi>some</b> text</i> here would produce "some" in bold italics, " text" in italics only, and " here" with no formatting.

To parse these tags, I'm attempting to use a RegEx to capture any text before the first opening tag, and then capture any tags and their enclosed text after that. Right now, I have this:

<(?<open>[bius]{1,4})>(?<text>.+?)</(?<close>[bius]{1,4})>

That matches a single tag, its enclosed text, and a single corresponding closing tag.

Right now, I iterate over every character and try to match the regex against the substring that runs from the current position to the end of the string, e.g. the whole string at i = 0, the substring from position 1 to the end at i = 1, etc.

However, this approach is incredibly inefficient. It seems like it would be better to match the entire string in one RegEx instead of manually iterating through the string.

My actual question is: is it possible to match the part of a string that does not match a group, such as the text outside a tag? I've Googled this without success, but perhaps I haven't been using the right words.

  • Does your input have to contain only nested tags or can tags overlap? In other words, is ' foo bar baz ' legal input? – Mark Byers Dec 05 '09 at 01:58
  • You say it's inefficient. Does it really affect the process? Have you profiled it? – Esteban Küber Dec 05 '09 at 01:59
  • @Mark You can use both text type tags and textmoretext. –  Dec 05 '09 at 02:01
  • @darkassassin93: Sorry, but I don't understand your answer to my question. Are you saying that overlapping tags like ' foo bar baz ' is allowed input, or not allowed input? – Mark Byers Dec 05 '09 at 02:17
  • @Mark Oh, sorry, I misunderstood your question before. Yes, that is also valid input. The way I did it before was if I came across a tag, I would perform a bitwise operation on the current `FontStyle` based on its identifier and if it's a closing tag or not. –  Dec 05 '09 at 02:19

3 Answers

1

I think trying to parse and validate the entire text in one regular expression is likely to give you problems. The text you are parsing is not a regular language, so regular expressions are not well designed for this purpose.

Instead I would recommend that you first tokenize the input to single tags and text between the tags. You can use a simple regular expression to find single tags - this is a much simpler problem that regular expressions can handle quite well. Once you have tokenized it, you can iterate over the tokens with an ordinary loop and apply formatting to the text as appropriate.
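The tokenize-then-loop idea above can be sketched roughly as follows. This is only an illustration in Python, not the asker's actual code: the flag constants, the `FLAGS` table, and the `parse` function are hypothetical names, and the bitmask stands in for the `FontStyle` value mentioned in the question's comments. Because each tag simply sets or clears bits, overlapping tags need no special handling.

```python
import re

# Hypothetical style bits, standing in for a FontStyle-like flags enum.
BOLD, ITALIC, UNDERLINE, STRIKEOUT = 1, 2, 4, 8
FLAGS = {"b": BOLD, "i": ITALIC, "u": UNDERLINE, "s": STRIKEOUT}

# A single opening or closing tag holding 1 to 4 identifiers.
TAG = re.compile(r"<(/?)([bius]{1,4})>")


def parse(text):
    """Split text into (run, style) pairs, where style is a bitmask."""
    runs = []
    style = 0
    pos = 0
    for m in TAG.finditer(text):
        # Emit any plain text between the previous tag and this one.
        if m.start() > pos:
            runs.append((text[pos:m.start()], style))
        # Combine the bits for every identifier inside the tag.
        mask = 0
        for ch in m.group(2):
            mask |= FLAGS[ch]
        if m.group(1):
            style &= ~mask  # closing tag: clear the bits
        else:
            style |= mask   # opening tag: set the bits
        pos = m.end()
    # Emit whatever follows the last tag.
    if pos < len(text):
        runs.append((text[pos:], style))
    return runs


print(parse("<bi>some</b> text</i> here"))
```

For the example from the question, this yields "some" with bold and italic set, " text" with italic only, and " here" with no style bits. The sketch does no validation, so a closing tag without a matching opening tag is silently accepted.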

Mark Byers
  • Thanks, that's a lot cleaner and easier than the way I was doing it before! –  Dec 05 '09 at 02:18
0

Try prefixing your regex with ^(.*?) (match any characters from the beginning of the string, non-greedy). Thus it will match anything at all that occurs at the start of the string, but it will match as little as it can while still having the rest of the regex match. Thus you'll grab all of the stuff that wasn't matched normally in that first capture group.
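As a rough illustration of that prefix, here is the question's pattern with a leading non-greedy group, written in Python. Python spells named groups as `(?P<name>...)` rather than `(?<name>...)`; the group name `before` and the helper `first_tag` are made up for this sketch.

```python
import re

# The question's pattern, prefixed with a non-greedy capture of leading text.
PATTERN = re.compile(
    r"^(?P<before>.*?)"
    r"<(?P<open>[bius]{1,4})>(?P<text>.+?)</(?P<close>[bius]{1,4})>"
)


def first_tag(text):
    """Return the captured groups for the first tag, or None if no tag matches."""
    m = PATTERN.match(text)
    return m.groupdict() if m else None


print(first_tag("intro <b>bold</b> rest"))
```

Note that this only captures the text before the first tag. Anything after the matched closing tag (" rest" in this example) is not captured and would still need separate handling.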

Amber
0

Why don't you use an HTML parser for this?

You should be using an XML parser, not regexes. XML is not a regular language, hence not easily parsable by a regular expression. Don't do it.

Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.

Community
Esteban Küber
  • It's not valid HTML, so that wouldn't help. – Mark Byers Dec 05 '09 at 01:56
  • There are HTML parsers that can handle invalid input. – Esteban Küber Dec 05 '09 at 01:57
  • Yes, usually by ignoring unknown tags. How would a HTML parser handle input like " foo bar baz qux quux "? I would think it would try to match the start and end tags, but in this case that behaviour is not wanted. 'baz' is not within any tags. – Mark Byers Dec 05 '09 at 02:24