Check for RegEx required

Question

We're currently implementing a little tag system into our software. There are just two different tag styles: single ones and multiple ones.

The single ones look like this:

<<Single_Tag>>

The multiple ones look like this:

<<Multiple_Tag*>>
... stuff between tag ...
<</Multiple_Tag*>>

The RegEx to find the single ones would be:

<<\w+>>

The RegEx to find the multiple ones would be:

<<(\w+)\*{1}>>((.|\s)*)<</(\w+)\*{1}>>

Are the {1}s required? Am I right that (.|\s)* needs to be greedy? Otherwise this RegEx would fail on:

<<multiple_tag1*>>
    <<multiple_tag2*>>

    <</multiple_tag2*>>
<</multiple_tag1*>>

Is there maybe an easier way with capturing groups? Excuse me if the following syntax is wrong. The last time I used RegEx was years ago:

<<(\w+)\*{1}>>((.|\s)*)<</($1)\*{1}>>

That $1 stands for the first capturing group. I'm developing in .NET. I already checked these on RegExr, but I remember that it's very easy to overlook something while working with RegEx.
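As a rough illustration of the capturing-group idea, here is a minimal sketch in Python rather than .NET (an assumption made for brevity). Inside a pattern, a backreference is written `\1`, not `$1`, and this part of the syntax is the same in .NET. The `{1}` quantifiers are redundant and are left out. `[\s\S]*?` stands in for `(.|\s)*` and is lazy, which works here because the closing tag must repeat the captured name. The names `PAIRED_TAG`, `text`, `match`, and `inner` are made up for this example.

```python
import re

# \1 refers back to the first capturing group (the tag name).
# [\s\S]*? lazily matches any characters, including newlines.
PAIRED_TAG = re.compile(r"<<(\w+)\*>>([\s\S]*?)<</\1\*>>")

text = """<<multiple_tag1*>>
    <<multiple_tag2*>>

    <</multiple_tag2*>>
<</multiple_tag1*>>"""

# The outer pair is matched first because its closing tag repeats the name.
match = PAIRED_TAG.search(text)
outer_name = match.group(1)

# Searching the captured body again finds the nested pair.
inner = PAIRED_TAG.search(match.group(2))
inner_name = inner.group(1)
```

This only handles nesting of tags with different names. If a tag can contain another tag of the same name, the lazy match stops at the first closing tag and pairs it with the wrong opener.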

Just my opinion: Keep the XML way of defining tags (i.e. use `<tag/>` and `<tag>...</tag>`). With your current implementation you might confuse your users if they ever try to edit your tags by hand. If you can't use `<` and `>`, think about escaping them (e.g. HTML entities) or use other brackets like `[` and `]`. – Mario, Feb 06 '12 at 14:51
We can't use HTML brackets. But I also can't edit an actual string. This rich text editor has a `document` property, which is not convertible to a string. So there is no way to escape `<` and `>` brackets. That is why we created our own syntax. – Michael Schnerring, Feb 06 '12 at 15:04

score 0 · Accepted Answer · edited May 23 '17 at 09:59

See the following post about parsing HTML with regex, as it applies here as well (my favorite Stack Overflow post ever).

RegEx match open tags except XHTML self-contained tags

Update

One way of solving this is to:

1) Build a tokenizer that splits your input into a sequence of tokens, where each token is one of:

* Non-Tag (contains all the content)
* Open-Tag (contains the name of the tag)
* Close-Tag (contains the name of the tag)

2) Call the tokenizer in a loop and manually keep count of the opening and closing tags, making sure that they balance correctly.

Step (1) could be automated with a lexer generator. In theory, step (2) could be automated by a parser generator, but this may be overkill in this case.

A common lexer and parser generator used in .NET is ANTLR.

Example

This input

<<Multiple_Tag*>>
... stuff between tag ...
<</Multiple_Tag*>>

Would generate the following tokens:

 1. Open-Tag("Multiple_Tag")
 2. Non-Tag("\n... stuff between tag ...\n")
 3. Close-Tag("Multiple_Tag")
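
The two steps and the token stream above could be sketched roughly as follows. This is a minimal hand-written version in Python, not ANTLR output and not .NET code. The names `Token`, `TAG`, `tokenize`, and `is_balanced` are hypothetical. It assumes only the starred tags (`<<Name*>>` and `<</Name*>>`) need balancing, so single tags such as `<<Single_Tag>>` simply stay inside Non-Tag text.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str   # "Open-Tag", "Close-Tag", or "Non-Tag"
    value: str  # tag name, or the raw text for a Non-Tag


# Matches <<Name*>> or <</Name*>>; group 1 is "/" for a closing tag.
TAG = re.compile(r"<<(/?)(\w+)\*>>")


def tokenize(text: str) -> List[Token]:
    """Step 1: split the input into Open-Tag, Close-Tag and Non-Tag tokens."""
    tokens = []
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Everything between two tags is plain content.
            tokens.append(Token("Non-Tag", text[pos:m.start()]))
        kind = "Close-Tag" if m.group(1) else "Open-Tag"
        tokens.append(Token(kind, m.group(2)))
        pos = m.end()
    if pos < len(text):
        tokens.append(Token("Non-Tag", text[pos:]))
    return tokens


def is_balanced(tokens: List[Token]) -> bool:
    """Step 2: every Close-Tag must match the most recent unclosed Open-Tag."""
    stack = []
    for tok in tokens:
        if tok.kind == "Open-Tag":
            stack.append(tok.value)
        elif tok.kind == "Close-Tag":
            if not stack or stack.pop() != tok.value:
                return False
    return not stack  # leftover open tags mean the input is unbalanced


sample = "<<Multiple_Tag*>>\n... stuff between tag ...\n<</Multiple_Tag*>>"
tokens = tokenize(sample)
balanced = is_balanced(tokens)
```

A stack is used instead of a plain counter so that mismatched names, such as an `a` opener closed by a `b` closer, are also rejected.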

edited May 23 '17 at 09:59 by Community · answered Feb 06 '12 at 14:42 by Andrew Skirrow

I think this is pretty nicely written. But I'm not trying to parse HTML. It's a system for finding and replacing tags in a rich text (for bulk letter purposes). I need to use the built-in find and replace method of the rich text editor control. That's because, if you format the tag's text (e.g. color it red), it should be red after replacing as well. There is no way to parse a string, because the content cannot be transformed to a string (*.docx). – Michael Schnerring Feb 06 '12 at 14:54
@ebeeb The message is that you can't process balanced tags with regex, as this requires recursion, which isn't possible with regex. Your example has balanced tags, so you cannot process it with regex. – Andrew Skirrow Feb 06 '12 at 15:11
So what would be the alternative? – Michael Schnerring Feb 06 '12 at 15:30
I did it in a way pretty similar to your solution. I didn't check the balance of the open/close tags. I pushed the parent tag onto a stack when I hit an opening multi tag, and when I hit a closing tag I popped it. That's a simplified description. The actual implementation was much more complicated, and I think it was pretty similar to a parser. I stepped through every single line. – Michael Schnerring Feb 09 '12 at 07:17

score 0 · Answer 2 · answered Feb 06 '12 at 14:48

Regular expressions cannot be used to keep count. If you need to count things, such as how many of your <<Multiple_Tag*>> tags have been passed, you will need a proper parser.
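
To make the counting problem concrete, here is a small sketch in Python rather than .NET (an assumption made for brevity). The pattern names `LAZY` and `GREEDY` and the sample strings are made up for the example. It shows that a single pattern with one lazy or one greedy quantifier pairs the wrong tags as soon as same-named tags are nested or repeated.

```python
import re

# Same pattern twice, differing only in the quantifier on the body.
LAZY = re.compile(r"<<(\w+)\*>>([\s\S]*?)<</\1\*>>")
GREEDY = re.compile(r"<<(\w+)\*>>([\s\S]*)<</\1\*>>")

nested = "<<a*>>outer <<a*>>inner<</a*>> tail<</a*>>"
siblings = "<<a*>>x<</a*>> mid <<a*>>y<</a*>>"

# Lazy stops at the FIRST closing tag, so the outer opener is paired
# with the inner closer and the body is cut short.
lazy_body = LAZY.search(nested).group(2)

# Greedy runs to the LAST closing tag, so two sibling blocks are
# merged into one match.
greedy_body = GREEDY.search(siblings).group(2)
```

Here `lazy_body` is `"outer <<a*>>inner"` and `greedy_body` is `"x<</a*>> mid <<a*>>y"`. Neither is a correctly paired block, because nothing in either pattern tracks how many openers are still unclosed.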

answered Feb 06 '12 at 14:48 by Dervall

There is no need for counting things. This system is hierarchical. So I always step further into the outer match, don't I? – Michael Schnerring Feb 06 '12 at 15:02
Regex doesn't do well with nesting and hierarchy. Is regex part of the user requirements or something? – deltree Feb 06 '12 at 16:04
