1

I have a problem trying to parse some OpenXML Standard (docx). We use expressions like {Contact.MailAddress} and fill this in from data in a second step. However, the way Word (and LibreOffice) are, is that they sometimes split up these tags like this:

<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>{</w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>Contakt.MailAddress</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>}</w:t>
</w:r>

So I made the following regex:

(?<expr>{)((?<tag><[^>]+>)|(?<expr>[\w\s.]+))+(?<expr>})

Everything in the expr groups is part of the {Contact.MailAddress} expression and gets merged together. Everything in the tag groups gets merged into tags to fix the xml together later.

Now, this works pretty well. However, when we use our {foreach} syntax, the xml might get quite big and then we have a runaway case.

Can anyone think of a regex, that captures this better that doesn't result in a runaway?

EDIT 1: The program is written in C#/.NET. For the regex flavor.

EDIT 2: I took another approach: I listed all the matches where there is a match like {[^}]} and within there I replace all tags and whitespaces with nothing:

var matches = Regex.Matches(xml, @"{[^}]+}")
    .Cast<Match>()
    .OrderByDescending(x => x.Index)
    .ToList();

foreach (var match in matches)
{
    var replacement = Regex.Replace(match.Value, @"<[^>]+>", "");
    replacement = Regex.Replace(replacement, @"\s+", "");
    xml = xml.Substring(0, match.Index) + replacement + xml.Substring(match.Index + match.Length);
}

The trick is to order the matches by index decending so the math in Substring works.

LueTm
  • 2,366
  • 21
  • 31

1 Answers1

1

It looks like you want to remove all tags and whitespace between { and }. If you're not worried about other braces that shouldn't be matched, this should work:

s = Regex.Replace(s, 
    @"(?<brace>{)\s*(?:<[^<>]+>\s*)*|\s*(?:<[^<>]+>\s*)*(?<brace>})", 
    @"${brace}");

For safety's sake, you might want to add the nearest actual tags (assuming they're always the same):

@"(?<brace>{)</w:t>\s*(?:<[^<>]+>\s*)*|\s*(?:<[^<>]+>\s*)*<w:t>(?<brace>})"

With either regex, I get this result:

<w:r w:rsidRPr="00E22BCD">
    <w:rPr>
        <w:rFonts w:eastAsia="Times New Roman"/>
        <w:lang w:val="fr-CH"/>
    </w:rPr>
    <w:t>{Contakt.MailAddress}</w:t>
</w:r>

...and there's no backtracking at all.

EDIT:

Turns out the tags are also being inserted before and after the dot inside the braces. My original solution doesn't work for that, so here's a two-stage approach that finds the brace-enclosed text and replaces it with the same text with tags and whitespace removed:

s = Regex.Replace(s, 
    @"{[^{}]*}", 
    m => Regex.Replace(m.Value, @"\s*(?:<[^<>]+>\s*)*", ""));
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • Ah! Sometimes it also adds tags before dots. So I guess what i'm looking for is a regex replace that removes all xml tags and spaces between { and }... I'm trying to figure out one. If you're faster I'd appreciate the help :) – LueTm Jan 05 '16 at 14:38
  • I don't understand, can you be more specific? – Alan Moore Jan 05 '16 at 14:59
  • The problem was, that Word also places tags between The `Contact` and the `MailAddress` part, not only just before or just after the } and { respectively. I posted my solution in my question. Is there a more elegant solution maybe? – LueTm Jan 05 '16 at 15:19
  • You're correct that a multi-stage solution is needed; fortunately, .NET provides a much neater way, using a MatchEvaluator. See my update. – Alan Moore Jan 05 '16 at 15:59