I have a problem trying to parse some OpenXML Standard (docx). We use expressions like {Contact.MailAddress}
and fill this in from data in a second step. However, the way Word (and LibreOffice) are, is that they sometimes split up these tags like this:
<w:r w:rsidRPr="00E22BCD">
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman"/>
<w:lang w:val="fr-CH"/>
</w:rPr>
<w:t>{</w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="00E22BCD">
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman"/>
<w:lang w:val="fr-CH"/>
</w:rPr>
<w:t>Contakt.MailAddress</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="00E22BCD">
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman"/>
<w:lang w:val="fr-CH"/>
</w:rPr>
<w:t>}</w:t>
</w:r>
So I made the following regex:
(?<expr>{)((?<tag><[^>]+>)|(?<expr>[\w\s.]+))+(?<expr>})
Everything in the expr
groups is part of the {Contact.MailAddress}
expression and gets merged together. Everything in the tag
groups gets merged into tags to fix the xml together later.
Now, this works pretty well. However, when we use our {foreach} syntax, the xml might get quite big and then we have a runaway case.
Can anyone think of a regex, that captures this better that doesn't result in a runaway?
EDIT 1: The program is written in C#/.NET. For the regex flavor.
EDIT 2: I took another approach: I listed all the matches where there is a match like {[^}]}
and within there I replace all tags and whitespaces with nothing:
var matches = Regex.Matches(xml, @"{[^}]+}")
.Cast<Match>()
.OrderByDescending(x => x.Index)
.ToList();
foreach (var match in matches)
{
var replacement = Regex.Replace(match.Value, @"<[^>]+>", "");
replacement = Regex.Replace(replacement, @"\s+", "");
xml = xml.Substring(0, match.Index) + replacement + xml.Substring(match.Index + match.Length);
}
The trick is to order the matches by index decending so the math in Substring
works.