The regular expression below (.NET/C# or PowerShell flavor) matches elements and their inner text in a simple XML file, line by line (multiline matches are not necessary).

    ^[^<]*<(?'Element'[^>\s]*)[^>]*>(?'Text'[^<]*)<\/\1>\s*$
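
For anyone who wants to try the pattern, here is a minimal C# sketch that runs it against a few of the sample lines from this question and prints the two named captures. The variable names (`pattern`, `lines`) are only illustrative, and the sketch assumes a compiler with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged. Both groups are named, so .NET
// numbers them left to right and \1 refers to 'Element'.
var pattern = new Regex(@"^[^<]*<(?'Element'[^>\s]*)[^>]*>(?'Text'[^<]*)<\/\1>\s*$");

// Sample lines taken from the two input lists in the question.
string[] lines =
{
    "<ELEMENT>inner text</ELEMENT>",
    "   <ELEMENT ATTRIB=\"foo\">   inner text   </ELEMENT>   ",
    "<ELEMENT>inner text</FOO>",
    "< ELEMENT ATTRIB=\"foo\">   inner text   </ELEMENT>   ",
};

foreach (var line in lines)
{
    Match m = pattern.Match(line);
    Console.WriteLine(m.Success
        ? $"match: Element='{m.Groups["Element"].Value}' Text='{m.Groups["Text"].Value}'"
        : "no match");
}
```

The first two lines should be reported as matches and the last two as non-matches, in line with the lists in the question.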

This regex matches the following inputs correctly and quite efficiently (see the online simulation):

    <ELEMENT>inner text</ELEMENT>
       <ELEMENT>inner text</ELEMENT>
       <ELEMENT>inner text</ELEMENT>   
    <ELEMENT >   inner text   </ELEMENT>
    <ELEMENT >   inner text   </ELEMENT>   
       <ELEMENT >   inner text   </ELEMENT>   
    <ELEMENT ATTRIB="foo">   inner text   </ELEMENT>   
       <ELEMENT ATTRIB="foo">   inner text   </ELEMENT>

However, the cases that are not supposed to match are rejected correctly, but they incur a lot of backtracking and are therefore very inefficient (see the online simulation):

      ELEMENT ATTRIB="foo">   inner text   </ELEMENT>   
    < ELEMENT ATTRIB="foo">   inner text   </ELEMENT>   
       < ELEMENT ATTRIB="foo">   inner text   </ELEMENT>   
    <ELEMENT>inner text</FOO>
    ELEMENT ATTRIB="foo">   inner text   </ELEMENT>  

QUESTION: Can I use atomic groups to prevent this backtracking and speed up the mismatching performance without slowing down the matching performance ...and how?

If .NET and PowerShell supported possessive quantifiers, I would be asking about them, too.

P.S.
This question is applicable not only to XML inputs. It is about general regex optimization with atomic groups in .NET or PS - not about processing this particular XML input.

  • Suggestion: use an XML parser, mate! You will be much better off with it. – Jorge Campos Dec 13 '22 at 00:38
  • Take a look here, I don't think that post has nearly enough good examples or ways of doing it: https://stackoverflow.com/questions/642293/how-do-i-read-and-parse-an-xml-file-in-c – Jorge Campos Dec 13 '22 at 00:49

1 Answer

Can I use atomic groups to prevent this backtracking and speed up the mismatching performance without slowing down the matching performance

No. Every time a pattern uses * (zero or more), the engine will, by definition, backtrack to try every possible occurrence that could happen.
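
To make the backtracking behaviour of * concrete, here is a minimal C# sketch on a deliberately tiny toy pattern. The patterns and names (`greedy`, `atomic`, `input`) are hypothetical and are not the pattern from the question. The sketch only shows what `*` does with and without the .NET atomic group syntax `(?>...)`; it measures nothing and says nothing about timing.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical toy patterns, not the pattern from the question.
var greedy = new Regex(@"^a*ab$");      // a* may give characters back
var atomic = new Regex(@"^(?>a*)ab$");  // (?>...) never gives characters back

const string input = "aaab";

// a* first takes "aaa", then backtracks one step so the literal 'a' can match.
Console.WriteLine(greedy.IsMatch(input)); // True

// The atomic group keeps all three a's, so the literal 'a' has nothing left to match.
Console.WriteLine(atomic.IsMatch(input)); // False
```

Note that wrapping a quantifier in an atomic group can change which inputs match, as it does here, so any such rewrite would need to be checked against both the matching and the non-matching inputs.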

Your pattern is so backtracking-friendly that attempting an answer that does everything you want would be foolhardy.

Regex is a pattern matching tool and not a lexical analysis tool. I believe you are confusing the two in this situation.

As mentioned in the comments, use a tool better suited to HTML or XML to parse and analyze your data.
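
As a rough illustration of that suggestion, the sketch below uses `XElement.Parse` from `System.Xml.Linq` to read one line at a time. `ParseLine` is a hypothetical helper written for this example, and it assumes each line is meant to hold exactly one element, as in the question's samples.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: parses one line as a single XML element and returns
// its name and inner text, or null when the line is not well-formed XML.
static (string Name, string Text)? ParseLine(string line)
{
    try
    {
        XElement element = XElement.Parse(line.Trim(), LoadOptions.PreserveWhitespace);
        return (element.Name.LocalName, element.Value);
    }
    catch (XmlException)
    {
        return null; // e.g. "<ELEMENT>inner text</FOO>" or "< ELEMENT ...>"
    }
}

Console.WriteLine(ParseLine("   <ELEMENT ATTRIB=\"foo\">   inner text   </ELEMENT>   "));
Console.WriteLine(ParseLine("<ELEMENT>inner text</FOO>") is null);
```

Unlike the regex, this sketch does not reject lines with nested child elements. Checking `element.HasElements` would be one way to add that restriction if it is needed.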

ΩmegaMan