
I have a huge XML file and I need to extract the whole content of a tag that contains a given sequence of numbers. Everything is on one line in my file; I added line breaks here only to make it more readable.

Here is a simplified example.

The file:

<ORDERS>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>
</ORDERS>

I want to match the IDOC BEGIN tag that contains the sequence 0007537181, so the result would be:

<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>

So far I got this regex:

cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>'

This returns everything from the beginning of the first IDOC BEGIN tag up to the end of the one that I want:

<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>

I managed to work around this by piping the output into a second regex that gets the last occurrence of IDOC BEGIN:

cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>' | grep -oP '<IDOC BEGIN(?!.*<IDOC BEGIN).*?</IDOC>'

To summarize, I need to get the last IDOC BEGIN before the sequence of numbers.

Please keep in mind that the original file does not have line breaks, everything is in one line.

1 Answer

The regex you could use is based either on a greedy dot pattern placed at the start and followed by the \K match reset operator, or on a tempered greedy token. Both are risky on large strings that contain partial matches but no full match, because they can cause heavy backtracking.

So, the two regexps are

.*\K<IDOC BEGIN.*?0007537181.*?</IDOC>
<IDOC BEGIN(?:(?!<IDOC BEGIN).)*?0007537181(?:(?!<IDOC BEGIN).)*?</IDOC>
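
As a quick check, both patterns can be run through grep -oP against the sample data from the question, joined into a single line as in the real file. This is only a sketch: it assumes GNU grep built with PCRE support (the -P option), the variable names sample, with_k and with_tgt are illustrative, and it searches for the number that actually appears in the sample data.

```shell
# Sample data from the question, joined into one line (illustrative variable name).
sample='<ORDERS><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC></ORDERS>'

# Variant 1: the greedy .* runs to the end of the line and backtracks to the
# last <IDOC BEGIN from which the rest can match; \K drops the prefix from the output.
with_k=$(printf '%s\n' "$sample" | grep -oP '.*\K<IDOC BEGIN.*?0007537181.*?</IDOC>')

# Variant 2: tempered greedy token; each consumed character is checked so that
# the match never crosses into another <IDOC BEGIN.
with_tgt=$(printf '%s\n' "$sample" | grep -oP '<IDOC BEGIN(?:(?!<IDOC BEGIN).)*?0007537181(?:(?!<IDOC BEGIN).)*?</IDOC>')

printf '%s\n' "$with_k"
printf '%s\n' "$with_tgt"
```

On this sample both commands should print only the IDOC block that contains 0007537181. Note that the \K variant can find only the last qualifying block on a line, since the leading .* consumes everything before it.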

The best idea is to unroll the tempered greedy token in these cases:

<IDOC BEGIN[^<]*(?:<(?!IDOC BEGIN)[^<]*?)*0007537181.*?</IDOC>


The first .*? is replaced with [^<]*(?:<(?!IDOC BEGIN)[^<]*?)*:

  • [^<]* - a negated character class matching 0 or more chars other than <, as many as possible
  • (?:<(?!IDOC BEGIN)[^<]*?)* - 0 or more repetitions of
    • <(?!IDOC BEGIN) - a < char that is not immediately followed by the IDOC BEGIN string
    • [^<]*? - a negated character class matching 0 or more chars other than <, as few as possible
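
The unrolled pattern can be tried the same way as the original command. The following is a minimal sketch under a few assumptions: GNU grep with PCRE support (-P) is available, and a temporary file created with mktemp stands in for the real single-line file (the names myfile, pattern and match are illustrative). The file is passed to grep directly, so cat is not needed.

```shell
# Hypothetical stand-in for the real file: the sample data on a single line.
myfile=$(mktemp)
printf '%s\n' '<ORDERS><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC><IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC></ORDERS>' > "$myfile"

# Unrolled tempered greedy token: between <IDOC BEGIN and the number, only
# non-< characters or a < that does not start another IDOC BEGIN may be consumed.
pattern='<IDOC BEGIN[^<]*(?:<(?!IDOC BEGIN)[^<]*?)*0007537181.*?</IDOC>'

# -o prints only the matched part, -P enables PCRE syntax (lookaheads).
match=$(grep -oP "$pattern" "$myfile")
printf '%s\n' "$match"

rm -f "$myfile"
```

With the sample data this should print just the third IDOC block, without the second grep pass used in the question.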
Wiktor Stribiżew