I have a huge XML file and I need to extract a whole tag, including its content, when it contains a specific sequence of numbers. Everything is on one line in my file; I added line breaks here to make it more readable.
Here is a simplified example.
The file:
<ORDERS>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>
</ORDERS>
I want to match the IDOC BEGIN tag that contains the sequence 0007537181. The expected result is:
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
So far I have this regex:
cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>'
This returns everything from the first IDOC BEGIN tag up to the end of the one that I want:
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
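A likely explanation for this output: the lazy `.*?` only decides where a match ends, while the regex engine still begins at the leftmost position where any match is possible, which here is the first IDOC BEGIN. The sketch below reproduces the effect on a made-up string (the `<a>` tags and the variable names are illustrative only, not part of the real file), assuming GNU grep with PCRE support (`-P`):

```shell
# Made-up sample: two elements on one line; only the second contains "2".
sample='<a>1</a><a>2</a>'

# Same pattern shape as '<IDOC BEGIN.*?NUMBER.*?</IDOC>': lazy, but the match
# still starts at the first <a>, because that is the leftmost possible start.
lazy_match=$(printf '%s\n' "$sample" | grep -oP '<a>.*?2</a>')

echo "$lazy_match"
```

The match covers both elements, which mirrors what happens with the IDOC BEGIN tags.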
I managed to work around this by piping the result to a second regex that keeps only the last occurrence of IDOC BEGIN:
cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>' | grep -oP '<IDOC BEGIN(?!.*<IDOC BEGIN).*?</IDOC>'
To summarize, I need to get the last IDOC BEGIN before the sequence of numbers, using a single regex if possible.
Please keep in mind that the original file does not have line breaks; everything is on one line.
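One possible way to express "the last IDOC BEGIN before the sequence" in a single pattern is a negative lookahead applied at every character, so the match can never run across another opening tag. This is only a sketch under some assumptions: GNU grep with PCRE support (`-P`), IDOC elements that are not nested, and a search value made of plain digits. The function name `extract_idoc` is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: print each <IDOC BEGIN>...</IDOC> element containing a number.
#   $1 = digit sequence to look for (assumed to contain no regex metacharacters)
#   $2 = file to search (may be one single long line)
extract_idoc() {
  # (?:(?!<IDOC BEGIN>).)*? consumes characters one at a time and refuses to
  # step onto the start of another <IDOC BEGIN>, so the number must be found
  # inside the element where the match started. Single quotes keep the shell
  # from touching the "!" in the pattern; only "$1" is expanded.
  grep -oP '<IDOC BEGIN>(?:(?!<IDOC BEGIN>).)*?'"$1"'.*?</IDOC>' "$2"
}
```

Called as `extract_idoc 0007537181 myfile`, this should print only the element that contains the number, without the second grep. On a very large single-line file, grep may still run into PCRE resource limits, so this has not been verified at that scale.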