
I have a large XML file that I want to extract unique values from. The values I'm looking for are in the XML tag ns3:order_id. To complicate matters, the file contains duplicate order_id values, and I'm only interested in getting the unique ones.

I've been using a regex to extract the values. This is the expression:

(?sm)(\<ns3:order_id>\d+\b)(?!.*\1\b) 

The expression gives me what I need, but only when the file is much smaller. When I try it on the "big" file I receive: "Catastrophic backtracking has been detected and the execution of your expression has been halted." I guess it has to do with the .* in the expression, and I have tried different ways of replacing it, without success.

Is there any way to correct my expression so that I can collect the values?

As mentioned above, I've tried several different regex approaches. The expression above works, but not on bigger files.

Wiktor Stribiżew
jackii
  • Why not extract all values without caring about duplicates and pipe them to sort and uniq (or some other post-regex processing)? It's also worth updating the question to clarify why you want to parse XML with a regex at all :) – AD7six Dec 07 '22 at 11:13
  • You should really not use regex on XML files; regex is not meant for this in the majority of XML manipulation tasks. – Wiktor Stribiżew Dec 07 '22 at 11:33
  • I agree, there are better ways to read values from XML than using regex. See this famous (and funny) answer: https://stackoverflow.com/a/1732454/1288408 – Modus Tollens Dec 07 '22 at 11:36
  • It sounds like using XPath with "distinct-values" is a better way to go? – jackii Dec 07 '22 at 11:53
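
The two suggestions in the comments can be sketched roughly as follows. This is only a minimal illustration in Python, which is an assumption, since the question does not say which language or tool is in use. The function names, the sample document, and the namespace URI `http://example.com/ns3` are all hypothetical. The first function drops the `(?!.*\1\b)` lookahead, which rescans the rest of the file for every match, and removes duplicates after matching instead. The second avoids regex entirely and streams the file through an XML parser.

```python
import io
import re
import xml.etree.ElementTree as ET

# Plain pattern with no lookahead: matches are found in one forward pass.
ORDER_ID_RE = re.compile(r"<ns3:order_id>(\d+)")


def unique_ids_regex(text):
    """Collect every order_id, then drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(ORDER_ID_RE.findall(text)))


def unique_ids_stream(source):
    """Stream the XML with a parser and keep each distinct order_id once."""
    seen = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Namespaced tags appear as "{uri}order_id"; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "order_id" and elem.text:
            seen.setdefault(elem.text.strip(), None)
        elem.clear()  # discard the element's content once it has been read
    return list(seen)


# Hypothetical sample document; the namespace URI is made up for illustration.
sample = (
    '<ns3:orders xmlns:ns3="http://example.com/ns3">'
    "<ns3:order_id>100</ns3:order_id>"
    "<ns3:order_id>200</ns3:order_id>"
    "<ns3:order_id>100</ns3:order_id>"
    "</ns3:orders>"
)

print(unique_ids_regex(sample))
print(unique_ids_stream(io.BytesIO(sample.encode("utf-8"))))
```

Both calls yield the same set of distinct values. One difference from the original expression is that these keep the first occurrence of each order_id, whereas the lookahead version matches the last one. For a real file, `unique_ids_stream` can be given a file path instead of the in-memory buffer.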

0 Answers