RegEx to get unique values from large file with duplicates

Question

I have a large XML file that I want to extract unique values from. The values I'm looking for are inside the XML tag ns3:order_id. To make it more complex, the file contains duplicate order_id values, and I'm only interested in getting the unique ones.

I've been using RegEx to extract the values, this is the expression:

(?sm)(\<ns3:order_id>\d+\b)(?!.*\1\b)

The expression gives me what I need, BUT only if the file is much smaller. When I try this expression on the "big" file I receive: "Catastrophic backtracking has been detected and the execution of your expression has been halted." I guess it has to do with the .* in the lookahead, and I have tried different ways of replacing it without success.

Is there any way to correct my expression so that I can collect the values?

As described above, I've tried several different RegEx approaches. The expression above works, but not on bigger files.
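For illustration: the `(?!.*\1\b)` lookahead rescans the rest of the input for every candidate match, so the work grows roughly with the square of the file size. One way around it is to drop the lookahead, match every order_id in a single pass, and deduplicate outside the regex. Below is a minimal Python sketch of that idea, not a definitive solution. It assumes the values are digits only and that each opening tag and its value sit on the same line; the function name `unique_order_ids` and the sample data are made up for the example. Note that it keeps the first occurrence of each value, whereas the original expression keeps the last.

```python
import re
from typing import Iterable, List

# Matches the opening tag and captures the digits that follow it.
# Assumption: tag and value are on the same line, value is digits only.
ORDER_ID = re.compile(r"<ns3:order_id>(\d+)")


def unique_order_ids(lines: Iterable[str]) -> List[str]:
    """Return order_id values in first-seen order, without duplicates."""
    seen = set()
    result = []
    for line in lines:
        for value in ORDER_ID.findall(line):
            if value not in seen:  # set lookup instead of a regex lookahead
                seen.add(value)
                result.append(value)
    return result


# Hypothetical sample input, standing in for lines read from the real file.
sample = [
    "<ns3:order_id>100</ns3:order_id>",
    "<ns3:order_id>200</ns3:order_id><ns3:order_id>100</ns3:order_id>",
]
print(unique_order_ids(sample))  # ['100', '200']
```

Since an open file object yields lines, passing one to `unique_order_ids` should process the file line by line without loading it all into memory.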

Why not extract all values without caring about duplicates and pipe to sort and uniq (or some other post-regex processing)? Also worth updating the question to clarify why you want to parse XML with a regex at all :) — AD7six, Dec 07 '22 at 11:13
You should really not use regex on XML files, regex is not meant for this in the majority of XML manipulation tasks. — Wiktor Stribiżew, Dec 07 '22 at 11:33
I agree, there are better ways to read values from XML than using regex. See this famous (and funny) answer: https://stackoverflow.com/a/1732454/1288408 — Modus Tollens, Dec 07 '22 at 11:36
It sounds like using XPath with "distinct-values" is a better way to go? — jackii, Dec 07 '22 at 11:53
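As a rough sketch of the parser-based route suggested in the comments: Python's standard `xml.etree.ElementTree` does not support the XPath 2.0 `distinct-values` function, so the example below swaps in a streaming parse with `iterparse` and a set for deduplication. The namespace URI bound to `ns3` is not given in the question, so the code matches on the local tag name only; the function name `distinct_order_ids`, the `urn:example:orders` URI, and the sample document are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def distinct_order_ids(source):
    """Stream an XML document and return distinct order_id texts in first-seen order."""
    seen = set()
    ordered = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace-uri}order_id"; keep only the local name,
        # because the real ns3 URI is unknown here.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "order_id" and elem.text:
            value = elem.text.strip()
            if value not in seen:
                seen.add(value)
                ordered.append(value)
        # Drop the text and children of finished elements to reduce memory use.
        elem.clear()
    return ordered


# Hypothetical sample document; the namespace URI is a placeholder.
sample = io.BytesIO(
    b'<ns3:orders xmlns:ns3="urn:example:orders">'
    b"<ns3:order><ns3:order_id>100</ns3:order_id></ns3:order>"
    b"<ns3:order><ns3:order_id>200</ns3:order_id></ns3:order>"
    b"<ns3:order><ns3:order_id>100</ns3:order_id></ns3:order>"
    b"</ns3:orders>"
)
print(distinct_order_ids(sample))  # ['100', '200']
```

`iterparse` also accepts a filename, so the same function can be pointed at a file on disk. Unlike the regex, this does not depend on how the tags are laid out across lines.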

0 Answers