
To preface, I am a beginner with regex. I have a string that looks something like:

     my_folder/foo.xml::someextracontent
     my_folder/foo.xml::someextracontent
     another_folder/foo.xml::someextracontent
     my_folder/bar.xml::someextracontent
     my_folder/bar.xml::someextracontent
     my_folder/hello.xml::someextracontent

I want to return unique XML files which are part of my_folder. So the regex will return:

my_folder/foo.xml
my_folder/bar.xml
my_folder/hello.xml

I've taken a look at Extract All Unique Lines which is close to what I need but I am not sure where to go from there.

The closest attempt I got was `(?sm)(my_folder\/.*?.xml)(?=.*\1)`, which gets all the duplicates, but I want the opposite. So I tried a negative lookahead instead, `(?sm)(my_folder\/.*?.xml)(?!.*\1)`, but the capture groups are totally wrong.

What am I missing in my regex? Here's a link to the regex: https://regex101.com/r/ggY2RB/1

sudomodo
  • Welcome to SO! Good question, although such a task might be better done with `uniq` or another utility. Would you be open to non-regex solutions? – ggorlen Apr 12 '19 at 01:20
  • Thanks for your suggestion! Unfortunately this is done in java so I can’t do that. I’ve updated the tags to reflect this. I was just wondering if there’s a solution using regex only. Otherwise I can just grab all file names and throw them in a Set – sudomodo Apr 12 '19 at 01:43
  • If you're using Java, just use a `HashSet`. I bet it's faster than regex. – ggorlen Apr 12 '19 at 01:45
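The Set-based approach suggested in the comments could look roughly like the sketch below. It is only an illustration: the class name `UniqueXmlFiles`, the method `uniqueFiles`, and the exact extraction pattern are assumptions, not something from the thread. A `LinkedHashSet` is used instead of a plain `HashSet` so the files come back in first-seen order.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a simple regex finds every candidate, a Set removes duplicates.
class UniqueXmlFiles {
    // "my_folder/<name>.xml" at the start of a token (not preceded by a
    // non-whitespace character) and immediately followed by "::".
    private static final Pattern FILE =
            Pattern.compile("(?<!\\S)my_folder/[^/\\s:]+\\.xml(?=::)");

    static Set<String> uniqueFiles(String input) {
        Set<String> seen = new LinkedHashSet<>(); // keeps insertion order
        Matcher m = FILE.matcher(input);
        while (m.find()) {
            seen.add(m.group()); // adding a duplicate is a no-op
        }
        return seen;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "     my_folder/foo.xml::someextracontent",
                "     my_folder/foo.xml::someextracontent",
                "     another_folder/foo.xml::someextracontent",
                "     my_folder/bar.xml::someextracontent",
                "     my_folder/bar.xml::someextracontent",
                "     my_folder/hello.xml::someextracontent");
        // prints [my_folder/foo.xml, my_folder/bar.xml, my_folder/hello.xml]
        System.out.println(uniqueFiles(input));
    }
}
```

Because the uniqueness check lives in the Set, the regex itself needs no backreference and scans the input only once.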

2 Answers


This regex might help you find the unique strings you are looking for:

/(\w+\/\w+\.xml)(?![\s\S]*\1)/s


If you only wish to match my_folder, you might try this:

 /(my_folder\/\w+\.xml)(?![\s\S]*\1)/s
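Since the question is about Java, here is a hedged sketch of how the `my_folder` variant might be applied there. Java has no `/.../s` delimiters, so the pattern is passed as a plain string; no DOTALL flag is needed because `[\s\S]` already matches newlines. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer above.
class LastOccurrenceRegex {
    // Group 1 is the file name; the negative lookahead rejects a match
    // if the same text occurs anywhere later in the input.
    private static final Pattern UNIQUE =
            Pattern.compile("(my_folder/\\w+\\.xml)(?![\\s\\S]*\\1)");

    static List<String> uniqueFiles(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = UNIQUE.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "my_folder/foo.xml::someextracontent",
                "my_folder/foo.xml::someextracontent",
                "another_folder/foo.xml::someextracontent",
                "my_folder/bar.xml::someextracontent",
                "my_folder/bar.xml::someextracontent",
                "my_folder/hello.xml::someextracontent");
        // prints [my_folder/foo.xml, my_folder/bar.xml, my_folder/hello.xml]
        System.out.println(uniqueFiles(input));
    }
}
```

Note that this technique keeps the last occurrence of each duplicate rather than the first, since earlier occurrences fail the lookahead.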


Emma

Instead of a positive lookahead `(?=`, you can use a negative lookahead `(?!` to get the unique strings: it asserts that what you captured in group 1 does not occur again to the right.

In your pattern you make the dot match a newline using `(?s)` and use a non-greedy dot star `.*?`, but you could instead use a negated character class that matches any character except a newline or a forward slash.

If the folder can also contain nested folders, you might use a pattern that repeats 0+ times a group of 1+ characters other than a forward slash or a newline, followed by a forward slash.

(?s)(my_folder/(?:[^/\n]+/)*[^/\n]+\.xml)::(?!.*\1)
  • (?s) Inline modifier, make the dot match a newline
  • ( Capture group
    • my_folder/ Match literally
    • (?:[^/\n]+/)* Repeat 0+ times: 1+ chars other than a forward slash or a newline, followed by a forward slash
    • [^/\n]+\.xml Match 1+ chars other than a forward slash or a newline, followed by .xml
  • ) Close capture group
  • ::(?!.*\1) Match :: followed by asserting what is on the right does not contain what is captured in group 1

In Java

String regex = "(?s)(my_folder/(?:[^/\\n]+/)*[^/\\n]+\\.xml)::(?!.*\\1)";
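A minimal sketch of how this string might be used with `Pattern` and `Matcher`, including a nested folder to exercise the `(?:[^/\n]+/)*` part. The class name, method name, and the `my_folder/sub/baz.xml` sample line are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the regex from this answer.
class NestedFolderRegex {
    private static final String REGEX =
            "(?s)(my_folder/(?:[^/\\n]+/)*[^/\\n]+\\.xml)::(?!.*\\1)";

    static List<String> uniqueFiles(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // group 1 excludes the trailing "::"
        }
        return result;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "my_folder/foo.xml::someextracontent",
                "my_folder/foo.xml::someextracontent",
                "another_folder/foo.xml::someextracontent",
                "my_folder/sub/baz.xml::someextracontent", // hypothetical nested path
                "my_folder/bar.xml::someextracontent");
        // prints [my_folder/foo.xml, my_folder/sub/baz.xml, my_folder/bar.xml]
        System.out.println(uniqueFiles(input));
    }
}
```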


The fourth bird
  • Thanks, then how come `(?sm)(my_folder\/.*?.xml)(?!.*\1)` doesn't work? The capture group gets all the XML file names, and then the negative lookahead should exclude the match if it matches `.*\1` right? My assumption was like `my_folder/test.xml::helloworld my_folder/test.xml` should've been excluded, but instead it included all of it in the capture group – sudomodo Apr 12 '19 at 16:10
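  • That is due to the non-greedy dot star `.*?`. When the lookahead fails, the regex engine backtracks and lets `.*?` consume more characters (including newlines, because of `(?s)`) up to a later `.xml`, so group 1 ends up spanning several lines and no longer repeats further on. – The fourth bird Apr 12 '19 at 17:41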
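To make that behaviour visible, the following sketch runs the original attempt from the question on a two-line input and shows what group 1 actually captures. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of why the lazy `.*?` attempt over-captures.
class LazyDotDemo {
    // The pattern from the question, unchanged (note the unescaped dot before "xml").
    private static final Pattern ATTEMPT =
            Pattern.compile("(?sm)(my_folder\\/.*?.xml)(?!.*\\1)");

    static List<String> captures(String input) {
        List<String> groups = new ArrayList<>();
        Matcher m = ATTEMPT.matcher(input);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        String input = "my_folder/foo.xml::a\nmy_folder/foo.xml::a";
        for (String g : captures(input)) {
            // Show the newline explicitly so the over-capture is easy to see.
            System.out.println(g.replace("\n", "\\n"));
        }
        // prints my_folder/foo.xml::a\nmy_folder/foo.xml
    }
}
```

The short capture `my_folder/foo.xml` is rejected because it occurs again later, so the engine extends `.*?` across the line break until the lookahead succeeds.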