
I'm trying to write a regex that matches any duplicate words (alphanumeric, possibly containing dashes) in some YAML, using a PCRE-based tool.

I have found a regex that matches consecutive duplicates:

(?<=,|^)([^,]*)(,\1)+(?=,|$)

It catches the duplicates in:

hello-world,hello-world,goodbye-world,goodbye-world

but not the two non-adjacent hello-world entries in:

hello-world,goodbye-world,goodbye-world,hello-world

Could someone help me try to build a regex pattern for the second case (or both cases)?

anubhava
torrho

2 Answers


You may use this regex:

(?<=,|^)([^,]+)(?=(?>,[^,]*)*,\1(?>,|$)),


RegEx Details:

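  • (?<=,|^): Assert that the current position is preceded by a , or is the start of the line
  • ([^,]+): Match 1+ non-comma characters and capture them in group #1
  • (?=(?>,[^,]*)*,\1(?>,|$)): Lookahead that skips any number of intervening comma-separated items and asserts that the value captured in group #1 appears again as a whole item later in the line
  • ,: Match the , that follows the captured value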
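To try the same idea outside a PCRE tool, here is a minimal Python sketch. It is an adaptation, not the exact pattern above: Python's `re` requires fixed-width lookbehind, so `(?<=,|^)` is written as `(?:(?<=,)|^)`; the atomic groups are swapped for plain non-capturing groups so it runs on Python versions before 3.11; and the trailing `,` is dropped so only the duplicated value is matched. The name `find_duplicates` is made up for illustration.

```python
import re

# Adapted for Python's `re` (assumptions, not the original PCRE pattern):
#  - (?<=,|^) becomes (?:(?<=,)|^) because lookbehind must be fixed-width
#  - atomic groups (?>...) become (?:...) for compatibility with Python < 3.11
#  - the trailing comma is omitted so only the value itself is matched
DUPLICATE = re.compile(r"(?:(?<=,)|^)([^,]+)(?=(?:,[^,]*)*,\1(?:,|$))")


def find_duplicates(line: str) -> list[str]:
    """Return each item that appears again later in the comma-separated line."""
    # findall returns the contents of capture group 1 for every match
    return DUPLICATE.findall(line)


print(find_duplicates("hello-world,goodbye-world,goodbye-world,hello-world"))
# → ['hello-world', 'goodbye-world']
```

Note that an item is reported once for every occurrence that has a later copy, so a value appearing three times shows up twice in the result.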
anubhava
  • oh interesting! This is also a valid regex. @anubhava thank you for your explanation! :-) – torrho Jul 27 '22 at 18:53
  • Just for my understanding are you looking to match only `hello-world` and not match `goodbye-world` using this regex? – anubhava Jul 27 '22 at 19:05
  • the goal was to catch any duplicate. the regex i originally had only captured consecutive duplicates which was not sufficient. – torrho Jul 27 '22 at 21:48
  • @torrho: Thanks for confirming and I was also thinking same. But then I don't understand how you've selected `(?<=,|^)([^,]*)(?:,.*)?(,\1)(?=,|$)` as working answer which gives incorrect results i.e. showing only 1 match instead of 2. – anubhava Jul 28 '22 at 10:46
  • @anubhava the tool I am working with, semgrep, seems to work with the answer provided for the time being. I am still testing but I am beginning to think your original answer - sans the comma, might be _just_ a little more effective in what I am trying to do here. Should my testing change to your answer, i'll revise my "correct answer" selection. – torrho Jul 28 '22 at 16:17
  • @anubhava :-) I used your original regex as it has tested just a little bit better than the prior answer. Thank you Barmar for your answer as well. They both work for the purpose that I am trying to do however i think anubhava answer has cleaner test results in my tool. thank you! – torrho Jul 28 '22 at 23:30

Put an optional ,.* between the capture group and the back-reference.

(?<=,|^)([^,]*)(?:,.*)?(,\1)(?=,|$)
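As a rough check of how this pattern behaves on the second input from the question, here is a small Python sketch. It is an adaptation rather than the PCRE original: the lookbehind is rewritten as `(?:(?<=,)|^)` because Python's `re` needs fixed-width lookbehind. The variable names are made up for illustration.

```python
import re

# Python adaptation (assumption): (?<=,|^) rewritten as (?:(?<=,)|^)
# because `re` only accepts fixed-width lookbehind.
GREEDY = re.compile(r"(?:(?<=,)|^)([^,]*)(?:,.*)?(,\1)(?=,|$)")

text = "hello-world,goodbye-world,goodbye-world,hello-world"

# Collect (full match, captured value) for every match in the line
matches = [(m.group(0), m.group(1)) for m in GREEDY.finditer(text)]

for full, value in matches:
    print(f"value={value!r} spans {len(full)} of {len(text)} characters")
```

On this input the optional `,.*` swallows everything between the first and last `hello-world`, so the single match covers the whole line and the `goodbye-world` pair inside it is not reported separately. That is fine for flagging a line that contains some duplicate, but not for listing every duplicated value.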


Barmar