2

I want to find duplicated words that are right next to each other, even if there is punctuation in between.

For example:

Vivamus Vivamus diam, diam, Vivamus Vivamus diam, diam Vivamus

There should be four distinct hits here.

I can't figure out why this isn't working. Why? What should the correct code be?

(\w*(?:[ ,\.])*?)\1

PS: This is not necessarily for the Perl engine.

Peter Mortensen
Keng

3 Answers

8

The (?: is a non-capturing parenthesis, meaning it won't store the matches. You will need to use capturing parentheses.

(\w+)\W+\1
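
As a quick check, here is a minimal sketch that runs this pattern over the sample text from the question. It assumes Python's re module, since the question doesn't name an engine. The `bounded` variant with \b anchors is my own hypothetical addition, not part of the answer; it keeps the backreference from matching only the start of a longer word, as in "the then".

```python
import re

text = "Vivamus Vivamus diam, diam, Vivamus Vivamus diam, diam Vivamus"

# Pattern from the answer: a word, one or more non-word characters,
# then the same word again via the backreference \1.
simple = re.compile(r"(\w+)\W+\1")

# Hypothetical hardened variant (not in the answer): \b anchors stop \1
# from matching just the beginning of a longer word.
bounded = re.compile(r"\b(\w+)\W+\1\b")

simple_hits = [m.group(0) for m in simple.finditer(text)]
bounded_hits = [m.group(0) for m in bounded.finditer(text)]

print(simple_hits)   # four hits on the sample text
print(bounded_hits)  # same four hits

# "the then" is not a duplicated word, but the unanchored pattern accepts it.
print(simple.search("the then") is not None)   # True
print(bounded.search("the then") is not None)  # False
```

Whether the \b anchors are worth adding depends on the input; on the sample text both patterns give the same four hits.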
Peter Mortensen
TJ L
1

[[\w|\W]+ ]+ worked for me. Breakdown:

\w: word character

\W: non-word character

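[\w|\W]+: each character may be a word or non-word character, repeated one or more times (note that inside a character class the | is a literal pipe rather than alternation, so this class already matches any character)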

[[\w|\W]+ ]+: ...appended with a space at some point, all occurring one or more times

Peter Mortensen
Stunner
0

The original expression doesn't create a separate capture for the punctuation; instead, the punctuation ends up inside the first capture. That means it would spot things like:

diam, diam, really, really, twice.

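But you aren't really interested in the punctuation, so TJ L's solution works properly, even though the '(?: ) is a non-capturing parenthesis' explanation is somewhat incomplete. The comment quoted is accurate, but it isn't why the overall regex failed: the backreference \1 has to repeat the captured punctuation as well as the word, so a pair like 'diam, diam Vivamus' (comma after the first word, none after the second) is missed.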
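
To make the difference concrete, here is a small sketch comparing the two patterns on the question's sample text. It assumes Python's re module (the question doesn't fix an engine), and the variable names are just for illustration.

```python
import re

text = "Vivamus Vivamus diam, diam, Vivamus Vivamus diam, diam Vivamus"

# The question's pattern: the trailing punctuation sits inside capture
# group 1, so \1 only matches when the same punctuation follows the
# second copy of the word too.
original = re.compile(r"(\w*(?:[ ,\.])*?)\1")

# \w* may match nothing, so this pattern also produces empty matches;
# keep only the non-empty ones.
original_hits = [m.group(0) for m in original.finditer(text) if m.group(0)]

# TJ L's pattern captures the word alone and lets \W+ absorb whatever
# separators come between the two copies.
fixed_hits = [m.group(0) for m in re.finditer(r"(\w+)\W+\1", text)]

print(original_hits)    # three hits; the final "diam, diam Vivamus" pair is missed
print(len(fixed_hits))  # 4
```

The original pattern only fires when both copies of the word are followed by identical punctuation, which is why the last pair in the sample slips through.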

Peter Mortensen
Jonathan Leffler