Regex to identify consecutive and non-consecutive duplicate words in multiline text

Question

I'm writing a syntax checker (in Java) for a file that has the keywords and comma (separation)/semicolon (EOL) separated values. The amount of spaces between two complete constructions is unspecified.

What is required:

Find any duplicate words (consecutive and non-consecutive) in the multiline file.

// Example_1 (duplicate 'test'):
item1  , test, item3   ;
item4,item5;
test , item6;

// Example_2 (duplicate 'test'):
item1  , test, test   ;
item2,item3;

I've tried the (\w+)(s*\W\s*\w*)*\1 pattern, but it doesn't catch duplicates properly.

Do you consider duplicates even if they are separated with multiple non-word characters? Like `word \n\t--- word`? Or should there only be strictly `[whitespace*][nonword][whitespace*]` between words that are checked for duplication? — Wiktor Stribiżew, Mar 10 '20 at 14:22
By non-consecutive, do you mean `item1, item1` should not be matched? — Chris Clayton, Mar 10 '20 at 14:31
@Alex See my updated answer that won't match if there are consecutive dupes — Wiktor Stribiżew, Mar 10 '20 at 14:44
Suppose the string were `"a, a, a"`. Is only the first word to be matched, or are both the first and second words to be matched (much easier)? — Cary Swoveland, Mar 10 '20 at 15:09
@Cary Swoveland In the perfect case, I would like to match the first duplicate, because it will make the error handling (need to report the position of the first duplicate) much easier. — J-Alex, Mar 10 '20 at 15:11
I don't think a regex is the right tool for this task. Instead, create a hash whose keys are the unique words in the text and whose values are the number of occurrences of each word. Then select those keys in the hash whose values are greater than one. — Cary Swoveland, Mar 10 '20 at 16:10
...as done [here](https://stackoverflow.com/questions/26282009/how-to-count-the-number-of-occurrences-of-each-word), for example. (It's a one-liner in Ruby: `File.read(fname).split.tally.select { |_,v| v > 1 }.keys`). — Cary Swoveland, Mar 10 '20 at 16:38
So it is a dupe of [this thread](https://stackoverflow.com/a/51190570/3832970) that deals with both consecutive and non-consecutive duplicate words. — Wiktor Stribiżew, Mar 11 '20 at 11:19
@Wiktor Stribiżew with all due respect, the OP's target is to find consecutive duplicates (which is quite a frequent case), and searching for one of the answers that accidentally covers something additional is like searching for a needle in a haystack. Moreover, the thread doesn't work with multiline. — J-Alex, Mar 11 '20 at 11:39
Multiline matching is a [solved issue](https://stackoverflow.com/questions/159118/how-do-i-match-any-character-across-multiple-lines-in-a-regular-expression), too. — Wiktor Stribiżew, Mar 11 '20 at 11:44
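
The hash-based approach suggested in the comments above could look roughly like this in Java. This is only a sketch: the class name `WordTally` and the method `duplicates` are made up for illustration, and it assumes a "word" is a run of `\w` characters, as in the regex answers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answers.
class WordTally {
    // Assumption: a "word" is any run of \w characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Returns word -> occurrence count for every word seen more than once. */
    static Map<String, Integer> duplicates(String text) {
        // LinkedHashMap keeps the words in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        // Drop every word that occurred only once.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;";
        System.out.println(duplicates(example1)); // prints {test=2}
    }
}
```

Unlike the lookahead-based patterns, this makes a single pass over the text, which may matter for large files.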

anubhava · Accepted Answer · 2020-03-10T14:46:33.750

8

You may use this regex with mode DOTALL (single line):

(?s)(\b\w+\b)(?=.*\b\1\b)

RegEx Demo

RegEx Details:

(?s): Enable DOTALL mode
(\b\w+\b): Match a complete word and capture it in group #1
(?=.*\b\1\b): Lookahead to assert that we have back-reference \1 present somewhere ahead. \b is used to make sure we match exact same word again.
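
Since the OP wants to report the position of the first duplicate, here is a minimal sketch of how the pattern might be used from Java. The class and method names (`FirstDuplicateFinder`, `firstDuplicateOffset`) are hypothetical; only the regex itself comes from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from the answer.
class FirstDuplicateFinder {
    // (?s) lets '.' cross line breaks, so the lookahead can scan the rest of the text.
    private static final Pattern DUPLICATE =
        Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");

    /** Returns the offset of the first word that occurs again later, or -1 if there is none. */
    static int firstDuplicateOffset(String text) {
        Matcher m = DUPLICATE.matcher(text);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1  , test, test   ;\nitem2,item3;";
        System.out.println(firstDuplicateOffset(example1)); // prints 9 (the first "test")
        System.out.println(firstDuplicateOffset(example2)); // prints 9
        System.out.println(firstDuplicateOffset("a, b;\nc;")); // prints -1
    }
}
```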

Additionally:

Based on the earlier comments below: if the intent is to not match consecutive word repeats like item1 item1, then the following regex may be used:

(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)

RegEx Demo 2

There is one extra negative lookahead assertion here to make sure we don't match consecutive repeats.

(?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
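
To see the difference between the two patterns, a small comparison along these lines may help. It is a sketch only; `ConsecutiveRepeatDemo` and `countMatches` are illustrative names, not something from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two regexes from the answer.
class ConsecutiveRepeatDemo {
    // First regex: any word that occurs again later.
    static final Pattern ANY_REPEAT =
        Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");
    // Second regex: same, but fails when the very next word is the same word.
    static final Pattern SKIP_ADJACENT =
        Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    /** Counts how many matches the pattern finds in the text. */
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(ANY_REPEAT, "a, a, a"));               // prints 2
        System.out.println(countMatches(SKIP_ADJACENT, "a, a, a"));            // prints 0
        System.out.println(countMatches(ANY_REPEAT, "item1, test, item1"));    // prints 1
        System.out.println(countMatches(SKIP_ADJACENT, "item1, test, item1")); // prints 1
    }
}
```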

edited Mar 10 '20 at 14:46

answered Mar 10 '20 at 14:13

anubhava


Doesn't this also catch consecutive duplicates? I think it would need to be something like `(?s)(\b\w+\b)(?=.*\b\w+\b.*\b\1\b)` to fit the non-consecutive duplicates rule. – Chris Clayton Mar 10 '20 at 14:26
Do you mean `item1 item1` should not be matched? – anubhava Mar 10 '20 at 14:28
That's my understanding from the non-consecutive specification in the question, but I'm not the OP. @J-Alex? – Chris Clayton Mar 10 '20 at 14:30
Yes, we need OP clarification because the above regex does not meet the "*I'm writing a syntax checker (in Java) for a file that has the keywords and comma (separation)/semicolon (EOL) separated values*" requirement as it may match across any amount of non-word chars. – Wiktor Stribiżew Mar 10 '20 at 14:33
Good catch, no duplicates should be present at all, only unique phrases across all the text lines. – J-Alex Mar 10 '20 at 14:39
@J-Alex: So you want this regex to match `item1` if input is `item1 item1`? – anubhava Mar 10 '20 at 14:41
@anubhava correct – J-Alex Mar 10 '20 at 14:42
Thanks Alex, in that case my original regex would work fine. I can remove `update` part from answer. – anubhava Mar 10 '20 at 14:43
@anubhava I believe this would also be useful information to share. You can just edit it as an extra-case for further readers. – J-Alex Mar 10 '20 at 14:45
ok thanks Alex, that's a good suggestion (updated). – anubhava Mar 10 '20 at 14:47
Your first regex matches no words in "a, a, a;". Is that consistent with your understanding of the problem? – Cary Swoveland Mar 10 '20 at 15:34
Regex-1 is what OP wanted and that one does match `a` twice in `a, a, a` – anubhava Mar 10 '20 at 15:36
My apologies. I must have used your second regex. – Cary Swoveland Mar 10 '20 at 16:25

Wiktor Stribiżew · Answer 2 · 2020-03-10T14:42:17.930

2

You may use

\b(\w+)\b(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b

See the regex demo

Details

\b(\w+)\b - Group 1: one or more word chars as a whole word
(?:\s*[^\w\s]\s*\w+)+ - 1 or more occurrences of:
- \s* - 0+ whitespaces
- [^\w\s] - 1 char other than a word and whitespace char
- \s* - 0+ whitespaces
- \w+ - 1+ word chars
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\b\1\b - the same value as in Group 1 as whole word.

To only match the word, put the second part of the regex into a positive lookahead:

\b(\w+)\b(?=(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b)
         ^^^                                        ^

See this regex demo.
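
As a rough illustration of how the consuming and the lookahead forms differ, the sketch below collects the words each one reports. The class name `SeparatedDuplicateDemo` and the helper `words` are assumptions made for this example; the two patterns are the ones given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two regexes come from the answer.
class SeparatedDuplicateDemo {
    // Consuming form: the match runs from the first occurrence up to the repeat.
    static final Pattern CONSUMING = Pattern.compile(
        "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b");
    // Lookahead form: only the word itself is consumed, so later words can still match.
    static final Pattern LOOKAHEAD = Pattern.compile(
        "\\b(\\w+)\\b(?=(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b)");

    /** Collects the Group 1 value of every match. */
    static List<String> words(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = " item1 , test, item1, test, item1;";
        System.out.println(words(CONSUMING, text)); // prints [item1]
        System.out.println(words(LOOKAHEAD, text)); // prints [item1, test, item1]
    }
}
```

Note that both forms require at least one other word between the two occurrences, so an adjacent repeat such as `item1  , test, test   ;` from Example_2 in the question is not reported by either of them.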

Java regex variable declaration:

String regex = "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";

To make it fully Unicode aware add (?U):

String regex = "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
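
For validation, the declaration above might be wrapped as follows. This is a sketch under the assumption that any match means the input is invalid; `UnicodeDuplicateCheck` and `hasDuplicate` are hypothetical names. Without `(?U)`, Java's `\w` only covers `[a-zA-Z_0-9]`, so words with accented letters would not be treated as whole words.

```java
import java.util.regex.Pattern;

// Hypothetical validator built around the (?U) declaration from the answer.
class UnicodeDuplicateCheck {
    // (?U) turns on UNICODE_CHARACTER_CLASS, so \w, \s and \b follow Unicode rules.
    static final Pattern DUPLICATE = Pattern.compile(
        "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b");

    /** Returns true when some word is repeated with at least one other word in between. */
    static boolean hasDuplicate(String text) {
        return DUPLICATE.matcher(text).find();
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; escapes are used so the source file encoding does not matter.
        System.out.println(hasDuplicate("caf\u00e9, th\u00e9, caf\u00e9;")); // prints true
        System.out.println(hasDuplicate("caf\u00e9, th\u00e9;"));           // prints false
    }
}
```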

edited Mar 10 '20 at 14:42

answered Mar 10 '20 at 14:13

Wiktor Stribiżew


If the string were “ item1 , test, item1, test, item1;”, only the first “item1” (not “test”) is matched with your first regex. Is that consistent with your understanding of the question? – Cary Swoveland Mar 10 '20 at 15:30
@CarySwoveland OP needs validation. If there is a match, the string is invalid. Certainly it is in line. – Wiktor Stribiżew Mar 10 '20 at 15:42
What do you think about replacing [^\w\s] with [\W]? – J-Alex Mar 10 '20 at 16:08
Wiktor, but the OP's task is "Find any duplicate words...". – Cary Swoveland Mar 10 '20 at 16:14
@J-Alex `\W` matches a space, too. – Wiktor Stribiżew Mar 10 '20 at 16:15
@CarySwoveland Then the https://regex101.com/r/alvCQw/2 (second regex in the answer) will do this. – Wiktor Stribiżew Mar 10 '20 at 16:16
Yes, impressive! I see you are avoiding consecutive matches. I expect that is because the question has been a moving target. – Cary Swoveland Mar 10 '20 at 16:23
