I'm writing a syntax checker (in Java) for a file that contains keywords and values separated by commas, with a semicolon ending each line. The amount of whitespace between tokens is unspecified.

What is required:

Find any duplicate words (consecutive and non-consecutive) in the multiline file.

// Example_1 (duplicate 'test'):
item1  , test, item3   ;
item4,item5;
test , item6;

// Example_2 (duplicate 'test'):
item1  , test, test   ;
item2,item3;

I've tried the pattern (\w+)(s*\W\s*\w*)*\1, but it doesn't catch duplicates properly.

J-Alex
  • Do you consider duplicates even if they are separated with multiple non-word characters? Like `word \n\t--- word`? Or should there only be strictly `[whitespace*][nonword][whitespace*]` between words that are checked for duplication? – Wiktor Stribiżew Mar 10 '20 at 14:22
  • By non-consecutive, do you mean `item1, item1` should not be matched? – Chris Clayton Mar 10 '20 at 14:31
  • @Alex See my updated answer that won't match if there are consecutive dupes – Wiktor Stribiżew Mar 10 '20 at 14:44
  • Suppose the string were `"a, a, a"`. Is only the first word to be matched, or are both the first and second words to be matched (much easier)? – Cary Swoveland Mar 10 '20 at 15:09
  • @Cary Swoveland In the perfect case, I would like to match the first duplicate, because it will make the error handling (need to report the position of the first duplicate) much easier. – J-Alex Mar 10 '20 at 15:11
  • I don't think a regex is the right tool for this task. Instead, create a hash whose keys are the unique words in the text and whose values are the number of occurrences of each word. Then select the keys whose values are greater than one. – Cary Swoveland Mar 10 '20 at 16:10
  • ...as done [here](https://stackoverflow.com/questions/26282009/how-to-count-the-number-of-occurrences-of-each-word), for example. (It's a one-liner in Ruby: `File.read(fname).split.tally.select { |_,v| v > 1 }.keys`). – Cary Swoveland Mar 10 '20 at 16:38
  • So it is a dupe of [this thread](https://stackoverflow.com/a/51190570/3832970) that deals with both consecutive and non-consecutive duplicate words. – Wiktor Stribiżew Mar 11 '20 at 11:19
  • @Wiktor Stribiżew with all due respect, the OP's goal is to find consecutive duplicates (which is quite a frequent case), and searching for one of the answers that happens to cover something additional is like searching for a needle in a haystack. Moreover, the thread doesn't work with multiline input. – J-Alex Mar 11 '20 at 11:39
  • Multiline matching is a [solved issue](https://stackoverflow.com/questions/159118/how-do-i-match-any-character-across-multiple-lines-in-a-regular-expression), too. – Wiktor Stribiżew Mar 11 '20 at 11:44
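
The hash-based approach suggested in the comments above could be sketched in Java roughly as follows. This is a minimal sketch, not code from the thread: the class name `DuplicateWordScan`, the method name, and the choice to treat every `\w+` run as a word are assumptions made for illustration. It records where each word first appears and returns the offset of the earliest word that shows up again later, which is the position the asker wants to report.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original thread.
class DuplicateWordScan {
    // Assumption: a "word" is any run of word characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Returns the offset of the earliest word that occurs again later, or -1 if every word is unique. */
    static int firstDuplicateOffset(String text) {
        Map<String, Integer> firstSeenAt = new HashMap<>();
        int earliest = -1;
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // putIfAbsent returns the stored offset when the word was already seen, else null.
            Integer previous = firstSeenAt.putIfAbsent(m.group(), m.start());
            if (previous != null && (earliest == -1 || previous < earliest)) {
                earliest = previous;
            }
        }
        return earliest;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;";
        System.out.println(firstDuplicateOffset(example1));
    }
}
```

Unlike the lookahead-based patterns, this scans the input once, so it may be the more comfortable choice for large files.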

2 Answers

You may use this regex with mode DOTALL (single line):

(?s)(\b\w+\b)(?=.*\b\1\b)

RegEx Details:

  • (?s): Enable DOTALL mode
  • (\b\w+\b): Match a complete word and capture it in group #1
  • (?=.*\b\1\b): Lookahead to assert that we have back-reference \1 present somewhere ahead. \b is used to make sure we match exact same word again.
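
As a rough illustration of how this pattern might be wired into the syntax checker, here is a minimal Java sketch. The class name `FirstDuplicate` and its method are hypothetical; only the regex itself comes from the answer. Because the matcher scans left to right, the first hit is the earliest word that occurs again later in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from the answer.
class FirstDuplicate {
    // (?s) lets '.' cross line breaks, so the lookahead searches the rest of the file.
    private static final Pattern DUPLICATE =
        Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");

    /** Returns the offset of the first word that is repeated later, or -1 if there is none. */
    static int firstDuplicateOffset(String text) {
        Matcher m = DUPLICATE.matcher(text);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;";
        System.out.println(firstDuplicateOffset(example1));
    }
}
```

For Example_1 from the question this reports offset 9, the position of the first `test`.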

Additionally:

Based on the earlier comments, if the intent is to not match consecutive word repeats such as item1 item1, then the following regex may be used:

(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)

There is one extra negative lookahead assertion here to make sure we don't match consecutive repeats.

  • (?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
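
A small Java sketch of this variant, under the same assumptions as before (the class name `NonConsecutiveDuplicate` and its method are made up for illustration, only the regex is from the answer):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the second regex from the answer.
class NonConsecutiveDuplicate {
    // The extra (?!...) rejects a word whose very next word is the same word.
    private static final Pattern DUPLICATE =
        Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    /** Returns the offset of the first non-consecutive duplicate, or -1 if there is none. */
    static int firstDuplicateOffset(String text) {
        Matcher m = DUPLICATE.matcher(text);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1  , test, test   ;\nitem2,item3;";
        System.out.println(firstDuplicateOffset(example1));
        System.out.println(firstDuplicateOffset(example2));
    }
}
```

With the question's inputs, Example_1 still yields offset 9, while Example_2 yields -1 because its only repeat is consecutive.
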
anubhava

You may use

\b(\w+)\b(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b

See the regex demo

Details

  • \b(\w+)\b - Group 1: one or more word chars as a whole word
  • (?:\s*[^\w\s]\s*\w+)+ - 1 or more occurrences of:
    • \s* - 0+ whitespaces
    • [^\w\s] - 1 char other than a word and whitespace char
    • \s* - 0+ whitespaces
    • \w+ - 1+ word chars
  • \s* - 0+ whitespaces
  • [^\w\s] - 1 char other than a word and whitespace char
  • \s* - 0+ whitespaces
  • \b\1\b - the same value as in Group 1 as whole word.

To only match the word, put the second part of the regex into a positive lookahead:

\b(\w+)\b(?=(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b)
         ^^^                                        ^

See this regex demo.
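
To show how the lookahead form could be used to collect the offending words together with their offsets, here is a minimal Java sketch. The class name `LookaheadDuplicates`, the method `findAll`, and the `word@offset` output format are illustrative assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on the lookahead regex above.
class LookaheadDuplicates {
    private static final Pattern DUPLICATE = Pattern.compile(
        "\\b(\\w+)\\b(?=(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b)");

    /** Returns every matched word as "word@offset", in order of appearance. */
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            // Only the word itself is consumed; the rest is checked by the lookahead.
            hits.add(m.group(1) + "@" + m.start(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String example1 = "item1  , test, item3   ;\nitem4,item5;\ntest , item6;";
        System.out.println(findAll(example1));
    }
}
```

For Example_1 this returns a single entry, `test@9`. For Example_2 it returns an empty list, since the pattern requires at least one other word between the two occurrences.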

Java regex variable declaration:

String regex = "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";

To make it fully Unicode-aware, add (?U):

String regex = "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
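
The effect of `(?U)` can be seen with a short Java sketch. The class name `UnicodeDuplicateCheck`, the helper method, and the sample input are assumptions chosen for illustration. The Cyrillic word is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two declarations above.
class UnicodeDuplicateCheck {
    private static final String BODY =
        "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";

    // Without the flag, the word class covers ASCII letters, digits and underscore only.
    static final Pattern ASCII_ONLY = Pattern.compile(BODY);
    // With (?U), the predefined classes follow Unicode definitions.
    static final Pattern UNICODE = Pattern.compile("(?U)" + BODY);

    static boolean hasDuplicate(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        // A four-letter Cyrillic word, then item2, then the same Cyrillic word again.
        String text = "\u0442\u0435\u0441\u0442, item2, \u0442\u0435\u0441\u0442;";
        System.out.println(hasDuplicate(ASCII_ONLY, text));
        System.out.println(hasDuplicate(UNICODE, text));
    }
}
```

Only the `(?U)` variant finds the repeated Cyrillic word here, while both variants behave the same on plain ASCII input.
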
Wiktor Stribiżew