Regex to find inconsecutive duplicate words (i.e. occurs more than once in a string)

Question

What is a regex which finds all instances of all words which appear more than once in a string (not necessarily appearing consecutively)?

For example, in the string:

How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck all the wood he could chuck if a woodchuck could chuck wood.

It would find every instance of duplicate word(s); in the sample above, it would find the following words: "wood","could","a","woodchuck","chuck","if".

I have scoured the internet for such a regex to no avail. One would think this would be what all of the questions regarding "finding a duplicate using regex" on SO would be talking about, but they all only talk about adjacent words like "the the".

You may use `\b(\w+)\b(?=.*\1)` and filter out the duplicates. Here's a [demo](https://regex101.com/r/jPJkw8/1). — 41686d6564 stands w. Palestine, Sep 17 '20 at 02:54
@41686d6564 No. I was asking when the words are not right next to each other. — Sam, Sep 17 '20 at 03:10
Please do not spam languages unrelated to the question. You've made it clear in the comments of a now-deleted answer you want a regex. If you wish to specify an engine, fine, but specifying 4 is not acceptable. — ikegami, Sep 17 '20 at 03:49
@ikegami you are absolutely right. Thank you for the feedback. I am still learning how to ask a good SO question. — Sam, Sep 17 '20 at 03:52
@ikegami do you think adding the language-agnostic tag effectively conveys that any regex engine works, or just diverts the focus of the question from pure regexes? — Sam, Sep 17 '20 at 03:54
@41686d6564, That misses the last instance of each duplicate words, and it fails (false positive) for `a banana` (though the latter is easily fixed). — ikegami, Sep 17 '20 at 04:22
@ikegami It wasn't clear initially that the OP wanted to match all the duplicates. You're right about the second part though. It should've been `\b\1\b` instead of `\1`. — 41686d6564 stands w. Palestine, Sep 17 '20 at 04:41
"find words in a string which appear more than once in the string" isn't unclear. — ikegami, Sep 17 '20 at 04:43
This question should not have been closed as a duplicate of the provided link. While one of the answers on the linked page indeed claims that it can find _non-consecutive_ duplicates (the question asks for consecutive ones), that answer is wrong in suggesting a simple lookahead. For why that is wrong see @ikegami's answer here. Voted to reopen. — zdim, Sep 19 '20 at 06:27
While it might seem that this question is similar enough to [this one](https://stackoverflow.com/q/2823016/589924), they are quite different. This question requires detecting both `A`s and `B`s as duplicates in `A B A B`, so a fundamentally different approach than the ones used to answer the linked question is needed. Question re-opened. — ikegami, Sep 19 '20 at 07:38

ikegami · Accepted Answer · 2020-10-30T02:16:19.910

You would need the following:

\b\w+\b
(?: (?= .* \b(\1)\b )
|   (?<= \b\1\b .* \1 )
)

(Make sure that . can match any character in the engine you are using. Adjust as needed.)

You haven't specified a regex engine, but I can't think of any that support variable-width lookbehinds.^[1] Yet, this is required to achieve what you want.

It's also horribly slow, having O(N^2) time in terms of words.^[2]

Ok, someone showed that Variable-Length Lookbehinds: actually possible in Perl/PCRE! They used recursion to step back one character at a time. Have fun with that.

One would normally use two passes, one to find the duplicates, and one to do the "finding".

my %seen;
my @dups = grep ++$seen{$_} == 2, $file =~ /\w+/g;
my $alt = join "|", @dups;
$file =~ s/\b($alt)\b/<$&>/g;

This is O(N) in terms of words.

Technically, starting in Perl 5.30, lookbehinds "can handle variable lengths from 1 to 255 characters as an experimental feature." This is far too small for the OP, who spoke of in terms of GB in a now-deleted comment.

Imagine you have a document with N words, all different.

Word 1 needs to be compared against the N-1 following words and 0 preceding words.
Word 2 needs to be compared against the N-2 following words and 1 preceding word.
...
Word N-1 needs to be compared against the 1 following word and N-2 preceding words.
Word N needs to be compared against the 0 following words and N-1 preceding word.

  O( (N-1)+0 + (N-2)+1 + ... + 1+(N-2) + 0+(N-1) )
= O( [ (N-1)+(N-2)+...+1+0 ] + [ 0+1+...+(N-2)+(N-1) ] )
= O( [ 0+1+...+(N-2)+(N-1) ] * 2 )
= O( 1+...+(N-2)+(N-1) )               # Constant factors irrelevant in O()
= O( (N-1) * ((N-1)+1) / 2 )           # 1+2+..x == x*(x+1)/2
= O( (N-1) * N / 2 )
= O( (N-1) * N )                       # Constant factors irrelevant in O()
= O( N^2 - N )
= O( N^2 )                             # An N term is subsumed by a N^2 term

What do you mean "variable-width lookbehinds"? Where would the variable width go? — Sam, Sep 17 '20 at 04:04
`.*` matches a variable number of characters, `(?<=...)` is a lookbehind. So `(?<=\b\1\b.*\b\1\b)` is an example of a variable-width lookbehind. — ikegami, Sep 17 '20 at 04:14
ok, so variable-width lookbeghind is technically possible in some regex engine. See updated answer. — ikegami, Sep 17 '20 at 04:39
Why does the look behind have to have two instances of the match? — Sam, Sep 17 '20 at 10:57
Because the lookbehind starts after the word. The second `\1` walks back to the start of the word. (Removed the useless `\b` from around the second `\1`) — ikegami, Sep 17 '20 at 18:08

zdim · Answer 2 · 2021-03-01T04:38:56.997

One way to do the key part with regex is by using a feature that allows code execution inside regex, to build a frequency hash (map, dictionary). That can then be used to select words that repeat.

Doing it all in regex alone, not ever using any language or tool, isn't really possible (except perhaps using recursion if supported). The rest of this post employs basics of a programming language, in the context of the regex feature that allows code execution in regex.

A feature that I think is available to most engines is to run code in order to prepare the replacement string. I use Perl for the following short samples. What is of interest here is done using a side effect, what of course isn't optimal (and usually results in clumsy looking code).

Either run the normal substitution and put back the matched word

$string =~ s/(\w+)/++$freq{$1}; $1/eg;

or use the "non-destructive" substitution, which doesn't change the target (and returns the changed string or the original if the match failed)

my $toss = $string =~ s/(\w+)/++$freq{$1}/egr;

The returned string is unneeded for the task as described. In both cases we run a substitution on each word when this isn't what we actually need.

Then print keys (words) with frequencies larger than 1

foreach my $word (keys %freq) { say $word if $freq{$word} > 1 }

The regex matchs "words" per \w character class; adjust as suitable for your need.

Altogether, since this is a tricky task for a regex I'd recommend to split the string into words and count off duplicates using your language's features, rather than push the regex. Any language worth its salt can do this rather elegantly and efficiently.

With Perl, another way would be to use the embedded code construct, that allows code to run in the matching part. As far as I know that is not available in other languages, save for some libraries.

One can run code in the matching part like so

my @matches = $string =~ /(\w+)(?{++$freq{$1}})/g;

where the construct (?{code}) will execute embedded code, here building the frequency hash. Using this feature (safely) requires close reading of documentation.

The @matches above has all words, and is not of interest in the problem as stated; it is used here to put the regex's match operator in the "list context" so that the search continues through the string via the /g modifier.

woops, I think I only saw the bottom of your answer. – ikegami Sep 22 '20 at 21:41 — ikegami, Sep 22 '20 at 21:41

Polar Bear · Answer 3 · 2020-09-17T04:23:01.420

1

Perhaps following demo code is easy to understand

use strict;
use warnings;
use feature 'say';

my $text = do { local $/, <DATA> };

my(%count,@found,$regex);

$count{$_}++ for split '[ .,?]', $text;
$count{$_} > 1 && push @found, $_ for keys %count;
$regex  = join('|',@found);

say 'Before: ' . $text;
$text =~ s/\b($regex)\b/<$1>/g;
say 'After: ' . $text;

__DATA__
How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck all the wood he could chuck if a woodchuck could chuck wood.

Output

Before: How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck all the wood he could chuck if a woodchuck could chuck wood.
After: How much <wood> <could> <a> <woodchuck> <chuck> <if> <a> <woodchuck> <could> <chuck> <wood>? A <woodchuck> would <chuck> all the <wood> he <could> <chuck> <if> <a> <woodchuck> <could> <chuck> <wood>.

edited Sep 17 '20 at 04:23

answered Sep 17 '20 at 03:24

Polar Bear

6,762
1
5
12

@ikegami -- I want it to highlight every instance of: "wood", "could", "a", "woodchuck", "chuck", "if". In code sample `key` words wrapped into `<>`. – Polar Bear Sep 17 '20 at 03:27
This seems like it just turns the regex into ""wood|could|a|woodchuck|chuck|if". Does it not? This would be entirely unhelpful because I do not know what I am looking for. I do not have those words ahead of time. – Sam Sep 17 '20 at 03:28
@Sam - then in your question you should ask "How to **highllight** words which duplicate in a text. And state if first instance of the word should be highlighted or not. In present form your question is not clear what your final goal. You could give sample of input and processed output to make it transparent. – Polar Bear Sep 17 '20 at 03:32
@Sam -- see my updated code, where we do not know in advance what words are repeated. – Polar Bear Sep 17 '20 at 03:42
What language is this written in? – Sam Sep 17 '20 at 04:19
Great answer. Would you mind taking away a -1 from the question so that others can find it? – Sam Sep 17 '20 at 04:27

Regex to find inconsecutive duplicate words (i.e. occurs more than once in a string)

3 Answers3