Regular expression for duplicate words

Question

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:

Paris in **the the** spring.

Not **that that** is related.

Why are you laughing? Are **my my** regular expressions THAT bad??

Is there a single regular expression that will match ALL of the bold strings above?

@poly: That was no "accusation", but a calm, normal question that can perfectly well take a "no" as an answer. @Joshua: Yes, some people (not too few) let this site do their homework for them. But asking homework questions is not a bad thing to do on SO, when they are tagged as such. Usually the style of the answers changes from "here is the solution" to "here are some things you have not thought about", and that is a good thing. Somebody has to try and keep up the distinction; in this case it was me, and elsewhere "other people" do the same thing. That's all. — Tomalak, May 13 '10 at 06:39
Hope to never see a question like "This sounds a bit like a workplace question. Is it?" and then people will argue if stack overflow is doing someone's job. — marcio, Dec 10 '14 at 21:16
@Joshua +1 with respect to the regex solution you accepted, could you please tell me how could I replace the matches (duplicates) by one element of the pair (e.g., `not that that is related` -> `not that is related`)? Thanks in advance — Antoine, Apr 20 '16 at 09:53
@Joshua I think I found the solution: I should replace by `\1`! — Antoine, Apr 20 '16 at 09:59
This solution handles consecutive duplicate words. What about the more general situation, when a word is repeated more than twice? For example: "Not that **that that** is related". — David Leal, Feb 15 '17 at 00:19
[This answer](https://stackoverflow.com/a/51190570/3832970) deals with both consecutive and non-consecutive duplicate words. — Wiktor Stribiżew, Mar 11 '20 at 11:18
careful though: "not that that's related. But..." could be proper grammar. Or at least usage. But that that in other contexts is perfectly correct (...so that that nation can exist...) — Neil S3ntence, Jul 12 '20 at 18:54

score 222 · Accepted Answer · edited Mar 06 '22 at 03:59

Try this regular expression:

\b(\w+)\s+\1\b

Here \b is a word boundary and \1 references the captured match of the first group.

Regex101 example here
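
For anyone who wants to try the pattern outside regex101, here is a minimal Java sketch that runs it over the three sentences from the question. The class and method names (`DuplicateWordFinder`, `findDuplicates`) are made up for illustration and are not part of the answer; only the pattern itself comes from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWordFinder {
    // The pattern from the answer: a word, whitespace, then the same word again.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern DUPLICATE = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns every doubled word pair found in the text, in order of appearance.
    // (Hypothetical helper, named here only for the example.)
    static List<String> findDuplicates(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates("Paris in the the spring."));                 // [the the]
        System.out.println(findDuplicates("Not that that is related."));                // [that that]
        System.out.println(findDuplicates("Are my my regular expressions THAT bad??")); // [my my]
    }
}
```

To keep only one copy of each pair, the same compiled pattern can be used with `matcher(text).replaceAll("$1")`; Java writes the group reference in the replacement as `$1` rather than `\1`.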

edited Mar 06 '22 at 03:59

mustafa candan

answered May 12 '10 at 21:55

Gumbo

Makes me wonder; is it possible to do `\0` too? (Where `\0` is the whole regex, up to the current point OR where `\0` refers to the whole regex) – Pindatjuh May 12 '10 at 22:37
@Pindatjuh: No, I don’t think so because that sub-match would also be part of the whole match. – Gumbo May 12 '10 at 22:40
At least works on the regex engine used in the Eclipse search/replace dialog. – Chaos_99 May 24 '13 at 12:11
This would treat hyphens etc. as marking a word boundary, e.g. `the the-foo bar`. @Daniel's answer is slightly more correct. – Zephyr was a Friend of Mine Apr 15 '15 at 16:00
Just a warning, this does not handle words with apostrophes or (as Noel mentions) hyphens. Mike's solution works better in these cases – May 13 '15 at 00:44
Moreover, it won't catch triplicates (or more), not when one of the dup/triplicate is at the end of the string – Nico Feb 18 '16 at 20:03
+1 nice solution. Could you tell me how to replace the matches (duplicates) with the first element of the pair (e.g., `and and` should become `and`)? – Antoine Apr 20 '16 at 09:46
Don't know why it does not work in Python; the regex looks good to me. When I call the match function, it always returns `None` – Lucas Huang Jan 20 '18 at 01:28
@LucasHuang Try `re.search`. See [search() vs. match()](https://docs.python.org/3/library/re.html#search-vs-match). – ytu Jun 05 '18 at 02:21
and If I want to find all consecutive words from a particular tag, such as `
bla bla
` how can I integrate this regex formula? – Just Me Apr 22 '19 at 10:47
Doesn't work when the 2nd word is the last word on the line. The regex `\b(\w+)\s+\1$` works in those cases but that doesn't work when the 2nd word is *not* at the end of a line. Any ideas? \[edit\] Found the [answer](https://stackoverflow.com/a/9005999/1052284): `\b(\w+)\s+\1(?:\s|$)` – Mark Jeronimus Aug 04 '19 at 08:25
What if I want to capture any number of repetitions of a word? – CKM Jan 08 '23 at 13:28

score 30 · Answer 2 · edited Feb 01 '18 at 03:35

I believe this regex handles more situations:

/(\b\S+\b)\s+\b\1\b/

A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
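
As a rough illustration of where the two patterns differ, the Java sketch below runs the accepted answer's `\w+` pattern and this answer's `\S+` pattern over the same inputs. The class name, the constant names and the `firstMatch` helper are invented for the example, and the sample strings are my own rather than taken from the linked test page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordVersusNonSpace {
    // Accepted answer: the repeated "word" may only contain word characters.
    static final Pattern WORD_CHARS = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
    // This answer: any run of non-whitespace that starts and ends on a word boundary.
    static final Pattern NON_SPACE = Pattern.compile("(\\b\\S+\\b)\\s+\\b\\1\\b");

    // Returns the first matched text, or null when the pattern finds nothing.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Apostrophe: only the \S+ version sees the repeated word.
        System.out.println(firstMatch(WORD_CHARS, "joe's joe's car")); // null
        System.out.println(firstMatch(NON_SPACE, "joe's joe's car"));  // joe's joe's

        // Hyphen: the \w+ version matches only a fragment of the repeated word.
        System.out.println(firstMatch(WORD_CHARS, "ahi-ahi ahi-ahi")); // ahi ahi
        System.out.println(firstMatch(NON_SPACE, "ahi-ahi ahi-ahi"));  // ahi-ahi ahi-ahi
    }
}
```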

edited Feb 01 '18 at 03:35

mickmackusa

answered Sep 06 '12 at 23:40

Mike Viens

Great, works with apostrophes/hyphens/etc. too - thanks! – May 13 '15 at 00:45
for the challenge1 link, what do you place in the replace area to use the grouped word? Tried `\0` but not working. – uptownhr Feb 08 '16 at 20:56
It won't catch triplicates (or more), not when one of the dup/triplicate is at the end of the string – Nico Feb 18 '16 at 20:05
@uptownhr You want to use `$1 $2`. But also use different regex `/\b(\S+) (\1)\b/gi`. Here is a link: https://callumacrae.github.io/regex-tuesday/challenge1.html?find=%2F%5Cb(%5CS%2B)%20(%5C1)%5Cb%2Fgi&replace=%241%20%3Cstrong%3E%242%3C%2Fstrong%3E – dsalaj Aug 09 '18 at 07:05
and If I want to find all consecutive words from a particular tag, such as `
bla bla
` how can I integrate this regex formula? – Just Me Apr 22 '19 at 10:47
The `\b` (word boundary metacharacter ) followed by `\s` (a non-word metacharacter) will demand that the last character matched by `\S` MUST be a "word character" (`\w`) -- so this will technically cover more scenarios, but is a little misleading to those who don't know how `\b` behaves. – mickmackusa Nov 03 '22 at 06:39

score 24 · Answer 3 · edited May 03 '22 at 16:40

The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.

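import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(input);

// Check for subsequences of input that match the compiled pattern
while (m.find()) {
    // String.replace is literal: the matched text is not re-parsed as a regex
    input = input.replace(m.group(0), m.group(1));
}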

Sample Input : Goodbye goodbye GooDbYe

Sample Output : Goodbye

Explanation:

The regular expression:

\b : A word boundary (here, the start of a word)

\w+ : One or more word characters

(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word and ends at a word boundary. The + on the whole group lets it match more than one repetition.

Grouping :

m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe

m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye

Replace method shall replace all consecutive matched words with the first instance of the word.

score 12 · Answer 4 · edited Jul 26 '17 at 11:34

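Try this with the RE below:

\b start of a word (word boundary)
\W+ one or more non-word characters (the separator between the repeated words)
\1 the same word that was already matched
\b end of the word

()* repeats the group zero or more times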

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWords { // class name is arbitrary

    public static void main(String[] args) {

        // A word followed by zero or more repetitions of itself, separated by non-word characters.
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

        Scanner in = new Scanner(System.in);

        // The first line of input holds the number of sentences that follow.
        int numSentences = Integer.parseInt(in.nextLine());

        while (numSentences-- > 0) {
            String input = in.nextLine();

            Matcher m = p.matcher(input);

            // Check for subsequences of input that match the compiled pattern
            while (m.find()) {
                // String.replace is literal, so punctuation in the match is not read as regex syntax.
                input = input.replace(m.group(0), m.group(1));
            }

            // Prints the modified sentence.
            System.out.println(input);
        }

        in.close();
    }
}

Niket Pathak · Answer 5 · 2022-05-27T16:35:56.597

11

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)

Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.

/\b(\w+)\b(?=.*?\b\1\b)/ig

Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
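
A small Java sketch of what replacing with this pattern does in practice; the class and method names are hypothetical, and the sample sentence is an arbitrary one. Each match is a word that occurs again later on the same line, so replacing matches with the empty string keeps only the last occurrence and leaves the surrounding spaces in place.

```java
import java.util.regex.Pattern;

class NonConsecutiveDuplicates {
    // A word is matched only if the same word appears again later on the same line
    // ('.' does not cross line breaks by default).
    private static final Pattern REPEATED_LATER =
            Pattern.compile("\\b(\\w+)\\b(?=.*?\\b\\1\\b)", Pattern.CASE_INSENSITIVE);

    // Deletes every occurrence except the last one; whitespace is left untouched.
    static String keepLastOccurrence(String text) {
        return REPEATED_LATER.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Brackets make the leftover leading space visible.
        System.out.println("[" + keepLastOccurrence("the cat sat on the mat") + "]");
        // prints "[ cat sat on the mat]"
    }
}
```

Note that the first "the" is removed even though nothing is wrong with the sentence, so this is only suitable when non-consecutive repeats really are unwanted.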

Example Source

edited May 27 '22 at 16:35

answered Jul 05 '18 at 11:46

Niket Pathak

Non-consecutive is a bad idea: `"the cat sat on the mat"` -> `" cat sat on the mat"` – Walf Dec 06 '18 at 03:47
@Walf True. Nevertheless, there are scenarios where this is intended. (for example: whilst scraping data) – Niket Pathak Dec 06 '18 at 12:01
Why'd you [break your regex again](https://regex101.com/r/XuTNPS/1) after [I corrected it](https://regex101.com/r/Cw8SmI/2)? Did you think I had changed its intent? Even the example you linked doesn't have the mistake. – Walf Dec 06 '18 at 19:38
Yep, it was a mistake, copy pasted the wrong stuff. Intended to copy the one from my example actually. anyway, it now works! so all good! Thanks! – Niket Pathak Dec 07 '18 at 09:45
I had a similar use case to remove duplicate characters from a string in java and your solution helped me. Thanks. If anyone else is looking for the code to remove duplicate chars from String in java - s1.replaceAll("(.)(?=.*?\\1)", "") – tanson Jun 11 '21 at 00:18

soulmerge · Answer 6 · 2010-05-13T09:16:02.367

8

The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):

(\b\w+\b)\W+\1

edited May 13 '10 at 09:16

answered May 12 '10 at 21:55

soulmerge

You need something to match the characters *between* the two words, like `\W+`. `\b` won't do it, because it doesn't consume any characters. – Alan Moore May 12 '10 at 22:35
This will potentially result in false-positive matching in cases like `... the these problems...`. This solution is not as reliable as the general structure of Gumbo's pattern which sufficiently implements word boundaries. – mickmackusa Feb 01 '18 at 04:56
and If I want to find all consecutive words from a particular tag, such as `
bla bla
` how can I integrate this regex formula? – Just Me Apr 22 '19 at 10:47

score 5 · Answer 7 · answered Mar 24 '18 at 00:08

Here is one that catches multiple words multiple times:

(\b\w+\b)(\s+\1)+

answered Mar 24 '18 at 00:08

synaptikon

and If I want to find all consecutive words from a particular tag, such as `
bla bla
` how can I integrate this regex formula? – Just Me Apr 22 '19 at 10:46
I believe that will require HTML parsing. For any given tag that you wish to search, find all occurrences of the tag in the HTML and run this regex on each one in turn. Or, if you don't care where in the HTML the repetition occurs, concatenate the text of all the tags and run the regex on the concatenated string – synaptikon Apr 24 '19 at 16:13
I find myself the answer `
.*?\b\s+(\w+)\b\K\s+\1\s+\b(?=.*?<\/p>)`
– Just Me Apr 25 '19 at 12:18

score 4 · Answer 8 · answered May 12 '10 at 21:53

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

answered May 12 '10 at 21:53

Ignacio Vazquez-Abrams

Though being correct in a strict sense, I believe there is no regex engine in serious use anymore that does not support grouping and back-references. – Tomalak May 12 '10 at 22:35

score 4 · Answer 9 · answered Jul 18 '15 at 01:17

This is the regex I use to remove duplicate phrases in my twitch bot:

(\S+\s*)\1{2,}

(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.

\1{2,} then looks for two or more further copies of that phrase, so the phrase must occur at least three times in a row to match. If there are 3 phrases that are identical, it matches.

answered Jul 18 '15 at 01:17

Neceros

This answer is misleading. It does not hunt duplicates, it hunts substrings with 3 or more occurrences. It is also not very robust because of the `\s*` in the capture group. See this demonstration: https://regex101.com/r/JtCdd6/1 – mickmackusa Feb 01 '18 at 04:10
Furthermore extreme cases (low-frequency text) would produce false positive matches. E.g. `I said "oioioi" that's some wicked mistressship!` on `oioioi` and `sss` – mickmackusa Feb 01 '18 at 07:42

score 3 · Answer 10 · answered Feb 01 '18 at 04:41

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.

Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)

This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).

Specifically:

\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.

*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
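
The pattern and replacement above can be tried in Java roughly as follows. This is only a sketch: the class and method names are invented, and the sample string is my own.

```java
import java.util.regex.Pattern;

class RepeatedRunCollapser {
    // (\b\S+) captures a non-whitespace substring starting at a word boundary;
    // (?:\s+\1\b)+ then requires one or more whitespace-separated copies of it.
    private static final Pattern RUN = Pattern.compile("(\\b\\S+)(?:\\s+\\1\\b)+");

    // Replaces each run of repeated substrings with capture group 1.
    static String collapse(String text) {
        return RUN.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("here here here     here is ahi-ahi ahi-ahi the result result"));
        // prints "here is ahi-ahi the result"
    }
}
```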

score 2 · Answer 11 · answered Apr 24 '13 at 21:04

The example in Javascript: The Good Parts can be adapted to do this:

var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;

\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.

score 1 · Answer 12 · edited Feb 01 '18 at 03:58

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:

s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")

I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)

First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.

I tried it like this and it worked well:

var s = "here here here     here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result     result";
console.log(s.replace(/(\b\S+\b)(($|\s+)\1)+/g, "$1"));
--> here is ahi-ahi joe's the result

edited Feb 01 '18 at 03:58

mickmackusa

answered Feb 18 '16 at 20:08

Nico

I'm having trouble rewriting this into PHP, it's vital I get a single copy of the matched duplicate replacing each occurrence of duplicates/triplicates etc. So far I have: preg_replace('/(^|\s+)(\S+)(($|\s+)\2)+/im', '$0', $string); – AdamJones Feb 28 '17 at 16:26
This is the best answer. I just made a tweak to that by adding `\b` to the end like so: `/(^|\s+)(\S+)(($|\s+)\2)+\b/g, "$1$2")` This will then work for situations like this: `the the string String string stringing the the along the the string` will become `the string stringing the along the string` Notice `string stringing`. It gets matched with your answer. Thank you. – Ste Aug 18 '19 at 01:20

score 1 · Answer 13 · answered Nov 08 '21 at 18:58

Try this regular expression; it handles all repeated-word cases:

\b(\w+)\s+\1(?:\s+\1)*\b

answered Nov 08 '21 at 18:58

MIsmail

score 1 · Answer 14 · answered Jan 08 '23 at 16:25

To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.

The pattern will have a match in:

Paris in the the spring.
Not that that is related.

The pattern will not have a match in:

This is $word word

(?<!\S)(\w+)\s+\1(?!\S)

Explanation

(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location

See a regex101 demo.
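
A minimal Java sketch of the whitespace-boundary pattern, using the example strings from this answer plus one extra case of my own ("so so."). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class WhitespaceBoundedDuplicates {
    // (?<!\S) and (?!\S) require whitespace or the string edge on both sides of the pair.
    private static final Pattern BOUNDED = Pattern.compile("(?<!\\S)(\\w+)\\s+\\1(?!\\S)");

    static boolean hasDuplicate(String text) {
        return BOUNDED.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDuplicate("Paris in the the spring."));  // true
        System.out.println(hasDuplicate("Not that that is related.")); // true
        System.out.println(hasDuplicate("This is $word word"));        // false: '$' touches the first word
        System.out.println(hasDuplicate("so so."));                    // false: '.' touches the second word
    }
}
```

The last case shows the trade-off of whitespace boundaries: a duplicate directly followed by punctuation is not reported.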

To find 2 or more duplicate words:

(?<!\S)(\w+)(?:\s+\1)+(?!\S)

This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.

See a regex101 demo.

Alternatives without using lookarounds

You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.

Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.

Matching 2 duplicate words:

(?:\s|^)((\w+)\s+\2)(?:\s|$)

See a regex101 demo.

Matching 2 or more duplicate words:

(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)

See a regex101 demo.

Mahozad · Answer 15 · 2022-06-09T13:58:59.177

I think another solution would be to use named capture groups and backreferences like this:

/.* (?<mytoken>\w+)\s+\k<mytoken> .*/

OR

/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/

Kotlin:

val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)

Java:

var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);

JavaScript:

const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);

// OR

const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);

All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

score 0 · Answer 16 · answered Dec 30 '22 at 05:53

You can use this pattern:

\b(\w+)(?:\W+\1\b)+

This pattern can be used to match all duplicated word groups in sentences. :)

Here is a sample utility function written in Java 17, which replaces all duplications with the first occurrence:

    // Requires: import java.util.regex.Pattern;
    public String removeDuplicates(String input) {
        var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
        var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        var matcher = pattern.matcher(input);
        while (matcher.find()) {
            // String.replace is literal, so punctuation between the words is not read as regex syntax.
            input = input.replace(matcher.group(), matcher.group(1));
        }
        return input;
    }

score -1 · Answer 17 · answered Aug 16 '16 at 15:55

Use this in case you want case-insensitive checking for duplicate words.

(?i)\\b(\\w+)\\s+\\1\\b
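
For reference, a small check of how the `(?i)` flag behaves with this pattern in Java's `java.util.regex`, where case-insensitive matching also applies to the backreference. The class and constant names are made up for the example, and other regex engines may behave differently.

```java
import java.util.regex.Pattern;

class CaseInsensitiveBackreference {
    // Same pattern with and without the embedded case-insensitive flag.
    static final Pattern SENSITIVE = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
    static final Pattern INSENSITIVE = Pattern.compile("(?i)\\b(\\w+)\\s+\\1\\b");

    public static void main(String[] args) {
        String text = "The the quick fox";
        System.out.println(SENSITIVE.matcher(text).find());   // false: "the" is not identical to "The"
        System.out.println(INSENSITIVE.matcher(text).find()); // true: \1 is compared ignoring case
    }
}
```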

answered Aug 16 '16 at 15:55

Neelam

Using the case-insensitive pattern modifier is no use for your pattern. There are no letter ranges for the flag to impact. – mickmackusa Feb 01 '18 at 03:56
This is effectively a duplicate of the accepted answer and adds no value to the page. Please consider removing this answer to reduce page bloat. – mickmackusa Feb 01 '18 at 04:14
