Removing consecutive duplicate words in a string

Question

I am trying to write a function that removes consecutive duplicate words within a string. It's vital that one copy of any match found by the regular expression remains. In other words...

A very very very dirty dog

should become...

A very dirty dog

I have a regular expression that seems to work well (based on this post)

(\b\S+\b)(($|\s+)\1)+

However, I'm not sure how to use preg_replace (or if there's a better function) to implement this. Right now it deletes all of the matching repeated words without leaving one copy of the word intact. Can I pass a variable or special instruction to it so that it keeps one match?

I have this currently...

$string=preg_replace('/(\b\S+\b)(($|\s+)\1)+/', '', $string);

Note there is no point using the `$` in the alternation, as `$\1` will never match (you do not even use the multiline modifier). – Wiktor Stribiżew, Mar 04 '17 at 22:58

Wiktor Stribiżew · Accepted Answer · 2017-03-04T23:22:43.260

4

You may use a regex like \b(\S+)(?:\s+\1\b)+ and replace with $1:

$string=preg_replace('/\b(\S+)(?:\s+\1\b)+/i', '$1', $string);

See the regex demo

Details:

\b(\S+) - Group 1 capturing one or more non-whitespace symbols that are preceded with a word boundary (maybe \b(\w+) would suit better here)
(?:\s+\1\b)+ - 1 or more sequences of:
- \s+ - 1 or more whitespaces
- \1\b - a backreference to the value stored in Group 1 buffer (the value must be a whole word)

The replacement pattern is $1, the replacement backreference that refers to the value stored in Group 1 buffer.

Note that the /i case-insensitive modifier makes the \1 backreference match case-insensitively, so "I have a dog Dog DOG" will result in "I have a dog".
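As a minimal sketch of the above, the pattern can be wrapped in a small helper. The function name `collapse_repeated_words` is hypothetical and only used for illustration; the pattern and the `$1` replacement are the ones from this answer.

```php
<?php
// Hypothetical helper name; pattern and replacement are taken from the answer above.
function collapse_repeated_words(string $text): string
{
    // \b(\S+)       : group 1, a run of non-whitespace starting at a word boundary
    // (?:\s+\1\b)+  : one or more repeats of that same word, each preceded by whitespace
    // i             : the repeats are compared case-insensitively
    // preg_replace returns null on a regex error, so fall back to the input in that case.
    return preg_replace('/\b(\S+)(?:\s+\1\b)+/i', '$1', $text) ?? $text;
}

echo collapse_repeated_words('A very very very dirty dog'), "\n"; // A very dirty dog
echo collapse_repeated_words('I have a dog Dog DOG'), "\n";       // I have a dog
```

Because the replacement is `$1` rather than an empty string, the first occurrence of the word is what survives, including its original casing.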

edited Mar 04 '17 at 23:22

answered Mar 04 '17 at 22:53

Wiktor Stribiżew

Thanks Wiktor! To clarify, my previous expression was also capturing words with different cases. So that may be useful for some and worth putting on the record. – AdamJones Mar 04 '17 at 22:58
I believe my original regex works with different cased words. So "very Very very" would also be caught – AdamJones Mar 04 '17 at 23:01
Ah ok... I just tried the working demo and it doesn't seem to do that – AdamJones Mar 04 '17 at 23:03
You must add the case insensitive modifier. – Wiktor Stribiżew Mar 04 '17 at 23:04
Thanks so much for your help with this Wiktor! – AdamJones Mar 07 '17 at 21:30
unfortunately works only for Latin alphabet, e.g. try it on phrase 'молоко молоко хлеб колбаса ' , my php version is 5.6 – Tebe Sep 21 '18 at 17:49
@Tebe Use `preg_replace('/\b(\S+)(?:\s+\1\b)+/u', '$1', "молоко молоко хлеб колбаса");`. Also, `'/\b(\p{L}+)(?:\s+\1\b)+/u'` will work, too. – Wiktor Stribiżew Sep 21 '18 at 17:52
wow, wow, looks like a black magic. Especially the second approach in comment. As far as I understand " u " modifier is used to make it work with unicode. So the universal solution would be preg_replace('/\b(\S+)(?:\s+\1\b)+/iu', '$1', $text) . Thanks! – Tebe Sep 21 '18 at 18:03
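
A short sketch of the Unicode variants discussed in these comments, assuming the input is valid UTF-8. The variable names are only illustrative; the patterns are the ones quoted above.

```php
<?php
$text = 'молоко молоко хлеб колбаса';

// u: treat the pattern and the subject as UTF-8, so \b works next to Cyrillic letters
$nonSpace = preg_replace('/\b(\S+)(?:\s+\1\b)+/u', '$1', $text);
// \p{L}+ restricts group 1 to Unicode letters instead of any non-whitespace
$letters = preg_replace('/\b(\p{L}+)(?:\s+\1\b)+/u', '$1', $text);

echo $nonSpace, "\n"; // молоко хлеб колбаса
echo $letters, "\n";  // молоко хлеб колбаса

// i and u can be combined, as suggested in the last comment
echo preg_replace('/\b(\S+)(?:\s+\1\b)+/iu', '$1', 'Молоко молоко хлеб'), "\n";
```

Without the `u` modifier the subject is processed byte by byte and the multi-byte Cyrillic characters are not seen as word characters, which is why the original pattern appeared to work only for Latin text.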

score 0 · Answer 2 · answered Mar 04 '17 at 23:23

<?php
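$text = 'one one, two three, two';
// Single-quoted strings keep \b, \s and \1 literal, so no double escaping is needed.
// This pattern collapses one adjacent pair of the same word per match.
$result_text = preg_replace('/\b(\w+)\s+\1\b/i', '$1', $text);
echo "Result Text: " . $result_text; // Result Text: one, two three, two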
?>

Try this. It should return one copy intact.

answered Mar 04 '17 at 23:23

Resheil Agarwal

It is a light version of my solution without more than 1 duplicate word support. – Wiktor Stribiżew Mar 04 '17 at 23:37
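
To illustrate the difference this comment points at, here is a small comparison of the two patterns on a word repeated three times. The variable names are illustrative only.

```php
<?php
$text = 'A very very very dirty dog';

// Pair-only pattern from this answer: each match consumes exactly two copies.
$pairOnly = preg_replace('/\b(\w+)\s+\1\b/i', '$1', $text);
// Same pattern with a repeated non-capturing group, as in the accepted answer.
$anyRun = preg_replace('/\b(\w+)(?:\s+\1\b)+/i', '$1', $text);

echo $pairOnly, "\n"; // A very very dirty dog
echo $anyRun, "\n";   // A very dirty dog
```

The pair-only version leaves a duplicate behind because matching resumes after the text it has already consumed, so the third "very" has no partner left to pair with.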
