
I am trying to write a function that removes consecutive duplicate words within a string. It's vital that one copy of any match found by the regular expression remains. In other words...

A very very very dirty dog

should become...

A very dirty dog

I have a regular expression that seems to work well (based on this post)

(\b\S+\b)(($|\s+)\1)+

However, I'm not sure how to use preg_replace (or whether there's a better function) to implement this. Right now it deletes all matching repeated words without leaving one copy of the word intact. Can I pass a variable or special instruction to it so that one match is kept?

I have this currently...

$string=preg_replace('/(\b\S+\b)(($|\s+)\1)+/', '', $string);
AdamJones

2 Answers


You may use a regex like \b(\S+)(?:\s+\1\b)+ and replace with $1:

$string=preg_replace('/\b(\S+)(?:\s+\1\b)+/i', '$1', $string);

See the regex demo

Details:

  • \b(\S+) - Group 1 captures one or more non-whitespace characters preceded by a word boundary (\b(\w+) may suit better here, since \S+ also matches punctuation)
  • (?:\s+\1\b)+ - 1 or more sequences of:
    • \s+ - 1 or more whitespaces
    • \1\b - a backreference to the value stored in Group 1 buffer (the value must be a whole word)

The replacement pattern is $1, the replacement backreference that refers to the value stored in Group 1 buffer.

Note that the /i case-insensitive modifier also makes the \1 backreference case insensitive, so I have a dog Dog DOG becomes I have a dog.
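The pattern and the effect of the /i modifier can be sketched as a small helper. The function name collapseRepeatedWords and the $ignoreCase parameter are made up for illustration; only preg_replace and the regex come from the answer.

```php
<?php
// Hypothetical helper; the name and the $ignoreCase flag are illustrative only.
function collapseRepeatedWords(string $text, bool $ignoreCase = false): string
{
    // \b(\S+)       Group 1: the first occurrence of the word
    // (?:\s+\1\b)+  one or more whitespace-separated repeats of Group 1
    $pattern = '/\b(\S+)(?:\s+\1\b)+/' . ($ignoreCase ? 'i' : '');

    // '$1' puts back only the first occurrence.
    // preg_replace returns null on a regex error, so fall back to the input.
    return preg_replace($pattern, '$1', $text) ?? $text;
}

echo collapseRepeatedWords('A very very very dirty dog'), "\n";  // A very dirty dog
echo collapseRepeatedWords('I have a dog Dog DOG'), "\n";        // I have a dog Dog DOG
echo collapseRepeatedWords('I have a dog Dog DOG', true), "\n";  // I have a dog
```

Without the flag, differently cased repeats are left alone, which may or may not be what you want.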

Wiktor Stribiżew
  • Thanks Wiktor! To clarify, my previous expression was also capturing words with different cases. So that may be useful for some and worth putting on the record. – AdamJones Mar 04 '17 at 22:58
  • I believe my original regex works with different cased words. So "very Very very" would also be caught – AdamJones Mar 04 '17 at 23:01
  • Ah ok... I just tried the working demo and it doesn't seem to do that – AdamJones Mar 04 '17 at 23:03
  • You must add the case insensitive modifier. – Wiktor Stribiżew Mar 04 '17 at 23:04
  • Thanks so much for your help with this Wiktor! – AdamJones Mar 07 '17 at 21:30
  • unfortunately works only for Latin alphabet, e.g. try it on phrase 'молоко молоко хлеб колбаса ' , my php version is 5.6 – Tebe Sep 21 '18 at 17:49
  • @Tebe Use `preg_replace('/\b(\S+)(?:\s+\1\b)+/u', '$1', "молоко молоко хлеб колбаса");`. Also, `'/\b(\p{L}+)(?:\s+\1\b)+/u'` will work, too. – Wiktor Stribiżew Sep 21 '18 at 17:52
  • wow, wow, looks like a black magic. Especially the second approach in comment. As far as I understand " u " modifier is used to make it work with unicode. So the universal solution would be preg_replace('/\b(\S+)(?:\s+\1\b)+/iu', '$1', $text) . Thanks! – Tebe Sep 21 '18 at 18:03
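The two Unicode variants from the comments above can be tried side by side. This is a minimal sketch that assumes the script file and the input string are UTF-8 encoded; the variable names are illustrative.

```php
<?php
$text = 'молоко молоко хлеб колбаса';

// /u treats pattern and subject as UTF-8, so \b and \S work on characters, not bytes.
$anyNonSpace = preg_replace('/\b(\S+)(?:\s+\1\b)+/iu', '$1', $text);

// \p{L}+ restricts Group 1 to Unicode letters only.
$lettersOnly = preg_replace('/\b(\p{L}+)(?:\s+\1\b)+/u', '$1', $text);

echo $anyNonSpace, "\n";  // молоко хлеб колбаса
echo $lettersOnly, "\n";  // молоко хлеб колбаса
```

The \p{L} form is the stricter of the two: it will not treat digits or punctuation as part of a repeated word.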
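<?php
$text = 'one one, two three, two';
// Single quotes avoid double-quote escaping; the (?:...)+ group also collapses runs of three or more repeats.
$result_text = preg_replace('/\b(\w+)(?:\s+\1\b)+/i', '$1', $text);
echo "Result Text: " . $result_text; // Result Text: one, two three, two
?>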

Try this. It should return one copy intact.
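When adapting a pattern like this, it may be worth checking how it behaves on runs of three or more repeats, as in the question's own example. The sketch below compares a pattern that matches a single repeated pair with one that uses a quantified group; the variable names are illustrative.

```php
<?php
$text = 'A very very very dirty dog';

// Pair-only pattern: each match consumes exactly two copies of the word.
$pairs = preg_replace('/\b(\w+)\s+\1\b/i', '$1', $text);

// Quantified group: a single match consumes the whole run of repeats.
$runs = preg_replace('/\b(\w+)(?:\s+\1\b)+/i', '$1', $text);

echo $pairs, "\n"; // A very very dirty dog
echo $runs, "\n";  // A very dirty dog
```

With the pair-only form, the third "very" survives because matching resumes after the pair that was already consumed.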

Resheil Agarwal