Delete the repetition of a specific word in a row

Question

For example, I have a string:

my_str = 'my example example string contains example some text'

What I want to do is delete all duplicates of a specific word (only if they appear consecutively). Expected result:

my example string contains example some text

I tried the following code:

import re
my_str = re.sub(' example +', ' example ', my_str)

or

my_str = re.sub('[ example ]+', ' example ', my_str)

But it doesn't work. I know there are a lot of questions about re, but I still can't apply them to my case correctly.

The best thing you can do is to read a regex tutorial, something basic should suffice. — Casimir et Hippolyte, Feb 16 '18 at 12:54
A problem that requires the use of word boundaries with dynamic values is not so easy to solve with a regex tutorial. Besides, the [Regular Expression For Consecutive Duplicate Words](https://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words) does not contain the Python implementation which is tricky for those who are not familiar with the best practice of using raw string literals for regex patterns. — Wiktor Stribiżew, Feb 16 '18 at 13:57
@WiktorStribiżew: Solving the general case isn't indeed so easy, but the question is about deleting a *specific* repeated word: `\bexample(?:\s+example)+\b` that is relatively simple, and building it dynamically using a capture group and a back-reference is almost over-engineering if you consider the well known formatted strings widely used in Python. The problem is that the asker lost more time writing a question and waiting for an answer on a topic than reading a tutorial on this topic. Also, he didn't do any search (see the two patterns and the evasive: *"I know there are...blah"*). — Casimir et Hippolyte, Feb 16 '18 at 17:19
@WiktorStribiżew: That said, my first comment is more an advice than a reproach. — Casimir et Hippolyte, Feb 16 '18 at 17:19

Wiktor Stribiżew · Accepted Answer · 2018-02-16T13:17:56.450

You need to create a group and quantify it:

import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text

# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)

I added word boundaries because, judging by the spaces in the original code, whole-word matches are expected.

Pattern details:

\b - word boundary (in the dynamic approach it is replaced with (?<!\w), which requires that no word char precedes the current position, because a "word" passed through re.escape may start with a non-word char, as in .word., and \b would then stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern): the example word
(?:\s+\1)+ - 1 or more occurrences of
- \s+ - 1+ whitespaces
- \1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).

Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
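
As a minimal sketch, the dynamic pattern above could be wrapped in a small helper. The name collapse_repeats and the sample inputs are assumptions made for illustration; only the pattern itself comes from the answer. The last line shows, for comparison, that the \b version leaves a "word" like .word. untouched.

```python
import re

def collapse_repeats(text, word):
    # collapse_repeats is a hypothetical helper name, not part of the re module.
    # re.escape makes any regex metacharacters in `word` match literally.
    pattern = r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word))
    return re.sub(pattern, r'\1', text)

sample = 'my example example string contains example some text'
print(collapse_repeats(sample, 'example'))
# my example string contains example some text

# Three or more repeats in a row are collapsed too.
print(collapse_repeats('a example example example b', 'example'))
# a example b

# A "word" that starts and ends with a non-word char still matches.
print(collapse_repeats('.word. .word. x', '.word.'))
# .word. x

# A longer word that merely starts with the target is left alone.
print(collapse_repeats('example examples', 'example'))
# example examples

# With \b instead of the lookarounds, the .word. case is not matched.
print(re.sub(r'\b(\.word\.)(?:\s+\1)+\b', r'\1', '.word. .word. x'))
# .word. .word. x
```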

@Mikhail_Sam I added code to handle situations when a word is a dynamic value and even in cases when the "word" might start/end with a non-word char. — Wiktor Stribiżew, Feb 16 '18 at 13:17

Srdjan M. · Answer 2 · 2018-02-16T13:11:17.277

2

Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b

Substitution: \1

Details:

\b Assert position at a word boundary
\w Matches any word character (equal to [a-zA-Z0-9_] in ASCII mode; with str patterns in Python 3 it also matches Unicode word characters by default)
\s Matches any whitespace character
+ Matches between one and unlimited times
\1 Backreference to the text captured by Group 1.

Python code:

import re

text = 'my example example string contains example some text'

# Collapse any word that is repeated consecutively into a single occurrence
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)

Output:

my example string contains example some text
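
The two patterns in this answer behave differently once another word is repeated as well. The sketch below uses a made-up input string (an assumption for illustration) to show that the \w+ version collapses every repeated word, while the version with a literal word only touches example.

```python
import re

# Made-up input with two different repeated words.
text = 'the the example example text'

# Generic pattern: collapses any consecutively repeated word.
any_word = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
print(any_word)
# the example text

# Specific pattern: collapses only repeats of "example".
one_word = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', text)
print(one_word)
# the the example text
```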

edited Feb 16 '18 at 13:11

answered Feb 16 '18 at 12:57

Srdjan M.

As I understand it, this removes ANY duplicated word? I'm not sure at the moment whether I need any word or only a specific one :) – Mikhail_Sam Feb 16 '18 at 13:05

score 1 · Answer 3 · answered Feb 16 '18 at 12:58

You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.

>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
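
Note that the one-liner above drops any consecutively repeated word. A possible variation that only drops repeats of one given word is sketched below; the function name drop_repeats and the second sample string are assumptions made for illustration. As with the original, split() and join() normalize all whitespace to single spaces.

```python
def drop_repeats(text, word):
    # drop_repeats is a hypothetical helper name used for illustration.
    words = text.split()
    # Skip a word only when it is the target and the previous word is too.
    kept = [w for i, w in enumerate(words)
            if i == 0 or not (w == word and words[i - 1] == word)]
    return ' '.join(kept)

my_str = 'my example example string contains example some text'
print(drop_repeats(my_str, 'example'))
# my example string contains example some text

# Other repeated words are left as they are.
print(drop_repeats('hi hi example example example', 'example'))
# hi hi example
```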

Joe Iddon

Actually I need to implement it for every string in text, so a one-line answer is preferable :) Anyway, I like the Python-only solution! Thank you! – Mikhail_Sam Feb 16 '18 at 13:06
@Mikhail_Sam What do you mean by "for every string in text"? – Joe Iddon Feb 16 '18 at 13:10
I mean I have a list of strings, so I will wrap it in a loop (or use a list comprehension). I'm afraid this method would be bulky – Mikhail_Sam Feb 16 '18 at 13:26
@Mikhail_Sam I see, well in that case, you could create a new list of the words and then do the `' '.join(...)` afterwards. Or, if the efficiency isn't too important, then you can simply replace the two occurrences of `words` *with* `my_str.split()` – Joe Iddon Feb 16 '18 at 13:36
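
For the list-of-strings case discussed in these comments, one possible approach is to compile a pattern once and apply it in a list comprehension. This is only a sketch: it uses the regex from the other answers rather than the split/join method, and the names PATTERN, strings and cleaned, as well as the sample data, are assumptions made for illustration.

```python
import re

# Compile once, reuse for every string in the list.
PATTERN = re.compile(r'\b(example)(?:\s+\1)+\b')

# Made-up sample data.
strings = [
    'my example example string',
    'example example example',
    'no repeats here',
    'examples example example',
]

cleaned = [PATTERN.sub(r'\1', s) for s in strings]
print(cleaned)
# ['my example string', 'example', 'no repeats here', 'examples example']
```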

score -1 · Answer 4 · answered Feb 16 '18 at 12:57

Why not use the .replace function:

my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))

Gareth O'Connor

the question is about the general case – Joe Iddon Feb 16 '18 at 12:58
There can be three, four, or any number in a row – Mikhail_Sam Feb 16 '18 at 12:59
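
The limitation raised in these comments can be checked directly. The sketch below uses made-up inputs (assumptions for illustration) to show two cases where a single str.replace call falls short: three repeats in a row, and a match that starts inside a longer word.

```python
# Three in a row: one pass of replace still leaves a duplicate.
three = 'my example example example string'
once = three.replace('example example', 'example')
print(once)
# my example example string

# No word boundaries: the match can start inside a longer word.
glued = 'counterexample example'.replace('example example', 'example')
print(glued)
# counterexample
```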
