1

For example I have a string:

my_str = 'my example example string contains example some text'

I want to delete all duplicates of a specific word, but only when they occur consecutively. Expected result:

my example string contains example some text

I tried the following code:

import re
my_str = re.sub(' example +', ' example ', my_str)

or

my_str = re.sub('[ example ]+', ' example ', my_str)

Neither attempt works. I know there are a lot of questions about re, but I still can't apply them to my case correctly.

Mikhail_Sam
  • The best thing you can do is to read a regex tutorial; something basic should suffice. – Casimir et Hippolyte Feb 16 '18 at 12:54
  • A problem that requires the use of word boundaries with dynamic values is not so easy to solve with a regex tutorial. Besides, the [Regular Expression For Consecutive Duplicate Words](https://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words) does not contain the Python implementation which is tricky for those who are not familiar with the best practice of using raw string literals for regex patterns. – Wiktor Stribiżew Feb 16 '18 at 13:57
  • @WiktorStribiżew: Solving the general case indeed isn't so easy, but the question is about deleting a *specific* repeated word: `\bexample(?:\s+example)+\b` is relatively simple, and building it dynamically using a capture group and a back-reference is almost over-engineering if you consider the well-known formatted strings widely used in Python. The problem is that the asker spent more time writing a question and waiting for an answer than it would have taken to read a tutorial on this topic. Also, he didn't do any search (see the two patterns and the evasive: *"I know there are...blah"*). – Casimir et Hippolyte Feb 16 '18 at 17:19
  • @WiktorStribiżew: That said, my first comment is more advice than reproach. – Casimir et Hippolyte Feb 16 '18 at 17:19

4 Answers

3

You need to create a group and quantify it:

import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text

# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)

I added word boundaries because, judging by the spaces in the original code, whole-word matches are expected.

Pattern details:

  • \b - word boundary (in the dynamic approach it is replaced with (?<!\w), which requires that no word char precedes the current position, because an escaped "word" may start or end with a non-word char, as in .word., and then \b might stop the regex from matching)
  • (example) - Group 1 (referred to with \1 from the replacement pattern): the example word
  • (?:\s+\1)+ - 1 or more occurrences of
    • \s+ - 1+ whitespaces
    • \1 - a backreference to the Group 1 value, that is, an example word
  • \b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).

Remember that in Python 2.x you need to pass the re.U flag to make the \b word boundary Unicode-aware.
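
The dynamic pattern above can be wrapped in a small helper so it is easy to apply to many strings. This is only a sketch: the function name collapse_repeats and the sample list are made up for illustration and are not part of the original answer.

```python
import re

def collapse_repeats(text, word):
    """Collapse consecutive repeats of `word` into a single occurrence.

    `collapse_repeats` is a hypothetical helper name used for illustration.
    """
    # re.escape makes regex metacharacters in `word` (such as ".") literal.
    # Lookarounds replace \b so the "word" may start or end with a
    # non-word character.
    pattern = r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word))
    return re.sub(pattern, r'\1', text)

# Sample data, invented for this sketch.
lines = [
    'my example example string contains example some text',
    'example example example at the start',
    'no repeats of the word here',
]
cleaned = [collapse_repeats(line, 'example') for line in lines]
print(cleaned)
```

A list comprehension keeps the per-string call to one line, which seems to be what the asker wants when processing a list of strings.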

Wiktor Stribiżew
  • @Mikhail_Sam I added code to handle situations when a word is a dynamic value and even in cases when the "word" might start/end with a non-word char. – Wiktor Stribiżew Feb 16 '18 at 13:17
2

Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b

Substitution: \1

Details:

  • \b Assert position at a word boundary
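  • \w Matches any word character (equal to [a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches Unicode letters and digits by default)
  • \s Matches any whitespace character
  • + Matches the preceding token between one and unlimited times
  • \1 Matches the same text that Group 1 matched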

Python code:

import re

text = 'my example example string contains example some text'

# Collapse any run of a repeated word into a single occurrence
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)

Output:

my example string contains example some text
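
The two patterns behave differently when a word other than the target is repeated. A small comparison, using a sample string invented for this sketch:

```python
import re

# Sample input, invented for this sketch: two different words are repeated.
text = 'the the example example string'

# Any word: the capture group takes whatever word comes first in the run.
any_word = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)

# Specific word: only runs of "example" are collapsed.
specific = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', text)

print(any_word)   # the example string
print(specific)   # the the example string
```

If only one known word should be deduplicated, the second pattern is the safer choice, since the first one also rewrites legitimate repeats of other words.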

Srdjan M.
  • As I understand it, this removes ANY duplicated word? I'm not sure at the moment whether I need any word or only a specific one :) – Mikhail_Sam Feb 16 '18 at 13:05
1

You could also do this in pure Python (without a regex) by splitting the string into a list of words and then building a new string that applies your rule.

>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
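
The one-liner above drops any word that equals its predecessor. A possible variant restricted to one specific word is sketched below; collapse_word is a hypothetical name, and like the one-liner it normalizes all whitespace to single spaces because of split() and join().

```python
def collapse_word(text, word):
    """Drop a token when it equals `word` and the previously kept token is also `word`.

    `collapse_word` is a hypothetical helper name used for illustration.
    """
    kept = []
    for w in text.split():
        if w == word and kept and kept[-1] == word:
            continue  # skip the consecutive repeat of the target word
        kept.append(w)
    return ' '.join(kept)

my_str = 'my example example string contains example some text'
print(collapse_word(my_str, 'example'))

# Applying it to a list of strings stays a one-liner
# (the list contents are invented for this sketch):
strings = [my_str, 'the the example example example']
print([collapse_word(s, 'example') for s in strings])
```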
Joe Iddon
  • Actually I need to apply it to every string in a text, so a one-line answer is preferable :) Anyway, I like the Python-only solution! Thank you! – Mikhail_Sam Feb 16 '18 at 13:06
  • @Mikhail_Sam What do you mean by "for every string in text"? – Joe Iddon Feb 16 '18 at 13:10
  • I mean I have a list of strings, so I will wrap it in a loop (or a list comprehension). I'm afraid this method would be bulky in that case. – Mikhail_Sam Feb 16 '18 at 13:26
  • @Mikhail_Sam I see, well in that case, you could create a new list of the words and then do the `' '.join(...)` afterwards. Or, if the efficiency isn't too important, then you can simply replace the two occurrences of `words` *with* `my_str.split()` – Joe Iddon Feb 16 '18 at 13:36
-1

Why not use the .replace function:

my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))
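
One caveat with this approach, shown as a small sketch with sample strings invented for illustration: str.replace scans for non-overlapping matches in a single pass, so a run of three or more repeats is only partly collapsed, and it matches substrings rather than whole words.

```python
text = 'my example example example string'

# A single pass leaves one duplicate behind when the word occurs three times.
once = text.replace('example example', 'example')
print(once)  # my example example string

# Repeating the replacement until nothing changes collapses the whole run.
result = text
while 'example example' in result:
    result = result.replace('example example', 'example')
print(result)  # my example string

# Substring matching can also alter unrelated words.
merged = 'counterexample example'.replace('example example', 'example')
print(merged)  # counterexample
```

The regex solutions with word boundaries avoid both problems.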