Remove adjacent duplicate words in a string with Python?

Question

How would I remove adjacent duplicate words in a string? For example, 'Hey there There' -> 'Hey there'.

https://stackoverflow.com/questions/7794208/how-can-i-remove-duplicate-words-in-a-string-with-python if you want no duplicate words at all... Or do you only want to remove adjacent duplicates? — ChrisOram, Jul 22 '21 at 07:57

Tim Biegeleisen · Accepted Answer · 2021-07-22T08:01:07.460

10

Using re.sub with a backreference we can try:

import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

The regex pattern used here says to:

(\w+)  match and capture a word
[ ]    followed by a space
\1     then followed by the same word (ignoring case)

Then, we just replace with the first adjacent word.
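One behaviour worth checking is what happens with a run of three or more identical words. The sketch below wraps the pattern above in a function and compares it with a variant that repeats the backreference. The function names `collapse_pairs` and `collapse_runs` are made up for illustration and are not part of the original answer.

```python
import re

def collapse_pairs(text):
    # The pattern from the answer, applied in a single re.sub pass.
    return re.sub(r'(\w+) \1', r'\1', text, flags=re.IGNORECASE)

def collapse_runs(text):
    # Hypothetical variant: (?: \1)+ lets one match absorb the whole run.
    return re.sub(r'(\w+)(?: \1)+', r'\1', text, flags=re.IGNORECASE)

print(collapse_pairs('Hey there There'))  # Hey there
print(collapse_pairs('be be be'))         # be be
print(collapse_runs('be be be'))          # be
```

Because re.sub does not rescan text it has already replaced, the single-pair pattern leaves one duplicate behind in a run of three.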

edited Jul 22 '21 at 08:01

answered Jul 22 '21 at 07:58

Tim Biegeleisen


What does r mean above? – user1655130 Jul 22 '21 at 08:06
@user1655130 An `r` preceding a Python string indicates that it is a _raw_ string. We use raw strings because they make regex easier to write: backslashes do not need to be escaped. – Tim Biegeleisen Jul 22 '21 at 08:06
from a learning perspective - how would you do this with recursion? – user1655130 Jul 22 '21 at 10:00
I suggest opening a new question, as using some kind of recursive approach is very different from my current answer (but maybe I can post _another_ answer). – Tim Biegeleisen Jul 22 '21 at 10:01
Unfortunately, it won't let me ask a similar question. Thanks for your help – user1655130 Jul 22 '21 at 14:06
`def removeConsecutiveDuplicateWors(s): st = s.split() if len(st) < 2: return " ".join(st) if st[0] != st[1]: nw = ("".join(st[0])) +" "+ removeConsecutiveDuplicateWors(" ".join(st[1:])) return nw return removeConsecutiveDuplicateWors(" ".join(st[1:])) string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?' print(removeConsecutiveDuplicateWors(string)) ` – Farhad Kabir Nov 03 '22 at 19:41
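On the raw-string comment above, a small check may help show why the `r` prefix matters for this particular pattern. This is only an illustrative sketch; the sample text 'cat concat' is an assumption chosen for the demonstration.

```python
import re

# Without the r prefix, Python interprets the escapes before re sees them.
print(len('\b'), len(r'\b'))   # 1 2
print('\1' == '\x01')          # True: not a backreference at all

# '\b' is a backspace character here, so the pattern never matches.
print(re.findall('\bcat\b', 'cat concat'))   # []
# r'\b' reaches the regex engine as a word-boundary assertion.
print(re.findall(r'\bcat\b', 'cat concat'))  # ['cat']
```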

ROHIT SHARMA 16110141 · Answer 2 · 2021-09-08T11:05:25.803

import re

inp = 'Hey there There'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

inp = 'Hey there eating?'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there eating?

\b asserts a word boundary, so the pattern matches whole words rather than fragments of words. The second test case ("Hey there eating?") does not work with the answer given by Tim Biegeleisen (https://stackoverflow.com/a/68481181/8439676): without \b it returns "Hey thereating?".
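To see the difference side by side, the two patterns can be run on the same input. This is a minimal sketch; the variable names `without_boundaries` and `with_boundaries` are chosen here for illustration.

```python
import re

without_boundaries = r'(\w+) \1'      # pattern from the accepted answer
with_boundaries = r'\b(\w+) \1\b'     # pattern from this answer

text = 'Hey there eating?'

# The trailing "e" of "there" and the leading "e" of "eating" are
# treated as a repeated "word" when no boundaries are required.
loose = re.sub(without_boundaries, r'\1', text, flags=re.IGNORECASE)
strict = re.sub(with_boundaries, r'\1', text, flags=re.IGNORECASE)

print(loose)   # Hey thereating?
print(strict)  # Hey there eating?
```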

score 0 · Answer 3 · answered Nov 03 '22 at 19:48

Remove adjacent duplicate words recursively

    def removeConsecutiveDuplicateWors(s):
        st = s.split()
        # Base case: zero or one word left, nothing to compare
        if len(st) < 2:
            return " ".join(st)
        # First two words differ: keep the first and recurse on the rest
        if st[0] != st[1]:
            return st[0] + " " + removeConsecutiveDuplicateWors(" ".join(st[1:]))
        # First two words match: drop the first and recurse on the rest
        return removeConsecutiveDuplicateWors(" ".join(st[1:]))


    string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
    print(removeConsecutiveDuplicateWors(string))

Output: I am a duplicate word in a sentence. How I can be removed?
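The recursive version makes one call per word, so a very long string could hit Python's recursion limit (1000 frames by default). As a non-recursive alternative, and not the method used in this answer, the same result can be sketched with itertools.groupby. The function name `remove_adjacent_duplicates` and the `ignore_case` parameter are assumptions added for illustration.

```python
from itertools import groupby

def remove_adjacent_duplicates(text, ignore_case=False):
    # groupby clusters consecutive equal items; the key controls equality.
    key = str.lower if ignore_case else None
    words = text.split()
    # Keep the first word of each run of equal neighbours.
    return " ".join(next(group) for _, group in groupby(words, key=key))

sentence = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
print(remove_adjacent_duplicates(sentence))
# I am a duplicate word in a sentence. How I can be removed?
print(remove_adjacent_duplicates('Hey there There', ignore_case=True))
# Hey there
```

Like the recursive function, this compares whitespace-separated tokens, so 'word' and 'word.' count as different words.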

H3lix · Answer 4 · 2023-03-10T16:23:36.160

Rohit Sharma's answer should be accepted, as it does in fact take word boundaries into account. The original answer would incorrectly change "Hey there eating" to "Hey thereating".

Alternatively, one could use the following regex (which will produce a slightly different output in some scenarios; see examples below):

my_output = re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', my_input, flags=re.IGNORECASE)

Example 1:

INPUT: Buying food food in the supermarket

ROHIT'S VERSION OUTPUT: Buying food in the supermarket

ABOVE VERSION OUTPUT: Buying food in the supermarket

Example 2:

INPUT: Food: Food and Beverages

ROHIT'S VERSION OUTPUT: Food: Food and Beverages (unchanged)

ABOVE VERSION OUTPUT: Food and Beverages
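The two examples can be reproduced with a short script. This is a sketch only; the names `rohit_pattern`, `this_pattern`, and `dedupe` are chosen here for illustration.

```python
import re

rohit_pattern = r'\b(\w+) \1\b'           # requires exactly one space
this_pattern = r'\b(\w+)(?:\W+\1\b)+'     # allows any non-word separator

def dedupe(pattern, text):
    return re.sub(pattern, r'\1', text, flags=re.IGNORECASE)

for my_input in ('Buying food food in the supermarket',
                 'Food: Food and Beverages'):
    print(dedupe(rohit_pattern, my_input))
    print(dedupe(this_pattern, my_input))
# Buying food in the supermarket
# Buying food in the supermarket
# Food: Food and Beverages
# Food and Beverages
```

Note that the second pattern also discards the separator between the duplicates, which is why the colon disappears in Example 2.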

Explanation:

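“\b”: A word boundary. Boundaries keep the pattern from matching inside a longer word. For example, in “My thesis is great”, the “is” at the end of “thesis” is not treated as a duplicate of the following “is”.

“\w+”: One or more word characters. With re.ASCII this is [a-zA-Z0-9_]; for Python 3 str patterns, \w also matches Unicode letters and digits by default.

“\W+”: One or more non-word characters: [^\w]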

“\1”: Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+)

“+”: Match whatever it's placed after 1 or more times
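The pieces explained above can also be written out with re.VERBOSE, which allows comments inside the pattern. The sketch below does that and adds a boundary-free variant to show the “thesis is” case; the names `DUPLICATE_RUN` and `NO_BOUNDARIES` are made up for illustration.

```python
import re

DUPLICATE_RUN = re.compile(r'''
    \b          # start at a word boundary
    (\w+)       # group 1: a word
    (?:         # non-capturing group, one per repeat
        \W+     # one or more non-word characters
        \1      # the same word again
        \b      # ending at a word boundary
    )+          # one or more repeats
''', flags=re.IGNORECASE | re.VERBOSE)

# Same pattern with both \b assertions removed, for comparison only.
NO_BOUNDARIES = re.compile(r'(\w+)(?:\W+\1)+', flags=re.IGNORECASE)

text = 'My thesis is great'
print(DUPLICATE_RUN.sub(r'\1', text))  # My thesis is great
print(NO_BOUNDARIES.sub(r'\1', text))  # My thesis great
```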

Credits:

I adapted this code to Python but it originates from this geeksforgeeks.org post
