How would I remove adjacent duplicate words in a string? For example, 'Hey there There' -> 'Hey there'.
https://stackoverflow.com/questions/7794208/how-can-i-remove-duplicate-words-in-a-string-with-python if you want no duplicate words at all... Or do you only want to remove adjacent duplicates? – ChrisOram Jul 22 '21 at 07:57
These words are not adjacent though – user1655130 Jul 22 '21 at 07:58
4 Answers
Using `re.sub` with a backreference, we can try:
import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output) # Hey there
The regex pattern used here says to:
(\w+) match and capture a word
[ ] followed by a space
\1 then followed by the same word (ignoring case)
Then, we just replace with the first adjacent word.
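To make the backreference step concrete, here is a minimal sketch that inspects what the pattern actually matches before the substitution happens. The variable names `pattern` and `match` are only illustrative and are not part of the original answer.

```python
import re

# Same pattern as above, compiled once so it can be reused.
pattern = re.compile(r'(\w+) \1', flags=re.IGNORECASE)

match = pattern.search('Hey there There')
print(match.group(0))  # there There  (the whole matched text)
print(match.group(1))  # there        (the captured first word)

# Replacing the whole match with group 1 keeps only the first word.
print(pattern.sub(r'\1', 'Hey there There'))  # Hey there
```

Because the replacement is `\1`, the casing of the first occurrence is the one that survives.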
@user1655130 An `r` preceding a Python string indicates that it is a _raw_ string. We use raw strings because they make regex patterns easier to write: backslashes don't need to be escaped. – Tim Biegeleisen Jul 22 '21 at 08:06
from a learning perspective - how would you do this with recursion? – user1655130 Jul 22 '21 at 10:00
I suggest opening a new question, as using some kind of recursive approach is very different from my current answer (but maybe I can post _another_ answer). – Tim Biegeleisen Jul 22 '21 at 10:01
Unfortunately, it won't let me ask a similar question. Thanks for your help – user1655130 Jul 22 '21 at 14:06
`def removeConsecutiveDuplicateWors(s): st = s.split() if len(st) < 2: return " ".join(st) if st[0] != st[1]: nw = ("".join(st[0])) +" "+ removeConsecutiveDuplicateWors(" ".join(st[1:])) return nw return removeConsecutiveDuplicateWors(" ".join(st[1:])) string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?' print(removeConsecutiveDuplicateWors(string)) ` – Farhad Kabir Nov 03 '22 at 19:41
import re

inp = 'Hey there There'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output) # Hey there

inp = 'Hey there eating?'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output) # Hey there eating?
`\b` asserts a word boundary, so the pattern matches entire words rather than individual characters. The second test case ("Hey there eating?") does not work with the answer given by Tim Biegeleisen (https://stackoverflow.com/a/68481181/8439676).
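A small side-by-side comparison may help show why the boundaries matter. This is only a sketch; the variable names are made up for illustration.

```python
import re

text = 'Hey there eating?'

# Pattern without word boundaries: the trailing "e" of "there" and the
# leading "e" of "eating" are treated as a repeated "word".
without_boundaries = re.sub(r'(\w+) \1', r'\1', text, flags=re.IGNORECASE)

# Pattern with \b on both sides: only whole words can match.
with_boundaries = re.sub(r'\b(\w+) \1\b', r'\1', text, flags=re.IGNORECASE)

print(without_boundaries)  # Hey thereating?
print(with_boundaries)     # Hey there eating?
```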
Remove adjacent duplicate words recursively
def removeConsecutiveDuplicateWors(s):
    st = s.split()
    # Base case: zero or one word left, nothing to compare.
    if len(st) < 2:
        return " ".join(st)
    # First two words differ: keep the first and recurse on the rest.
    if st[0] != st[1]:
        nw = ("".join(st[0])) + " " + removeConsecutiveDuplicateWors(" ".join(st[1:]))
        return nw
    # First two words are equal: drop the first and recurse on the rest.
    return removeConsecutiveDuplicateWors(" ".join(st[1:]))

string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
print(removeConsecutiveDuplicateWors(string))
Output: I am a duplicate word in a sentence. How I can be removed?
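For comparison, the same word-by-word comparison can be written as a loop. This is a sketch of an iterative alternative, not the approach used in this answer, and the function name `remove_consecutive_duplicate_words` is hypothetical. Like the recursive version, it splits on whitespace and compares words case-sensitively, so 'there There' would not be collapsed.

```python
def remove_consecutive_duplicate_words(s):
    result = []
    for word in s.split():
        # Keep a word only if it differs from the last word we kept.
        if not result or word != result[-1]:
            result.append(word)
    return " ".join(result)

string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
print(remove_consecutive_duplicate_words(string))
# I am a duplicate word in a sentence. How I can be removed?
```

A loop avoids one function call per word, which matters for long inputs because Python limits recursion depth (1000 frames by default).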
Rohit Sharma's answer should be accepted, as it does in fact take word boundaries into account. The original answer would incorrectly change "Hey there eating" to "Hey thereating".
Alternatively, one could use the following regex (which will produce a slightly different output in some scenarios; see examples below):
my_output = re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', my_input, flags=re.IGNORECASE)
Example 1:
INPUT: Buying food food in the supermarket
ROHIT'S VERSION OUTPUT: Buying food in the supermarket
ABOVE VERSION OUTPUT: Buying food in the supermarket
Example 2:
INPUT: Food: Food and Beverages
ROHIT'S VERSION OUTPUT: Food: Food and Beverages (unchanged)
ABOVE VERSION OUTPUT: Food and Beverages
Explanation:
“\b”: A word boundary. Boundaries are needed for special cases. For example, in “My thesis is great”, the “is” at the end of “thesis” won't be treated as a duplicate of the following “is”.
“\w+”: One or more word characters (for ASCII text, [a-zA-Z0-9_]).
“\W+”: One or more non-word characters: [^\w].
“\1”: Matches whatever was matched in the 1st group of parentheses, which in this case is the (\w+).
“+”: Match whatever it's placed after one or more times.
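The pieces explained above can be tried together in a short sketch. The names `PATTERN` and `dedupe` are hypothetical helpers introduced here for illustration; the regex itself is the one from this answer.

```python
import re

# \b(\w+) captures a whole word; (?:\W+\1\b)+ then consumes one or more
# repetitions of it, each preceded by at least one non-word character.
PATTERN = re.compile(r'\b(\w+)(?:\W+\1\b)+', flags=re.IGNORECASE)

def dedupe(text):
    # Replace the whole run of repeats with the first occurrence.
    return PATTERN.sub(r'\1', text)

print(dedupe('Buying food food in the supermarket'))  # Buying food in the supermarket
print(dedupe('Food: Food and Beverages'))             # Food and Beverages
print(dedupe('My thesis is great'))                   # My thesis is great
print(dedupe('be be be removed'))                     # be removed
```

Note that the separator between the repeats (such as the colon in Example 2) is removed along with the duplicate, which may or may not be what you want.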
Credits:
I adapted this code to Python but it originates from this geeksforgeeks.org post