0

How would I remove adjacent duplicate words in a string? For example, 'Hey there There' -> 'Hey there'.

Right leg
user1655130
  • https://stackoverflow.com/questions/7794208/how-can-i-remove-duplicate-words-in-a-string-with-python if you want no duplicate words at all... Or do you only want to remove adjacent duplicates? – ChrisOram Jul 22 '21 at 07:57
  • These words are not adjacent though – user1655130 Jul 22 '21 at 07:58

4 Answers

10

Using re.sub with a backreference we can try:

import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

The regex pattern used here says to:

(\w+)  match and capture a word
[ ]    followed by a space
\1     then followed by the same word (ignoring case)

Then, we just replace with the first adjacent word.
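
To see what the capture group and the backreference actually match, a small sketch along these lines may help. The variable names are only illustrative and are not part of the answer above.

```python
import re

# Same pattern as above: a captured word, one space, then the same text again.
pattern = re.compile(r'(\w+) \1', flags=re.IGNORECASE)

match = pattern.search('Hey there There')
matched_text = match.group(0)  # the whole adjacent pair
first_word = match.group(1)    # the text that \1 had to repeat

print(matched_text)  # there There
print(first_word)    # there
```

Replacing the whole match with `\1` is what leaves only the first word of the pair.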

Tim Biegeleisen
  • What does r mean above? – user1655130 Jul 22 '21 at 08:06
  • @user1655130 An `r` preceding a Python string indicates that it is a _raw_ string. We use raw strings because it can make it easier to write regex, avoiding escaping. – Tim Biegeleisen Jul 22 '21 at 08:06
  • from a learning perspective - how would you do this with recursion? – user1655130 Jul 22 '21 at 10:00
  • I suggest opening a new question, as using some kind of recursive approach is very different from my current answer (but maybe I can post _another_ answer). – Tim Biegeleisen Jul 22 '21 at 10:01
  • Unfortunately, it won't let me ask a similar question. Thanks for your help – user1655130 Jul 22 '21 at 14:06
  • `def removeConsecutiveDuplicateWors(s): st = s.split() if len(st) < 2: return " ".join(st) if st[0] != st[1]: nw = ("".join(st[0])) +" "+ removeConsecutiveDuplicateWors(" ".join(st[1:])) return nw return removeConsecutiveDuplicateWors(" ".join(st[1:])) string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?' print(removeConsecutiveDuplicateWors(string)) ` – Farhad Kabir Nov 03 '22 at 19:41
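
As a rough illustration of the raw-string point made in the comments, the following sketch compares a normal and a raw string literal. It uses `\b` because that escape means something different in Python strings and in regex; the sample text is made up for the demonstration.

```python
import re

plain = '\b'  # in a normal string literal, \b is the single backspace character
raw = r'\b'   # in a raw string literal, it stays as a backslash followed by 'b'

print(len(plain), len(raw))  # 1 2

# Only the raw form reaches the regex engine as the word-boundary token.
found_raw = re.search(r'\bcat\b', 'a cat sat') is not None
found_plain = re.search('\bcat\b', 'a cat sat') is not None
print(found_raw, found_plain)  # True False
```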
3

import re

inp = 'Hey there There'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

inp = 'Hey there eating?'
output = re.sub(r'\b(\w+) \1\b', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there eating?

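\b asserts a word boundary, so the pattern only matches whole words rather than runs of characters inside a word. The second test case ("Hey there eating?") does not work with the answer given by Tim Biegeleisen (https://stackoverflow.com/a/68481181/8439676): without boundaries, the trailing "e" of "there" and the leading "e" of "eating" are treated as a duplicate.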
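
A side-by-side sketch of the two patterns on that test case, assuming nothing beyond the standard `re` module (the variable names are illustrative):

```python
import re

# Pattern from the first answer (no boundaries) and the one from this answer.
without_boundaries = re.compile(r'(\w+) \1', flags=re.IGNORECASE)
with_boundaries = re.compile(r'\b(\w+) \1\b', flags=re.IGNORECASE)

text = 'Hey there eating?'
loose = without_boundaries.sub(r'\1', text)
strict = with_boundaries.sub(r'\1', text)

print(loose)   # Hey thereating?
print(strict)  # Hey there eating?
```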

0

Remove adjacent duplicate words recursively

    def removeConsecutiveDuplicateWors(s):
        st = s.split()
        # Base case: zero or one word left, nothing to compare.
        if len(st) < 2:
            return " ".join(st)
        rest = removeConsecutiveDuplicateWors(" ".join(st[1:]))
        # Keep the first word only if it differs from the next one (case-sensitive).
        if st[0] != st[1]:
            return st[0] + " " + rest
        return rest


    string = 'I am a duplicate duplicate word in a sentence. How I can be be be removed?'
    print(removeConsecutiveDuplicateWors(string))

Output: I am a duplicate word in a sentence. How I can be removed?
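
The recursive version goes one call deeper per word, so very long inputs can run into Python's recursion limit. As a non-recursive alternative (a different technique from the answer above, not a rewrite of it), a sketch with `itertools.groupby` could look like this. The function name `dedupe_adjacent_words` and the `ignore_case` parameter are hypothetical.

```python
from itertools import groupby


def dedupe_adjacent_words(text, ignore_case=True):
    """Collapse runs of identical adjacent whitespace-separated words."""
    key = str.lower if ignore_case else None
    # groupby yields one group per run of equal keys; keep each run's first word.
    return " ".join(next(group) for _, group in groupby(text.split(), key=key))


print(dedupe_adjacent_words('Hey there There'))          # Hey there
print(dedupe_adjacent_words('I can be be be removed?'))  # I can be removed?
```

Like the recursive answer, this compares whole whitespace-separated tokens, so "word" and "word." count as different.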

Farhad Kabir
0

Rohit Sharma's answer should be accepted, as it does in fact take word boundaries into account. The original answer would incorrectly change "Hey there eating" to "Hey thereating".

Alternatively, one could use the following regex (which will produce a slightly different output in some scenarios; see examples below):

import re
my_output = re.sub(r'\b(\w+)(?:\W+\1\b)+', r'\1', my_input, flags=re.IGNORECASE)

Example 1:

INPUT: Buying food food in the supermarket

ROHIT'S VERSION OUTPUT: Buying food in the supermarket

ABOVE VERSION OUTPUT: Buying food in the supermarket

Example 2:

INPUT: Food: Food and Beverages

ROHIT'S VERSION OUTPUT: Food: Food and Beverages (unchanged)

ABOVE VERSION OUTPUT: Food and Beverages

Explanation:

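“\b”: A word boundary. Boundaries are needed for cases such as “My thesis is great”, where the “is” at the end of “thesis” must not be treated as a duplicate of the following “is”.

“\w+”: One or more word characters. For ASCII text this is [a-zA-Z0-9_]; in Python 3, \w also matches Unicode letters and digits unless re.ASCII is set.

“\W+”: One or more non-word characters, i.e. [^\w].

“\1”: Matches whatever was matched by the first group of parentheses, which in this case is (\w+).

“+”: Matches whatever it is placed after one or more times.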
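
A short sketch that exercises the pieces explained above, including a word repeated three times and the “thesis is” case. The sample sentences other than the two examples from this answer are made up for illustration.

```python
import re

pattern = re.compile(r'\b(\w+)(?:\W+\1\b)+', flags=re.IGNORECASE)

# The trailing + on the non-capturing group collapses a whole run in one match.
triple = pattern.sub(r'\1', 'How can I be be be removed?')
# \W+ lets punctuation sit between the duplicates.
punctuated = pattern.sub(r'\1', 'Food: Food and Beverages')
# The leading \b stops the "is" inside "thesis" from pairing with the next "is".
untouched = pattern.sub(r'\1', 'My thesis is great')

print(triple)      # How can I be removed?
print(punctuated)  # Food and Beverages
print(untouched)   # My thesis is great
```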

Credits:

I adapted this code to Python but it originates from this geeksforgeeks.org post

H3lix