Removal of adjacent Duplicate word/phrase from string having accented characters

Question

I am trying to remove duplicate words and phrases from a string.

For example, given the string below:

"normalement on on on va, on va diviser, générique générique générique l'explication, générique l'explication détaille, détaille"

I want to remove the duplicate phrase "on va" after the comma and "générique l'explication" after the comma in the string above, as well as the consecutive duplicate single words "on" and "générique". I tried the two approaches below, but they only seem to work on single words with no punctuation between them.

>>> import re
>>> s = "normalement on on on va, on va diviser, générique générique l'explication, générique l'explication détaille, détaille"
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
"normalement on va, on va diviser, générique l'explication, générique l'explication détaille, détaille"

>>> sen="normalement on on on va, on va diviser, générique générique l'explication, générique l'explication détaille, détaille"
>>> re.sub(r"\b([a-zA-z àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ']+\s *)\1{1,}", '\\1', sen, flags=re.IGNORECASE)
"normalement on va, on va diviser, générique l'explication, générique l'explication détaille, détaille"

Can anyone help me with this and advise how I can remove adjacent duplicate words/phrases, whether or not they are separated by punctuation?

Isn't [that](https://stackoverflow.com/questions/76095483) what you are looking for? — markalex, Jun 20 '23 at 17:06

SanguineL · Accepted Answer · 2023-06-21T12:13:39.063

re.sub(r"\b(\w+(\s\w+)?)\b(?:.*?)(\b\1\b)", "\\1", sen, flags=re.IGNORECASE)

This should do what you want. It matched the one string you shared.

Update:

(after @markalex's helpful comment.)

The previous regex would catch any duplicates, even if they were at opposite ends of the string being checked. Here is an updated version.

re.sub(r"(\b[a-zA-ZàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ']+(?:\s[a-zA-ZàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ']+)?\b)(?:\W*)(\b\1\b)", "\\1", sen, flags=re.IGNORECASE)

Explanation:

(                                             #Begin 1st Capture Group
 \b                                           #Word Boundary
  [a-zA-ZàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ']+  #Any of the characters you want, repeated
  (?:                                         #Begin Non-Capture Group, for additional word
   \s                                         #Whitespace
   [a-zA-ZàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ']+ #Any of the characters you want, repeated
  )?                                          #End Non-Capture Group, Allow 0 or 1
 \b                                           #Word Boundary
)                                             #End 1st Capture Group

(?:                                           #Begin Non-Capture Group
 \W*                                          #Match any number of non-alphanumeric characters
)                                             #End Non-Capture Group

(                                             #Begin 2nd Capture Group
 \b                                           #Word Boundary
  \1                                          #Match 1st Capture Group
 \b                                           #Word Boundary
)                                             #End 2nd Capture Group
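
As a rough sketch, the pattern explained above can be written in verbose mode so that the comments live next to the regex itself. The character class is taken from the answer, with the letter range written as `A-Z`. The names `LETTERS`, `ADJACENT_DUPLICATE` and `remove_adjacent_duplicates` are illustrative only. The repeat-until-stable loop is an assumption added here, not part of the answer: a single `re.sub` pass consumes the second copy of each match, so a run such as "on on on" still has one duplicate left after the first pass.

```python
import re

# Letters allowed inside a word (assumed to cover the accented characters of interest).
LETTERS = "a-zA-ZàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ'"

ADJACENT_DUPLICATE = re.compile(
    rf"""
    (\b[{LETTERS}]+(?:\s[{LETTERS}]+)?\b)  # group 1: one word, or two words split by whitespace
    \W*                                    # whitespace or punctuation between the two copies
    \b\1\b                                 # the same word or phrase again
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)


def remove_adjacent_duplicates(text):
    """Apply the substitution repeatedly until the text stops changing."""
    while True:
        reduced = ADJACENT_DUPLICATE.sub(r"\1", text)
        if reduced == text:
            return text
        text = reduced


print(remove_adjacent_duplicates("on on on va"))           # on va
print(remove_adjacent_duplicates("on va, on va diviser"))  # on va diviser
```

Because `\W*` sits outside group 1, any punctuation between the two copies is dropped along with the duplicate, which is why the comma disappears in the second call.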

This will result in deletion of any repeated word anywhere in input with all words in-between, like [`word1 other words word1`](https://regex101.com/r/S9tPJk/1). — markalex, Jun 20 '23 at 17:17
@markalex Ah I see. That is unfortunate, as it allowed for the use of `\w` in the line. I will update my answer. — SanguineL, Jun 20 '23 at 17:44
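
The behaviour described in the comment above can be reproduced with a short check. This is only a minimal illustration using the first regex from the answer and the sample text from the comment; the variable names are made up for the example.

```python
import re

# First version of the regex from the answer: the lazy `.*?` lets the two
# copies of the captured word be arbitrarily far apart.
first_pattern = r"\b(\w+(\s\w+)?)\b(?:.*?)(\b\1\b)"

text = "word1 other words word1"
result = re.sub(first_pattern, r"\1", text, flags=re.IGNORECASE)
print(result)  # word1
```

Everything between the two copies of `word1` is part of the match, so it is removed together with the repeated word.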

score 0 · Answer 2 · answered Jun 20 '23 at 23:03

You can use the following pattern, with the re.finditer function.

Subsequently, you'll need to check if the match contains a comma, in which case you'll need to use a separate str.replace statement.

I could not think of a way to capture the comma.

([^ ]+ [^ ]+|[^ ]+),? \1

import re

# Double quotes keep the apostrophes; 'l''explication' would be concatenated to "lexplication".
string = "normalement on on on va, on va diviser, générique générique générique l'explication, générique l'explication détaille, détaille"
for match in re.finditer(r'([^ ]+ [^ ]+|[^ ]+),? \1', string):
    if ',' in match.group():
        # Duplicate separated by a comma: keep one copy and the comma.
        string = string.replace(match.group(), match.group(1) + ',')
    else:
        string = string.replace(match.group(), match.group(1))
print(string)

Output

normalement on on va, diviser, générique générique l'explication, détaille,
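
One possible way to keep the comma without a separate `str.replace` call is to put the optional comma in its own capture group and refer to it in the replacement. The sketch below reuses the alternation from this answer unchanged; `PATTERN` and `collapse_once` are hypothetical names, and a single pass is assumed, so longer runs of duplicates would need the call repeated.

```python
import re

# Same alternation as in the answer, with the optional comma moved into group 2.
PATTERN = re.compile(r"([^ ]+ [^ ]+|[^ ]+)(,?) \1")


def collapse_once(text):
    """Keep the first copy (group 1) followed by the comma, if any (group 2)."""
    return PATTERN.sub(r"\1\2", text)


print(collapse_once("on va, on va diviser"))  # on va, diviser
print(collapse_once("détaille, détaille"))    # détaille,
```

Since the pattern has no word boundaries, it can also match inside longer words, so it is best treated as a starting point rather than a finished solution.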
