
Suppose I have a string such as

'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

I want to remove the second occurrence of `duplicate phrase` without removing other occurrences of its constituent parts, such as the other use of `duplicate`.

Moreover, I need to remove all potential duplicate phrases, not just the duplicates of some specific phrase that I know in advance.

I have found several posts on similar problems, but none that have helped me solve my particular issue:

I had hoped to adapt the approach from the last link there (`re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)`) for my purposes, but could not figure out how to do so.
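For context, here is a minimal sketch of what that pattern does as written. The helper name `collapse_adjacent` is made up for illustration. It collapses a repeat only when the two copies are separated by whitespace alone, so it leaves my example string untouched, where the repeat follows a comma and a space:

```python
import re

def collapse_adjacent(text):
    # Hypothetical wrapper around the pattern quoted above: some text,
    # then one or more copies of it, each preceded only by whitespace.
    return re.sub(r'\b(.+)(\s+\1\b)+', r'\1', text)

print(collapse_adjacent('hello hello world'))  # hello world

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
# The repeat is separated by ', ' rather than whitespace alone, so nothing changes.
print(collapse_adjacent(s) == s)  # True
```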

How do I remove all arbitrary duplicate phrases of two or more words from a string in Python?

duckmayr
  • Is the size of these phrases arbitrary? Can they appear anywhere in the text? – Dani Mesejo Nov 06 '18 at 23:50
  • For example, the string `'aaa ccc bbb aaa ccc'` has the duplicate phrase `"aaa ccc"`, but to find that out you would have to iterate over all phrases in the string. Maybe a suffix tree could make this faster. – Ale Nov 07 '18 at 00:42

1 Answer


Thanks everyone for your attempts and comments. I have finally found a solution:

import re

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags=re.I)
# 'I hate *some* kinds of duplicate. This string has a duplicate phrase.'

Explanation

The regular expression

r'((\b\w+\b.{1,2}\w+\b)+).+\1'

matches a phrase made up of runs of word characters (`\w+`: letters, digits, and underscores) separated by one or two arbitrary characters (to cover the case where words are separated not just by a space, but perhaps by a period or comma and a space), followed by at least one arbitrary character and then the same phrase again (the backreference `\1`). Then

re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)

replaces each such match (the first occurrence of the phrase, the text in between, and the repeat) with the first occurrence alone, ignoring case via `re.I` (since the duplicate phrase could sometimes occur at the beginning of a sentence).
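To make the behaviour easier to check, here is a small sketch that wraps the same pattern in a function and tries it on a few inputs. The names `DUPLICATE_PHRASE` and `remove_duplicate_phrases` are made up for this illustration, and the extra test strings are my own, not from the question:

```python
import re

# Same pattern as above, compiled once; the constant name is illustrative only.
DUPLICATE_PHRASE = re.compile(r'((\b\w+\b.{1,2}\w+\b)+).+\1', flags=re.I)

def remove_duplicate_phrases(text):
    """Replace 'phrase ... phrase' with a single 'phrase' (hypothetical helper)."""
    return DUPLICATE_PHRASE.sub(r'\1', text)

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
print(remove_duplicate_phrases(s))
# I hate *some* kinds of duplicate. This string has a duplicate phrase.

# re.I lets the repeat differ in case; the first occurrence is kept as written.
print(remove_duplicate_phrases('Duplicate phrase, duplicate phrase.'))
# Duplicate phrase.

# A single repeated word is not a phrase of two or more words, so it is left alone.
print(remove_duplicate_phrases('duplicate, duplicate.'))
# duplicate, duplicate.

# Caveat: `.+` is greedy, so everything up to the LAST repeat is dropped.
print(remove_duplicate_phrases('a b x a b y a b'))
# a b
```

The last call shows a limitation worth keeping in mind: when a phrase occurs more than twice, the text between the first and last occurrence is removed along with the repeats. A non-greedy `.+?` might be closer to what is wanted in that situation, though I have not tested that variant thoroughly.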

duckmayr
  • So does this regex only work on the second sentence? – Life is complex Nov 07 '18 at 01:34
  • I'm asking because it did not work for this: s = 'I hate *some* kinds of duplicate, duplicate. This string has a duplicate phrase, duplicate phrase.' – Life is complex Nov 07 '18 at 01:35
  • @Lifeiscomplex Your comment uncovered an inaccurate statement I made in response to a comment on the question. I'm specifically looking at phrases of 2+ words, so I did not mean it to cover such a case. I will edit the question & answer later to make this clear for future viewers. – duckmayr Nov 07 '18 at 01:38