How to remove Duplicate SubStrings

Question

I have a portion of a string which duplicates itself, and I want to remove all duplicate "substrings" without losing the order of the words in the string.

For example: "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE"

Here "BP 425 BRIVE-LA-GAILLARDE" repeats itself 4 times.

I would like the string to finally be "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE" where the duplicates have been removed.

This problem is occurring when one of my generic scraper modules is collecting all text elements from a certain HTML Element. In the HTML element the same information is repeated multiple times but is hidden using CSS. This is why I am looking for a generic way of de-duplicating substrings.

More examples of duplicated substrings:

 "TOUR SOCIETE SUISSE 1 BD VIVIER MERLE 1 BD VIVIER MERLE"
      => "TOUR SOCIETE SUISSE  1 BD VIVIER MERLE"

 "2 PARC DES ERABLES 66 RTE DE SARTROUVILLE 66 RTE DE SARTROUVILLE"
      => "2 PARC DES ERABLES  66 RTE DE SARTROUVILLE"

 "CASERNE AUDEOUD 111 AV DE LA CORSE 111 AV DE LA CORSE"
      => "CASERNE AUDEOUD  111 AV DE LA CORSE"

The simple approach of never repeating the same word does not work here, because a word can legitimately occur more than once without being part of a duplicate. For example, in "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE" the "LA" between BRIVE and GAILLARDE would be removed.

The output would be "12 PL DE LA HALLE BP 425 BRIVE GAILLARDE", whereas the desired output is "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE".
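
This failure can be reproduced with a minimal sketch of such a word filter. The function name `dedupe_words` is hypothetical and chosen only for illustration:

```python
def dedupe_words(text):
    """Keep only the first occurrence of each whitespace-separated word."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)

s = "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
print(dedupe_words(s))
# prints: 12 PL DE LA HALLE BP 425 BRIVE GAILLARDE
```

The second "LA" is dropped because the filter only tracks single words and not the phrase they belong to.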

My hunch is that one would need to compare sequences of words, but I am not sure exactly how.

Any help appreciated.

Are there really double spaces between the duplicates or did you add them to the question for convenience? — chuck, Mar 02 '20 at 17:25
What should be the output for this? `"2 PARC DES ERABLES 66 RTE DE SARTROUVILLE 66 RTE DE 66 DE "` — Shubham Nagota, Mar 02 '20 at 17:25
How about this: https://stackoverflow.com/questions/7794208/how-can-i-remove-duplicate-words-in-a-string-with-python — DZurico, Mar 02 '20 at 17:25
@DZurico a simple filter for duplicate space-delimited words would remove the second `"DE"` from `"12 PL DE LA HALLE 66 RTE DE SARTROUVILLE 66 RTE DE SARTROUVILLE"`. — jfaccioni, Mar 02 '20 at 17:30
@chuck2002 I will remove the double spaces, they are not important to the problem. It was an Excel copy-paste problem. — Codious-JR, Mar 02 '20 at 17:32
Maybe you should explain how you get those duplicate strings. Maybe the problem could be solved before it arises. — Phoenixo, Mar 02 '20 at 17:32
@phoenixo I will add an explanation of the source of the problem to the question. — Codious-JR, Mar 02 '20 at 17:34
@Codious-JR oh OK, I just thought if the double spaces were in the real data maybe they could be used somehow. — chuck, Mar 02 '20 at 17:49

Tim Biegeleisen · Accepted Answer · 2020-03-02T18:15:45.557

Here is a potentially viable regex-based solution:

import re

inp = "12 PL DE LA HALLE  BP 425 BRIVE-LA-GAILLARDE  BP 425 BRIVE-LA-GAILLARDE  BP 425 BRIVE-LA-GAILLARDE  BP 425 BRIVE-LA-GAILLARDE"
# Repeat the substitution until a pass no longer changes the string.
while True:
    out = re.sub(r'(?<!\S)(\S+(?:\s\S+)*)\s+\1(?!\S)', '\\1', inp)
    if out == inp:
        break
    inp = out
print(out)

This prints:

12 PL DE LA HALLE  BP 425 BRIVE-LA-GAILLARDE

The idea here is to match any phrase which is followed by the same phrase, and then to replace with just the first captured phrase.

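We call re.sub in a loop here, because a single pass is not always enough: re.sub only replaces non-overlapping matches, so once PHRASE PHRASE has been collapsed into a single PHRASE, that remaining PHRASE is not compared against a further copy that follows it in the same pass. The loop stops once a pass leaves the string unchanged.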
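
To see why one pass is not enough, here is a small sketch that records the string after each pass. The names `PATTERN` and `collapse_passes` are made up for this illustration, and the input is the single-spaced string from the question rather than the double-spaced one used in the snippet.

```python
import re

# Same pattern as in the snippet, compiled once.
PATTERN = re.compile(r'(?<!\S)(\S+(?:\s\S+)*)\s+\1(?!\S)')

def collapse_passes(text):
    """Return the input followed by the result of each pass that changed it."""
    passes = [text]
    while True:
        out = PATTERN.sub(r'\1', passes[-1])
        if out == passes[-1]:
            return passes
        passes.append(out)

phrase = "BP 425 BRIVE-LA-GAILLARDE"
inp = "12 PL DE LA HALLE " + " ".join([phrase] * 4)
for i, step in enumerate(collapse_passes(inp)):
    # Show how many copies of the phrase remain after pass i.
    print(i, step.count(phrase), "copies")
```

On this input the copies go from four to two to one, so the first pass alone would still leave a duplicate behind.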

Here is an explanation of the regex pattern:

(?<!\S)         assert what precedes is either whitespace or the start of the string
(               match AND capture the following in \1
    \S+         match one or more non whitespace characters (i.e. a "word")
    (?:\s\S+)*  then match a space followed by another word, zero or more times
)
\s+             match one or more whitespace characters
\1              then match the same phrase we just saw         
(?!\S)          assert that whitespace or the end of the string follows
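
The same breakdown can be kept next to the code by compiling the pattern with `re.VERBOSE`. This is only a sketch: the names `REPEATED_PHRASE` and `remove_repeated_phrases` are hypothetical, and the inputs are the single-spaced examples from the question.

```python
import re

REPEATED_PHRASE = re.compile(r"""
    (?<!\S)          # preceded by whitespace or the start of the string
    (                # capture the phrase as group 1
        \S+          #   one word
        (?:\s\S+)*   #   then a whitespace character and another word, zero or more times
    )
    \s+              # whitespace between the two copies
    \1               # the same phrase again
    (?!\S)           # followed by whitespace or the end of the string
""", re.VERBOSE)

def remove_repeated_phrases(text):
    """Collapse immediately repeated phrases until the string stops changing."""
    previous = None
    while previous != text:
        previous = text
        text = REPEATED_PHRASE.sub(r"\1", text)
    return text

examples = [
    "TOUR SOCIETE SUISSE 1 BD VIVIER MERLE 1 BD VIVIER MERLE",
    "2 PARC DES ERABLES 66 RTE DE SARTROUVILLE 66 RTE DE SARTROUVILLE",
    "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE",
]
for example in examples:
    print(remove_repeated_phrases(example))
```

Note that the pattern collapses any phrase that is immediately followed by itself, so a legitimately doubled word such as "BORA BORA" would also be reduced to a single "BORA".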

I am not sure if the double space occurs all the time. In this example it does, but I don't think it will always be there. — Codious-JR, Mar 02 '20 at 17:58
@Codious-JR If your data is inconsistent then please fix the source of the data. If your data itself is inconsistent it's really hard for anyone to help. — Ch3steR, Mar 02 '20 at 18:03
Awesome, that did the trick! Could you please explain the regex pattern? It's super complex :D — Codious-JR, Mar 02 '20 at 18:09

score -1 · Answer 2 · answered Mar 02 '20 at 17:35

Please check this out.

text = "2 PARC DES ERABLES  66 RTE DE SARTROUVILLE  66 RTE DE SARTROUVILLE"

# Keep only the first occurrence of each whitespace-separated word, in order.
output_list = []
for sub_str in text.split():
    if sub_str not in output_list:
        output_list.append(sub_str)

output_str = " ".join(output_list)
print(output_str)

krishna


Your implementation doesn't work when the same words are repeated but aren't duplicates. For example, in "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE" the "LA" between BRIVE and GAILLARDE would be removed. – Codious-JR Mar 02 '20 at 17:43
But LA is mentioned between **DE and HALLE** and between **BRIVE and GAILLARDE**, so one of them has to be removed according to your requirement. – krishna Mar 02 '20 at 17:46
The expected output is "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE", and your implementation would provide "12 PL DE LA HALLE BP 425 BRIVE GAILLARDE", where the second "LA" would be missing. – Codious-JR Mar 02 '20 at 17:48
The code was updated. Did you check with the current code? Can you please check it? – krishna Mar 02 '20 at 17:50
