I have a portion of a string which duplicates itself, and I want to remove all duplicate "substrings" without losing the order of the words in the string.
For example: "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE"
Here "BP 425 BRIVE-LA-GAILLARDE" repeats itself 4 times.
I would like the final string to be "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE", with the duplicates removed.
This problem occurs when one of my generic scraper modules collects all text elements from a certain HTML element. In that element the same information is repeated multiple times, with the repeats hidden using CSS. This is why I am looking for a generic way of de-duplicating substrings.
More examples of duplicated substrings:
"TOUR SOCIETE SUISSE 1 BD VIVIER MERLE 1 BD VIVIER MERLE"
=> "TOUR SOCIETE SUISSE 1 BD VIVIER MERLE"
"2 PARC DES ERABLES 66 RTE DE SARTROUVILLE 66 RTE DE SARTROUVILLE"
=> "2 PARC DES ERABLES 66 RTE DE SARTROUVILLE"
"CASERNE AUDEOUD 111 AV DE LA CORSE 111 AV DE LA CORSE"
=> "CASERNE AUDEOUD 111 AV DE LA CORSE"
The simple approach of never keeping the same word twice does not work here, because a word can legitimately occur more than once without being part of a duplicate. For example: "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
Here the "LA" between BRIVE and GAILLARDE would be removed, because "LA" already appears earlier in "DE LA HALLE",
and the output would be: "12 PL DE LA HALLE BP 425 BRIVE GAILLARDE"
whereas the desired output is: "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE"
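To make the failure concrete, here is a minimal sketch of that word-level approach, assuming Python and whitespace-separated words (the function name `dedupe_words` is only illustrative):

```python
def dedupe_words(text):
    """Keep only the first occurrence of each word, preserving order."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


address = "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
print(dedupe_words(address))
# The second "LA" is dropped because "LA" was already seen in "DE LA HALLE".
```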
My hunch is that one would need to compare sequences of words rather than single words, but I am not sure exactly how.
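A rough sketch of what I mean by comparing sequences of words, again assuming Python and whitespace-separated words. The name `collapse_repeats` is hypothetical, and this only handles a block of words that is immediately followed by an identical block. I am not sure it is robust or efficient enough to be called generic:

```python
def collapse_repeats(text):
    """Collapse word sequences that are immediately repeated.

    At each position, look for the longest block of words that is
    directly followed by an identical block, and delete the second copy.
    """
    words = text.split()
    i = 0
    while i < len(words):
        collapsed = False
        # Try the longest candidate block first, down to a single word.
        for n in range((len(words) - i) // 2, 0, -1):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                del words[i + n:i + 2 * n]  # drop the repeated copy
                collapsed = True
                break
        if not collapsed:
            i += 1  # nothing repeats here, move on
    return " ".join(words)


print(collapse_repeats(
    "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
))
```

One caveat with this sketch: it would also collapse legitimate adjacent repeats (for example a doubled word that really belongs in the address), and it does nothing for duplicates that are separated by other text.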
Any help appreciated.