I have a column in my dataframe for articles that looks like this:
id link
1 https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
2 https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
3 other link
For example, the first two URLs are almost identical; the only difference is that the first contains dun-deal where the second contains:
d-un-deal
In my dataframe, some links are almost identical. The content is the same but the link changes: sometimes the only difference between two links is a letter that is uppercase in one of them, or some other character that differs.
Example:
url1 = https://site/presidency...
url2 = https://site/Presidency...
url3 = https://site/news-of-today
url4 = same as url3 but with ?autoplay appended at the end
How can I check all the links and delete the duplicates (same content, but the link differs slightly)?
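To make the goal concrete, here is a minimal sketch of one possible approach, assuming the near-duplicates differ only in letter case, punctuation (such as an extra hyphen), or a trailing query string like ?autoplay. The helper name normalize_link and the temporary key column are hypothetical, chosen for illustration; the sample rows reuse the example links above as written.

```python
import re
from urllib.parse import urlsplit, unquote

import pandas as pd


def normalize_link(link: str) -> str:
    """Build a comparison key for a link (hypothetical helper).

    Assumption: two links point to the same article when they differ only
    in letter case, punctuation, or the query string / fragment.
    """
    parts = urlsplit(link)
    # Decode percent-escapes (e.g. %C3%A0) and ignore letter case.
    path = unquote(parts.path).lower()
    # Drop everything that is not a letter or digit (hyphens, slashes, dots).
    path = re.sub(r"[\W_]+", "", path)
    # The query string and fragment are deliberately left out of the key.
    return parts.netloc.lower() + "/" + path


df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "link": [
            "https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg",
            "https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg",
            "https://site/presidency...",
            "https://site/Presidency...",
            "https://site/news-of-today",
            "https://site/news-of-today?autoplay",
        ],
    }
)

# Compute the key, keep the first row for each key, then drop the helper column.
df["key"] = df["link"].map(normalize_link)
deduped = df.drop_duplicates(subset="key", keep="first").drop(columns="key")
print(deduped)
```

This only catches the kinds of differences listed in the assumption. Links that differ by other characters (a typo, an extra word) would need a fuzzy comparison instead, for example a similarity ratio from difflib.SequenceMatcher with a threshold, which is slower because it compares links pairwise.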