
I have a column in my dataframe for articles that looks like this:

id link
1  https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
2  https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
3  other link

For example, the first two URLs appear to be the same, but they differ here:

d-un-deal

In my dataframe, some links are nearly identical. The content is the same but the link changes: sometimes the only difference between two links is an uppercase letter in one of them, or some other character that differs.

Example:

url1 = https://site/presidency...
url2 = https://site/Presidency...

url3 = https://site/news-of-today

url4 = same as url3 but with ?autoplay at the end

How can I check all the links and delete the duplicates (same content, but the link changes slightly)?

jos97

1 Answer


Here is one solution:

Find the similarity metric between two strings

You could use a string similarity metric for this. Decide which similarity measure you want to use.
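As a minimal sketch of that idea, the snippet below normalizes each URL and then compares pairs with `difflib.SequenceMatcher` from the standard library. The helper names (`normalize`, `similarity`, `drop_near_duplicates`), the normalization steps, and the `0.95` threshold are all assumptions made for illustration, not something from the question. A high score only suggests that two links are the same page; it does not prove it, so the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, unquote


def normalize(url):
    """Lowercase the host and path, decode %-escapes, drop query string and fragment."""
    parts = urlsplit(url.strip())
    path = unquote(parts.path).lower().rstrip("/")
    return parts.netloc.lower() + path


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized URLs are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def drop_near_duplicates(urls, threshold=0.95):
    """Keep a URL only if it is below the threshold against every URL kept so far.

    The threshold is an arbitrary assumption. The comparison is quadratic in the
    number of URLs, which is fine for small columns but slow for large ones.
    """
    kept = []
    for url in urls:
        if all(similarity(url, other) < threshold for other in kept):
            kept.append(url)
    return kept


# Example URLs: the first two come from the question, the rest are made up.
urls = [
    "https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg",
    "https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg",
    "https://site/presidency",
    "https://site/Presidency",
    "https://site/news-of-today",
    "https://site/news-of-today?autoplay",
]
kept = drop_near_duplicates(urls)
```

With a dataframe, one way to apply this would be to pass `df["link"].tolist()` to `drop_near_duplicates` and then keep only the rows whose link is in the returned list, for example with `df[df["link"].isin(kept)]`.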

Poopaye
  • I don't think this will work. A close similarity doesn't tell you whether the links have the same or different content. – Celius Stingher Jan 25 '21 at 11:58
  • You can't avoid this problem, but he could never write a function that covers all cases, because the cases can change (and no single case [like a different number at the end] implies that the content is the same). – Poopaye Jan 25 '21 at 12:00
  • OP could try and send a request and compare the results – Celius Stingher Jan 25 '21 at 12:40
  • You should write an answer and not comments. If you have a solution, reply to OP! I just wrote that it would be ONE solution, not the one and only... – Poopaye Jan 25 '21 at 12:41