Pandas: How to avoid duplicated values when the value is a URL?

Question

I have a column in my dataframe for articles that looks like this:

id link
1  https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
2  https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
3  other link

For example, the first two URLs look the same but differ here:

d-un-deal

My dataframe contains some links that are almost identical. The content is the same but the link changes: sometimes the only difference between two links is a letter that is uppercase in one of them, or some other single character.

Example:

url1 = https://site/presidency...
url2 = https://site/Presidency...

url3 = https://site/news-of-today

url4 = same as url3 but with ?autoplay at the end

How can I check all the links and delete the duplicates (same content, but the link differs slightly)?

Create a function to detect duplicate rows, then use `apply()` to filter the rows – deadshot, Jan 25 '21 at 10:37
Thank you. The problem is that sometimes they are not exact duplicates, because the URL differs by just one character, for example – jos97, Jan 25 '21 at 10:55
For the upper/lowercase differences you can convert all the text to lowercase and then drop duplicates; for the other cases I don't really know – Celius Stingher, Jan 25 '21 at 11:01
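
The lowercasing idea from the last comment can be stretched to cover the `?autoplay` case as well. Below is a minimal sketch, not a definitive solution: `normalize_url` is a hypothetical helper, the example links are made up to mirror url1 to url4, and it assumes that case and query strings never distinguish two different articles on the site in question (which is not true for every site).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Build a comparison key: lowercase everything, drop query and fragment.

    Assumption: on this site, case and query parameters do not change
    which article is served.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/").lower()
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


# Made-up links mirroring url1..url4 from the question.
links = [
    "https://site/Presidency",
    "https://site/presidency",
    "https://site/news-of-today",
    "https://site/news-of-today?autoplay",
]

# Keep the first link seen for each normalized key.
seen = set()
unique_links = []
for link in links:
    key = normalize_url(link)
    if key not in seen:
        seen.add(key)
        unique_links.append(link)
```

With a dataframe `df` shaped like the one in the question, the same key could be applied as `df[~df["link"].map(normalize_url).duplicated()]`, which keeps the first row for each key. This does not catch the `dun-deal` / `d-un-deal` case, since those two paths differ after normalization.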

Poopaye · Answer 1 · 2021-01-25T12:44:01.000

Here is one solution:

Find the similarity metric between two strings

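You could use a string similarity metric for this: compute a score for each pair of links and treat pairs that score above a chosen threshold as duplicates. Decide which similarity measure you want to use.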
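
As one possible illustration of the idea, here is a sketch using `difflib.SequenceMatcher` from the standard library. The answer does not name a specific metric, so this choice, the helper names `similarity` and `drop_near_duplicates`, the third example link, and the `0.95` threshold are all assumptions. A high score only says the strings are alike, not that the pages have the same content, so the threshold would need checking against real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_near_duplicates(links, threshold=0.95):
    """Keep a link only if it is not too similar to any link already kept.

    Compares every link with every kept link, so the cost grows
    quadratically with the number of links.
    """
    kept = []
    for link in links:
        if all(similarity(link, other) < threshold for other in kept):
            kept.append(link)
    return kept


links = [
    "https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg",
    "https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg",
    "https://www.msn.com/some-unrelated-article/ar-XYZ",  # made-up link
]

deduplicated = drop_near_duplicates(links)
```

Here the second link is dropped because it differs from the first by a single character, while the made-up third link is kept. For a dataframe, the list could come from `df["link"].tolist()`.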

edited Jan 25 '21 at 12:44

answered Jan 25 '21 at 11:05


I don't think this will work. A high similarity score doesn't tell you whether two links have the same or different content. – Celius Stingher Jan 25 '21 at 11:58
You can't avoid this problem, but OP could never write a function that covers every case, because the cases can change (and not every case [like a different number at the end] implies the same content). – Poopaye Jan 25 '21 at 12:00
OP could try sending a request to each URL and comparing the responses – Celius Stingher Jan 25 '21 at 12:40
You should write an answer rather than comments. If you have a solution, reply to OP! I only wrote that it would be ONE solution, not the one and only... – Poopaye Jan 25 '21 at 12:41
