
I'm racking my brain over this problem.

I have a list of urls and I want to keep the first unique url per php page.

So example input:

Example Output:

So it must clean a file and only output one url per unique page.

    any code that you have tried? – Yash Sep 22 '20 at 11:24
  • Does this answer your question? [Removing duplicates in lists](https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists) – AcK Sep 22 '20 at 11:35
  • Does this answer your question? [How might I remove duplicate lines from a file?](https://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file) – Tomerikoo Sep 22 '20 at 12:58

1 Answer


You can parse the url either yourself with regex/split() or with a url parser such as urllib.parse.
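For example, here is a minimal sketch of the split() route (page_of is a hypothetical helper name, not part of the answer's code):

```python
# Hypothetical helper: extract the page path from a url using only str.split().
def page_of(url):
    no_query = url.split("?", 1)[0]               # drop the query string
    host_and_path = no_query.split("://", 1)[-1]  # drop the scheme if present
    parts = host_and_path.split("/", 1)           # split the host from the path
    return "/" + parts[1] if len(parts) > 1 else "/"

print(page_of("http://www.example.com/index.php?id=1"))  # /index.php
```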

Store the path (page) as a key in a dict (dict lookup is O(1) on average). Check whether the path is already there; if not, add the page as a key with the url as its value.

Taking the dict's values then gives you only the first url per unique page.

from urllib.parse import urlparse

list_url = [
    "http://www.example.com/index.php?id=1",
    "http://www.example.com/index.php?id=2",
    "http://www.example.com/page.php?id=1",
    "http://www.example.com/page.php?id=2",
    "blog.example.com/page.php?id=2",
    "subdomain.example.com/folder/page.php?id=2",
]

mydict = {}
for url in list_url:
    url_parsed = urlparse(url)
    path = url_parsed.path

    if path not in mydict:
        mydict[path] = url

Taking the dictionary values and converting them to a list:

print(list(mydict.values()))

As @waps noted, this can be condensed into a similar dict-comprehension structure; use it if keeping the first id is not your concern (the comprehension keeps the last occurrence per page):

list({ urlparse(url).path:url for url in list_url }.values())

Output:

['http://www.example.com/index.php?id=2',
 'http://www.example.com/page.php?id=2',
 'blog.example.com/page.php?id=2',
 'subdomain.example.com/folder/page.php?id=2']

A.B
  • Thanks, seems logical.. But what about adding these two urls to the list: "http://blog.example.com/page.php?id=2", "http://subdomain.example.com/folder/page.php?id=2" Those are also unique, but when I add these to the list the original http://www.example.com/page?id=1 is not added – John Geenen Sep 22 '20 at 12:18
  • Always try to provide all possible scenarios. – A.B Sep 22 '20 at 12:56
  • For the question, it was tricky to judge. I do close questions myself when there is a need. The question looked well-asked (it had inputs and outputs) and a clear problem (removing the 2nd occurrence of a url). Moreover, we need to be nice to newcomers. – A.B Sep 22 '20 at 13:03
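The first comment on the answer points out a real gap: keying on the path alone collapses the same page across different hosts. One possible fix, sketched below under the same variable names as the answer (this is an assumption, not the answer's original code), is to key on the (netloc, path) pair instead:

```python
from urllib.parse import urlparse

# Sketch: key on (host, path) so the same page on different subdomains
# stays unique. Urls given with a scheme put their host in .netloc;
# scheme-less urls keep the host inside .path, which still yields a
# distinct key, so both forms remain separated.
list_url = [
    "http://www.example.com/page.php?id=1",
    "http://blog.example.com/page.php?id=2",
    "http://subdomain.example.com/folder/page.php?id=2",
]

mydict = {}
for url in list_url:
    parsed = urlparse(url)
    key = (parsed.netloc, parsed.path)
    if key not in mydict:
        mydict[key] = url

print(list(mydict.values()))  # all three urls survive
```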