
I have a list of urls. How do I remove a url if it does not point to valid HTML? For example, I want to remove pdf/jpy files and also want to remove duplicated domains.

example_list = [
'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
'https://the1955club.com/', 
'https://the1955club.com/aboutus']

so in the new list it should return the below

new_list = ['https://the1955club.com/']

Kelly Tang
  • you could use `urllib.parse` to split the url into parts and keep urls which don't have anything after the domain. Alternatively you can `split('/')` and keep urls which have only 3 parts (or 4 parts where the last is empty) – furas Sep 13 '22 at 15:06
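A minimal sketch of the comment's `split('/')` idea (the variable names are mine; `'https://the1955club.com/'.split('/')` gives `['https:', '', 'the1955club.com', '']`, so a bare domain splits into 3 parts, or 4 parts where the last is empty):

```python
example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus',
]

# keep urls with nothing after the domain: 3 parts, or 4 parts with an empty last part
new_list = [url for url in example_list
            if len(url.split('/')) == 3
            or (len(url.split('/')) == 4 and url.split('/')[3] == '')]

print(new_list)  # ['https://the1955club.com/']
```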

2 Answers


First: in Python it is better to create a new list with the elements which you want to keep and later assign it to the old variable.


You can use urllib.parse.urlparse(url) to split the url and check whether its .path is "" or "/":

import urllib.parse

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
    'https://the1955club.com/', 
    'https://the1955club.com/aboutus'
]

new_list = []

for url in example_list:
    parts = urllib.parse.urlparse(url)
    # keep only urls whose path is empty or just '/'
    if parts.path in ("", "/"):
        new_list.append(url)
        
example_list = new_list

print(example_list)
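The same filter can also be written as a single list comprehension assigned back to the old name, which is a common idiom for the "build a new list, then reassign" pattern described above:

```python
import urllib.parse

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus',
]

# keep only urls whose path is empty or just '/'
example_list = [url for url in example_list
                if urllib.parse.urlparse(url).path in ("", "/")]

print(example_list)  # ['https://the1955club.com/']
```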

EDIT:

But if you want to filter urls which really are not valid, then you may need to use requests and try to get each url. If you don't get a result, then the url is wrong.
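A minimal sketch of that idea, assuming the third-party `requests` package is installed (the `is_html_url` helper name, the use of a HEAD request, and the Content-Type check are my additions, not part of the answer):

```python
import requests


def is_html_url(url, timeout=5):
    """Hypothetical helper: True if the url responds OK and serves HTML."""
    try:
        # HEAD avoids downloading the body; follow redirects to the final page
        response = requests.head(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return False  # connection error, timeout, invalid url, etc.
    if not response.ok:
        return False
    # some servers omit Content-Type; treat a missing header as not-HTML here
    return 'text/html' in response.headers.get('Content-Type', '')
```

You could then filter with `[url for url in example_list if is_html_url(url)]`, at the cost of one network request per url.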

furas

This will check both the url and the file path; if it finds a repeated domain or a bad file path (ending with .pdf, .jpeg, etc.), it will ignore it.

import re

valid_domains = set()

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
    'https://the1955club.com/', 
    'https://the1955club.com/aboutus',
    'http://the1955club.com']

for example in example_list:
    # capture the domain without the scheme (so http/https duplicates collapse)
    # and the path, which may be empty
    res = re.search(r'https?://(.+?)(/.*|$)', example)
    # https://stackoverflow.com/questions/374930/validating-file-types-by-regular-expression
    extension = re.findall(r'^.*\.(jpg|JPG|gif|GIF|doc|DOC|pdf|PDF|jpy|JPY)$', res.group(2))
    if res.group(1) in valid_domains or len(extension):
        continue

    valid_domains.add(res.group(1))

print(valid_domains)

Output:

 {'the1955club.com'}
JRose