
I have a list of urls. How do I remove a url if it does not point to valid HTML? For example, I want to remove pdf/jpy files and also want to remove duplicated domains.

example_list = [
'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
'https://the1955club.com/', 
'https://the1955club.com/aboutus']

so in the new list it should return the below

new_list = ['https://the1955club.com/']

Kelly Tang
  • you could use `urllib.parse` to split the url into parts and keep urls which don't have anything after the domain. Alternatively you can `split('/')` and keep urls which have only 3 parts (or 4 parts where the last is empty) – furas Sep 13 '22 at 15:06
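A minimal sketch of the comment's `split('/')` idea (the variable names are mine; `'https://the1955club.com/'.split('/')` gives `['https:', '', 'the1955club.com', '']`, so a bare domain splits into 3 parts, or 4 parts where the last is empty):

```python
example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus',
]

# keep urls with nothing after the domain: 3 parts, or 4 parts with an empty last part
new_list = [url for url in example_list
            if len(url.split('/')) == 3
            or (len(url.split('/')) == 4 and url.split('/')[3] == '')]

print(new_list)  # ['https://the1955club.com/']
```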

2 Answers


First: in Python it is better to create a new list with the elements which you want to keep and later assign it to the old variable.


You can use urllib.parse.urlparse(url) to split the url and check whether its .path is "" or "/":

import urllib.parse

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
    'https://the1955club.com/', 
    'https://the1955club.com/aboutus'
]

new_list = []

for url in example_list:
    parts = urllib.parse.urlparse(url)
    # keep only urls whose path is empty or just '/'
    if parts.path in ("", "/"):
        new_list.append(url)
        
example_list = new_list

print(example_list)
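The same filter can also be written as a single list comprehension assigned back to the old name, which is a common idiom for the "build a new list, then reassign" pattern described above:

```python
import urllib.parse

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf',
    'https://the1955club.com/',
    'https://the1955club.com/aboutus',
]

# keep only urls whose path is empty or just '/'
example_list = [url for url in example_list
                if urllib.parse.urlparse(url).path in ("", "/")]

print(example_list)  # ['https://the1955club.com/']
```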

EDIT:

But if you want to filter urls which really are not valid, then you may need to use requests and try to get each url. If you don't get a result, then the url is wrong.
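A minimal sketch of that idea, assuming the third-party `requests` package is installed (the `is_html_url` helper name, the use of a HEAD request, and the Content-Type check are my additions, not part of the answer):

```python
import requests


def is_html_url(url, timeout=5):
    """Hypothetical helper: True if the url responds OK and serves HTML."""
    try:
        # HEAD avoids downloading the body; follow redirects to the final page
        response = requests.head(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return False  # connection error, timeout, invalid url, etc.
    if not response.ok:
        return False
    # some servers omit Content-Type; treat a missing header as not-HTML here
    return 'text/html' in response.headers.get('Content-Type', '')
```

You could then filter with `[url for url in example_list if is_html_url(url)]`, at the cost of one network request per url.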

furas

This will check both the url and the file path; if it finds a repeated domain or a bad file path (ending with .pdf, .jpeg, etc.), it will ignore it.

import re

valid_domains = set()

example_list = [
    'https://ocp.dc.gov/sites/default/files/dc/sites/ocp/publication/attachments/Report-of-Contracting-Activity-Part-I.pdf', 
    'https://the1955club.com/', 
    'https://the1955club.com/aboutus',
    'http://the1955club.com']

for example in example_list:
    # capture the domain without the scheme (so http/https duplicates collapse)
    # and the path, which may be empty
    res = re.search(r'https?://(.+?)(/.*|$)', example)
    # https://stackoverflow.com/questions/374930/validating-file-types-by-regular-expression
    extension = re.findall(r'^.*\.(jpg|JPG|gif|GIF|doc|DOC|pdf|PDF|jpy|JPY)$', res.group(2))
    if res.group(1) in valid_domains or len(extension):
        continue

    valid_domains.add(res.group(1))

print(valid_domains)

Output:

 {'the1955club.com'}
JRose