
I am trying to make sure that the relative links are saved as absolute links in this CSV (using URL parsing). I am also trying to remove duplicates, which is why I created the variable `ddupe`.

I keep getting all the relative URLs saved when I open the CSV on the desktop. Can someone please help me figure this out? I thought about calling `set` just like in this question: How do you remove duplicates from a list whilst preserving order?

#Import the requests library to make HTTP requests
#Import the bs4 library to extract / parse html and xml files
#Utilize urlparse to change relative URLs to absolute URLs
#Import csv (built-in package) to read / write CSV files that open in Microsoft Excel
from bs4 import BeautifulSoup  
import requests
from urllib.parse import urlparse
import csv

#create the page variable
#associate page with the request to obtain the information from raw_html
#store the html content as text
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
parsed = urlparse(page)
raw_html = page.text    # declare the raw_html variable

soup = BeautifulSoup(raw_html, 'html.parser')  # parse the html

#remove duplicate htmls
ddupe = open('page.text', 'r').readlines()
ddupe_set = set(ddupe)
out = open('page.text', 'w')
for ddupe in ddupe_set:
    out.write(ddupe)

T = [["US Census Bureau Links"]] #Title

#Finds all the links
links = map(lambda link: link['href'], soup.find_all('a', href=True)) 

with open("US_Census_Bureau_links.csv","w",newline="") as f:    
    cw=csv.writer(f, quoting=csv.QUOTE_ALL) #Create a file handle for csv writer                          
    cw.writerows(T)             #Creates the Title
    for link in links:                              #Writes each link to the csv
        cw.writerow([link])

f.close()       #redundant: the with block already closes the file
Mega_maha
1 Answer


The function you're looking for is urljoin, not urlparse (both from the same package, urllib.parse). It should be used somewhere after this line:

links = map(lambda link: link['href'], soup.find_all('a', href=True))

Use a list comprehension, or map + lambda like you did here, to join each relative URL with the base URL.
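A minimal sketch of what that looks like, using a couple of made-up hrefs for illustration (the base URL is the page you're scraping). `urljoin` resolves relative paths against the base and leaves already-absolute URLs untouched, and `dict.fromkeys` deduplicates while preserving order:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'

# Sample hrefs, as they might come out of soup.find_all('a', href=True)
hrefs = ['/data/tables.html',
         'https://www.census.gov/newsroom.html',
         '/data/tables.html']

# Resolve each href against the base page URL
absolute = [urljoin(base, h) for h in hrefs]

# Remove duplicates while preserving order (dicts keep insertion order)
unique = list(dict.fromkeys(absolute))
print(unique)
# ['https://www.census.gov/data/tables.html',
#  'https://www.census.gov/newsroom.html']
```

This replaces both your `urlparse(page)` call and the `page.text` file round-trip: do the joining and deduplication in memory, then write `unique` to the CSV.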

m01010011