
I'm using Python 3 to write a web scraper that pulls URL links and writes them to a CSV file. The code does this successfully; however, there are many duplicates. How can I create the CSV file with only a single (unique) instance of each URL?

Thanks for the help!

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('url')
soup = BeautifulSoup(r.text, 'html.parser')

data = []
for link in soup.find_all('a', href=True):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('base-url',link.get('href'))) 
        data.append(urljoin('base-url',link.get('href')))

with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
JJTX
    `for row in list(set(data)):`? A `set` cannot have duplicate values, so conversion to a set and back again is a quick fix. You could always make `data = set()` and `.add` entries to the set rather than `.append()` to the list, for the same effect. – roganjosh Feb 08 '19 at 16:28
  • Also note that your question really just boils down to "How to remove duplicates from a list?" since you already know how to use all the rest of the infrastructure to get your CSV - the only complication is the contents of the list. – roganjosh Feb 08 '19 at 16:33
  • 2
    Possible duplicate of [Removing duplicates in lists](https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists) – roganjosh Feb 08 '19 at 16:34
  • Just have a check when you append it. See if the item you wanna append is already in the list with `in` – AsheKetchum Feb 08 '19 at 16:34
  • @AsheKetchum no, don't do that. That's O(N) in list lookup for every value so needlessly slow. – roganjosh Feb 08 '19 at 16:35
  • @roganjosh what if you checked with a hash table? :) – AsheKetchum Feb 08 '19 at 16:39
  • @AsheKetchum You mean like a `set` that I already proposed before your suggestion of scanning the entire list? – roganjosh Feb 08 '19 at 16:40
  • @roganjosh Why are you salty? The difference is checking before appending and checking after. – AsheKetchum Feb 08 '19 at 16:43
  • @AsheKetchum I'm not trying to be salty. You made a comment that was worse than the solution already commented, plus the dupe I raised, and only clarified when I responded. If you knew the hash lookup was essential then it should have been included. Also, it is _not_ the same in terms of speed, at all. Test it. – roganjosh Feb 08 '19 at 16:47
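A minimal sketch of the set-based collection roganjosh suggests above, building `data` as a `set` and calling `.add()` so duplicates are dropped as they are found (the 'url' and 'base-url' placeholders are kept from the question):

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('url')  # placeholder target URL, as in the question
soup = BeautifulSoup(r.text, 'html.parser')

data = set()  # a set keeps at most one copy of each URL
for link in soup.find_all('a', href=True):
    if '#' not in link['href']:  # skip in-page anchor links
        data.add(urljoin('base-url', link.get('href')))  # placeholder base URL

with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])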

1 Answer


Using `set()` somewhere along the line is the way to go. In the code below, I've added it as `data = set(data)` on its own line to best illustrate the usage. Replacing `data` with `set(data)` drops your list of roughly 250 URLs to around 130:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r.text, 'html.parser')

data = []
# set() around find_all() drops repeated <a> tags before the loop starts
for link in set(soup.find_all('a', href=True)):
    if '#' in link['href']:
        pass  # skip in-page anchor links
    else:
        print(urljoin('https://www.census.gov', link.get('href')))
        data.append(urljoin('https://www.census.gov', link.get('href')))

# converting the list to a set removes any remaining duplicate URLs
data = set(data)

with open('CensusLinks.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
Chris Larson
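One trade-off with `set()` is that it discards the order in which the links appeared on the page. If that order matters, an order-preserving alternative (a sketch, not part of the answer above) is to replace `data = set(data)` with `dict.fromkeys`, which keeps first-seen order in Python 3.7+:

# dict keys are unique and, since Python 3.7, preserve insertion order
data = list(dict.fromkeys(data))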