
I'm using Python 3 to write a web scraper that pulls URL links and writes them to a CSV file. The code does this successfully; however, there are many duplicates. How can I create the CSV file with only a single (unique) instance of each URL?

Thanks for the help!

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('url')
soup = BeautifulSoup(r.text, 'html.parser')

data = []
for link in soup.find_all('a', href=True):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('base-url',link.get('href'))) 
        data.append(urljoin('base-url',link.get('href')))

with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
JJTX
    `for row in list(set(data)):`? A `set` cannot have duplicate values, so conversion to a set and back again is a quick fix. You could always make `data = set()` and `.add` entries to the set rather than `.append()` to the list, for the same effect. – roganjosh Feb 08 '19 at 16:28
  • Also note that your question really just boils down to "How to remove duplicates from a list?" since you already know how to use all the rest of the infrastructure to get your CSV - the only complication is the contents of the list. – roganjosh Feb 08 '19 at 16:33
  • 2
    Possible duplicate of [Removing duplicates in lists](https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists) – roganjosh Feb 08 '19 at 16:34
  • Just have a check when you append it. See if the item you wanna append is already in the list with `in` – AsheKetchum Feb 08 '19 at 16:34
  • @AsheKetchum no, don't do that. That's O(N) in list lookup for every value so needlessly slow. – roganjosh Feb 08 '19 at 16:35
  • @roganjosh what if you checked with a hash table? :) – AsheKetchum Feb 08 '19 at 16:39
  • @AsheKetchum You mean like a `set` that I already proposed before your suggestion of scanning the entire list? – roganjosh Feb 08 '19 at 16:40
  • @roganjosh Why are you salty? The difference is checking before appending and checking after. – AsheKetchum Feb 08 '19 at 16:43
  • @AsheKetchum I'm not trying to be salty. You made a comment that was worse than the solution already commented, plus the dupe I raised, and only clarified when I responded. If you knew the hash lookup was essential then it should have been included. Also, it is _not_ the same in terms of speed, at all. Test it. – roganjosh Feb 08 '19 at 16:47
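A minimal sketch of the set-based collection roganjosh suggests above, building `data` as a `set` and calling `.add()` so duplicates are dropped as they are found (the 'url' and 'base-url' placeholders are kept from the question):

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('url')  # placeholder target URL, as in the question
soup = BeautifulSoup(r.text, 'html.parser')

data = set()  # a set keeps at most one copy of each URL
for link in soup.find_all('a', href=True):
    if '#' not in link['href']:  # skip in-page anchor links
        data.add(urljoin('base-url', link.get('href')))  # placeholder base URL

with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])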

1 Answer


Using `set()` somewhere along the line is the way to go. In the code below, I've added it as `data = set(data)` on its own line to best illustrate the usage. Replacing `data` with `set(data)` drops your list of roughly 250 URLs to around 130:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r.text, 'html.parser')

data = []
# set() around find_all() drops repeated <a> tags before the loop starts
for link in set(soup.find_all('a', href=True)):
    if '#' in link['href']:
        pass  # skip in-page anchor links
    else:
        print(urljoin('https://www.census.gov', link.get('href')))
        data.append(urljoin('https://www.census.gov', link.get('href')))

# converting the list to a set removes any remaining duplicate URLs
data = set(data)

with open('CensusLinks.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
Chris Larson
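One trade-off with `set()` is that it discards the order in which the links appeared on the page. If that order matters, an order-preserving alternative (a sketch, not part of the answer above) is to replace `data = set(data)` with `dict.fromkeys`, which keeps first-seen order in Python 3.7+:

# dict keys are unique and, since Python 3.7, preserve insertion order
data = list(dict.fromkeys(data))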