
I know that my code is finding all of the URLs, but the results contain duplicates, so I'd like to know how to remove the duplicates or find each URL only once. Can somebody tell me what I'm doing wrong here? Thank you.

import requests, datetime
from bs4 import BeautifulSoup
from apscheduler.schedulers.blocking import BlockingScheduler


#This function pulls arcade listings from Los Angeles and Orange County craigslist and parses keywords.

def arcade_search():

    now = datetime.datetime.now()
    url1 = 'https://orangecounty.craigslist.org/search/sss?query=arcade&sort=rel'
    url2 = 'https://losangeles.craigslist.org/search/sss?query=arcade&sort=rel'
    r1 = requests.get(url1)
    r2 = requests.get(url2)

    print(r1.status_code, r2.status_code)


    data1 = r1.text
    data2 = r2.text
    #print(data1)
    soup = BeautifulSoup(data1 + data2, 'html.parser')
    for link in soup.findAll('a'):
        listing1 = link.get('href')
        if 'millipede' in listing1.lower():
            print('millipede was found! ' + listing1)


arcade_search()
Tbone
  • See https://docs.python.org/2/library/stdtypes.html#set and scroll up to section 5.7. You will see that Python has sets to enforce uniqueness, and that reference explains how they work. Also check out https://stackoverflow.com/questions/9718541/reconstructing-absolute-urls-from-relative-urls-on-a-page – yet another David May 07 '18 at 04:12
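
Following the comment's pointers, here is a minimal sketch of both ideas: collect hrefs in a set so each one is reported only once, and use urllib.parse.urljoin to resolve relative hrefs against the page URL. The variable names are illustrative, not from the question.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'https://orangecounty.craigslist.org/search/sss?query=arcade&sort=rel'
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

seen = set()  # a set enforces uniqueness: adding a duplicate is a no-op
for link in soup.find_all('a'):
    href = link.get('href')
    if href is None:
        continue  # some <a> tags carry no href attribute
    absolute = urljoin(base_url, href)  # relative hrefs become absolute URLs
    if 'millipede' in absolute.lower() and absolute not in seen:
        seen.add(absolute)
        print('millipede was found! ' + absolute)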

1 Answer


Generally speaking, if you want to search through something and keep only unique values, you need to maintain a list of the values you have already found. For example:

import requests, datetime
from bs4 import BeautifulSoup


def arcade_search():
    now = datetime.datetime.now()
    url1 = 'https://orangecounty.craigslist.org/search/sss?query=arcade&sort=rel'
    url2 = 'https://losangeles.craigslist.org/search/sss?query=arcade&sort=rel'
    r1 = requests.get(url1)
    r2 = requests.get(url2)
    found = []  # links already printed, used to skip duplicates

    print(r1.status_code, r2.status_code)

    data1 = r1.text
    data2 = r2.text
    soup = BeautifulSoup(data1 + data2, 'html.parser')
    for link in soup.findAll('a'):
        listing1 = link.get('href')
        # skip anchors without an href, and anything already seen
        if listing1 and 'millipede' in listing1.lower() and listing1 not in found:
            print('millipede was found! ' + listing1)
            found.append(listing1)


arcade_search()

However, in this particular case there is likely more work to do. My guess is that you are finding duplicate links across the responses to the two different GET requests, not within either one. That is, the links within the response from get(url1) are all unique, and likewise for get(url2), but some of the links returned by get(url1) also show up in get(url2) and vice versa, because the Orange County and Los Angeles areas are not disjoint. Moreover, Craigslist appears to return links relative to the area in which you are searching, e.g. https://orangecounty.craigslist.org/vgm/d/restored-arcade-game-with-800/6495065079.html vs. https://losangeles.craigslist.org/wst/sgd/d/namco-upright-classic-arcade/6511455107.html. So when you say the URLs are duplicates, you really mean that two different URLs point to the same page.

Assuming that is what you are experiencing, the code above won't solve your problem, because it can only detect URLs that are literally identical. You could try keeping track of only the part after d/, or maybe only the descriptive string portion (by which I mean, for example, 'namco-upright-classic-arcade'), but I can't guarantee that these will be the same for the same listing across search regions, or different for different listings across regions.
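
As a sketch of that last idea, one could deduplicate on the path after d/ rather than on the full URL, so region-specific hostnames don't make the same post look unique. The second example URL below is a hypothetical cross-region duplicate, and whether this key really is stable across regions is exactly the assumption flagged above.

from urllib.parse import urlparse

def listing_key(url):
    # Reduce a listing URL to the part after '/d/'; fall back to the
    # full URL if the path doesn't contain that marker.
    path = urlparse(url).path  # e.g. '/vgm/d/restored-arcade-game-with-800/6495065079.html'
    if '/d/' in path:
        return path.split('/d/', 1)[1]  # 'restored-arcade-game-with-800/6495065079.html'
    return url

urls = [
    'https://orangecounty.craigslist.org/vgm/d/restored-arcade-game-with-800/6495065079.html',
    'https://losangeles.craigslist.org/vgm/d/restored-arcade-game-with-800/6495065079.html',  # hypothetical duplicate
]

seen = set()
for u in urls:
    key = listing_key(u)
    if key not in seen:
        seen.add(key)
        print(u)  # only the first of the two duplicates prints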

alpacahaircut