
I am scraping craigslist.org, but after a certain number of requests it starts blocking my device. I tried the solution in "Proxies with Python 'Requests' module", but I didn't understand how to specify the headers every time. Here's the code:

from bs4 import BeautifulSoup
import requests,json

list_of_tuples_with_given_zipcodes = []
id_of_apartments = []

params = {
    'sort': 'dd',
    'filter': 'reviews-dd',
    'res_id': 18439027
}

http_proxy  = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy   = "ftp://10.10.1.10:3128"

proxies = {
    "http":  http_proxy,
    "https": https_proxy,
    "ftp":   ftp_proxy,
}

for i in range(1, 30):
    # paginated search results, e.g. https://losangeles.craigslist.org/search/apa?s=120
    content = requests.get('https://losangeles.craigslist.org/search/apa?s=' + str(i), params=params)
    soup = BeautifulSoup(content.content, 'html.parser')
    URL_to_look_for_zipcode = soup.find_all("a", {"class": "result-title"})
    for each_href in URL_to_look_for_zipcode:
        # each listing page embeds <script id="ld_posting_data" type="application/ld+json">
        content_href = requests.get(each_href['href'])
        soup_href = BeautifulSoup(content_href.content, 'html.parser')
        my_script_tags = soup_href.find("script", {"id": "ld_posting_data"})
        if my_script_tags:
            res = json.loads(my_script_tags.string)
            if res and 'address' in res:
                if res['address']['postalCode'] == "90012":    # use the zip code entered by the user
                    list_of_tuples_with_given_zipcodes.append(each_href['href'])

I am still not sure what value the http_proxy variable should have. I used the value given in that answer, but should it instead be my device's IP address mapped to a localhost port? Either way, the site still keeps blocking my requests.

Please help.

QUEEN
  • Proxy settings are for use on work or school networks which only allow internet access through a proxy server they control, and software has to be told which proxy to use. If you don't have to use a proxy server to get to the internet, the settings are not necessary. (A HTTP proxy will not help you scrape websites) – TessellatingHeckler Apr 20 '22 at 17:47
    Also doing this is against Craigslist's terms of service, and therefore potentially illegal under computer misuse laws, and off-topic here for asking about getting around a company's protections which they put in to stop people doing this kind of thing. ("*You agree not to copy/collect CL content via robots, spiders, scripts, scrapers, crawlers, or any automated or manual equivalent (e.g., by hand)*" - https://www.craigslist.org/about/terms.of.use/en ) – TessellatingHeckler Apr 20 '22 at 17:49

1 Answer


Requests' get method lets you specify the proxies to use on a per-call basis:

r = requests.get(url, headers=headers, proxies=proxies)
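
For example, a minimal sketch of a full call (the proxy addresses and the User-Agent string below are placeholders, not values known to work):

import requests

# Placeholder proxies; substitute a proxy server you actually control or rent.
proxies = {
    "http":  "http://10.10.1.10:3128",
    "https": "https://10.10.1.11:1080",
}

# Browser-like headers; many sites block the default python-requests User-Agent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

r = requests.get("https://losangeles.craigslist.org/search/apa",
                 params={"s": 120},
                 headers=headers,
                 proxies=proxies,
                 timeout=10)
print(r.status_code)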

Nacho R
  • What should the definitions of `headers` and `proxies` be here? Is `proxies` the same as the one defined in my code? – QUEEN Apr 20 '22 at 19:07
  • Headers and proxies are both dictionaries. You can omit headers since you are not setting any, and keep proxies in the same format you already defined (see the sketch after these comments). – Nacho R Apr 21 '22 at 20:08
  • Thank you so much for the reply, but even when using the proxies, the site still blocks my device after a certain number of requests. – QUEEN Apr 21 '22 at 22:14
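
Following up on the comments above, a minimal sketch of both dictionaries in use with a requests.Session, which attaches the headers to every request automatically (the proxy addresses and header values are placeholders):

import requests

session = requests.Session()

# Set once; the session sends these headers with every subsequent request.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "en-US,en;q=0.9",
})

# Placeholder proxies; these must point at a live proxy server you can use.
session.proxies.update({
    "http":  "http://10.10.1.10:3128",
    "https": "https://10.10.1.11:1080",
})

r = session.get("https://losangeles.craigslist.org/search/apa", timeout=10)
print(r.status_code)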