Workaround for blocked GET requests in Python

Question

I'm trying to retrieve and process the results of a web search using requests and beautifulsoup.

I've written some simple code to do the job, and it returns successfully (status = 200), but the content of the response is just an error message, "We're sorry for any inconvenience, but the site is currently unavailable.", and it has been the same for the last several days. Searching within Firefox returns results without issue, however. The same code works with a URL for the UK-based site, so I wonder whether the US site is set up to block attempts to scrape web searches.

Are there ways to mask the fact that I'm retrieving search results from within Python (e.g., masquerading as a standard search within Firefox), or is there some other workaround that allows access to the search results?

Code included for reference below:

import pandas as pd
from requests import get
import bs4 as bs
import re
# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'
# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'
a = get(baseURL)
soup = bs.BeautifulSoup(a.content,'html.parser')

info = soup.find_all('div', class_ = 'information-container')
price = soup.find_all('div', class_ = 'vehicle-price')

d = [] 
for idx, i in enumerate(info):
    ii = i.find_next('ul').find_all('li')

    year_ = ii[0].text
    # Raw strings avoid invalid-escape warnings for the regex patterns
    miles = re.sub(r"[^0-9.]", "", ii[2].text)
    engine = ii[3].text
    hp = re.sub(r"[^\d.]", "", ii[4].text)
    p = re.sub(r"[^\d.]", "", price[idx].text)

    d.append([year_, miles, engine, hp, p])

df = pd.DataFrame(d, columns=['year','miles','engine','hp','price'])

How about [changing your user agent](https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python)? — Michael Kolber, Jul 23 '19 at 01:18
@MichaelKolber, looks like it worked. The Mac default I'd seen around didn't, but once I got my actual FF user-agent, data returned as expected. Appreciate it! — Chris, Jul 23 '19 at 01:55

score 23 · Accepted Answer · answered Jul 23 '19 at 02:11

By default, Requests sends its own distinctive user agent, which identifies the client as python-requests.

>>> import requests
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

It is possible that the website you are using is trying to avoid scrapers by denying any request with a user agent of python-requests.
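If you want to check what Requests would send without contacting any site, here is a minimal sketch using `requests.utils.default_headers()`. The exact version suffix depends on the Requests release you have installed, so no specific output is shown.

```python
import requests

# default_headers() returns the headers Requests attaches
# when you don't supply any of your own.
defaults = requests.utils.default_headers()

# Prints something of the form python-requests/<installed version>
print(defaults["User-Agent"])
```

A site that wants to filter scripts only needs to look for that `python-requests/` prefix.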

To get around this, you can change your user agent when sending a request. Since the page loads fine in your browser, simply copy your browser's user agent (you can search for it online, or inspect a request in your browser's developer tools and copy the User-Agent header from there). For me, it's Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I'd set my user agent like this:

>>> headers = {
...     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
... }

and then send the request with the new headers. The new headers are merged with the default headers; they only replace a default that has the same name, and header names are matched case-insensitively:

>>> r = requests.get('https://google.com', headers=headers)  # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Now we can see that the request was sent with our preferred headers, and hopefully the site won't be able to tell the difference between Requests and a browser.
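To apply this to a scraper like the one in the question, one option is to set the header once on a `requests.Session` so every request reuses it. This is only a sketch: `make_session`, `looks_blocked`, and `fetch_html` are hypothetical helper names, the user agent string is the example from above (swap in your own browser's), and the block check simply looks for the error text quoted in the question.

```python
import requests

# Assumption: replace this with the user agent your own browser sends.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
)

# Lower-cased fragment of the error page quoted in the question.
BLOCK_MARKER = "the site is currently unavailable"


def make_session(user_agent=BROWSER_UA):
    """Return a Session whose requests all carry the given user agent."""
    session = requests.Session()
    # Overrides the default python-requests/x.y.z value.
    session.headers.update({"User-Agent": user_agent})
    return session


def looks_blocked(html):
    """Heuristic: a 200 response can still be the 'unavailable' page."""
    return BLOCK_MARKER in html.lower()


def fetch_html(session, url, timeout=10):
    """Fetch a page, failing loudly on HTTP errors or the block page."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    if looks_blocked(response.text):
        raise RuntimeError("Received the 'site unavailable' page; the request may still be blocked")
    return response.text
```

In the question's code, `a = get(baseURL)` would become something like `html = fetch_html(make_session(), baseURL)`, with `html` passed to BeautifulSoup in place of `a.content`. Checking the body matters here because the blocked response still came back with status 200.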

The workaround is so easy I don't get why websites waste their time trying to block access from scripts — Imu Sama, Mar 19 '22 at 23:12
@Roronoa_D._Law There's a lot of cargo cult programming in the world, sadly. — jma, Dec 01 '22 at 20:52
