
My Problem:

I want to scrape the following website: https://www.coches.net/segunda-mano/. But every time I open it with Python Selenium, I get a message saying I have been detected as a bot. How can I bypass this detection? First I tried simple code with Selenium:

from selenium import webdriver
from bs4 import BeautifulSoup  # imported for parsing later, not used yet

browser = webdriver.Chrome('C:/Python38/chromedriver.exe')
URL = 'https://www.coches.net/segunda-mano/'
browser.get(URL)

Then I tried it with requests, but that doesn't work either.

from fake_useragent import UserAgent
import requests

ua = UserAgent()

# The header key must be "User-Agent", not "UserAgent".
headers = {"User-Agent": ua.random}

URL = 'https://www.coches.net/segunda-mano/'
r = requests.get(URL, headers=headers)

print(r.status_code)  # the attribute is status_code, not statuscode

In this case I get a 403 response, the status code stating that access to the URL is forbidden.

I don't know how to access this webpage without getting blocked. I would be very grateful for your help. Thanks in advance.

Georg Klippenstein

3 Answers


Selenium is fairly easily detected, especially by the major anti-bot providers (Cloudflare, Akamai, etc.).

Why?

  1. Selenium, and most other major webdrivers, set a browser variable (that websites can read) called navigator.webdriver to true. You can check this yourself by opening your Google Chrome console and running console.log(navigator.webdriver). In a normal browser it will be false (or undefined).

  2. The User-Agent. Nearly every device sends a "user agent" string identifying the browser and platform accessing the website. A headless Selenium User-Agent looks something like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36. Did you catch that? HeadlessChrome is included, which is another route of detection. Both signals are patched in the sketch after this list.
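Here is a minimal sketch of patching both signals in Selenium (the user-agent string is just an example, and the CDP call is Chrome-only); keep in mind that commercial anti-bot providers check many more signals, so this alone may not get you through:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Send a normal desktop User-Agent so "HeadlessChrome" never appears (example string).
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
)
driver = webdriver.Chrome(options=options)

# Inject a script into every new document that hides navigator.webdriver
# before the site's own JavaScript gets a chance to read it.
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)

driver.get('https://www.coches.net/segunda-mano/')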

These are just two of the many ways a Selenium browser can be detected; I would highly recommend reading up on this and this as well.

And lastly, if you want an easy, drop-in solution that implements almost all of the concepts we've talked about, I'd suggest using undetected-chromedriver. It is an open-source project that tries its best to keep your Selenium chromedriver looking human.
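A minimal usage sketch, assuming the package is installed (pip install undetected-chromedriver); uc.Chrome() patches the chromedriver binary and browser flags for you:

import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://www.coches.net/segunda-mano/')
print(driver.page_source[:500])  # quick sanity check that the page loaded
driver.quit()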

dir

I think your problem is not bot detection. You can't use plain requests to get the results from that page, because it makes XHR requests behind the scenes. So you would normally use Selenium, Splash, etc., but that doesn't seem to be possible in this case.

However, if you dig into the page a bit, you can find which URL is requested behind the scenes to display the results. I did that research and found this endpoint (https://ms-mt--api-web.spain.advgo.net/search); it returns JSON, which will ease your work in terms of parsing. Using Chrome dev tools I copied the curl request and mapped it to Python requests, obtaining this code:

import json
import requests

headers = {
    'authority': 'ms-mt--api-web.spain.advgo.net',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/json, text/plain, */*',
    'x-adevinta-channel': 'web-desktop',
    'x-schibsted-tenant': 'coches',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'content-type': 'application/json;charset=UTF-8',
    'origin': 'https://www.coches.net',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.coches.net/',
    'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}

data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'


while True:
    response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
    # You should parse the items here; see the sketch below.
    print(response)
    if not response["items"]:  # an empty page means we have reached the end.
        break
    data_dict = json.loads(data)
    data_dict["pagination"]["page"] = data_dict["pagination"]["page"] + 1  # request the next page.
    data = json.dumps(data_dict)

There are probably a lot of headers and body fields that are unnecessary; you can trim and re-test to find the minimal request.
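For the parsing step itself, here is a sketch; the field names (title, price, km) are guesses at the JSON structure, so check them against an actual response before relying on them:

def parse_items(response):
    # Field names below are assumptions; run print(json.dumps(response, indent=2))
    # on a real response to confirm the actual structure.
    for item in response.get("items", []):
        print(item.get("title"), item.get("price"), item.get("km"))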

Joaquin

Rotating proxies can be useful when scraping large amounts of data:

from selenium.webdriver.chrome.options import Options

options = Options()
# The method is add_argument (singular); #ip:#port is a placeholder.
options.add_argument('--proxy-server=#ip:#port')

Then initialize the Chrome driver with the options object.
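A minimal sketch of that, rotating through a placeholder proxy list by starting a fresh driver per proxy (replace the #ip:#port entries with real proxies):

import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

PROXIES = ['#ip1:#port1', '#ip2:#port2']  # placeholders, not real proxies

def make_driver():
    options = Options()
    # Each new browser session gets a randomly chosen proxy.
    options.add_argument('--proxy-server=%s' % random.choice(PROXIES))
    return webdriver.Chrome(options=options)

driver = make_driver()
driver.get('https://www.coches.net/segunda-mano/')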