
I am using Scrapy with Selenium in order to scrape URLs from a particular search engine (Ekoru). Here is a screenshot of the response I get back from the search engine after just ONE request:

[screenshot of the search engine's response]

Since I am using Selenium, I'd assume my user agent should be fine, so what else could be making the search engine detect the bot immediately?

Here is my code:

import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys


class CompanyUrlSpider(scrapy.Spider):
    name = 'company_url'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://ekoru.org',
            wait_time=3,
            screenshot=True,
            callback=self.parseEkoru
        )

    def parseEkoru(self, response):
        # Drive the browser that scrapy-selenium attached to the response
        driver = response.meta['driver']
        search_input = driver.find_element_by_xpath("//input[@id='fld_q']")
        search_input.send_keys('Hello World')
        search_input.send_keys(Keys.ENTER)

        # Re-parse the rendered page source with a Scrapy selector
        html = driver.page_source
        response_obj = Selector(text=html)

        links = response_obj.xpath("//div[@class='serp-result-web-title']/a")
        for link in links:
            yield {
                'ekoru_URL': link.xpath(".//@href").get()
            }
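For what it's worth, the extraction step at the end of `parseEkoru` can be exercised without a browser at all. The sketch below uses only the standard library's `html.parser` on a made-up SERP fragment; the `serp-result-web-title` markup is an assumption based on the XPath above, not the engine's actual HTML:

```python
from html.parser import HTMLParser

# Hypothetical SERP fragment shaped like the XPath above expects
# (//div[@class='serp-result-web-title']/a) -- structure is assumed.
SAMPLE_HTML = """
<div class='serp-result-web-title'><a href='https://example.com/a'>A</a></div>
<div class='other'><a href='https://example.com/skip'>skip</a></div>
<div class='serp-result-web-title'><a href='https://example.com/b'>B</a></div>
"""


class SerpLinkParser(HTMLParser):
    """Collect hrefs of <a> tags directly inside the result-title divs."""

    def __init__(self):
        super().__init__()
        self.in_title_div = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            self.in_title_div = attrs.get('class') == 'serp-result-web-title'
        elif tag == 'a' and self.in_title_div:
            self.links.append(attrs.get('href'))

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_title_div = False


parser = SerpLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)  # ['https://example.com/a', 'https://example.com/b']
```

This lets you confirm the selector logic is right before debugging the separate question of why the engine blocks the browser.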
Luca Guarro
  • Unless they have some advanced detection or your IP is already on a blacklist, the user agent seems like the most likely culprit. – Carcigenicate Oct 07 '20 at 13:53
  • @Carcigenicate IP is not blacklisted as I can use the site normally – Luca Guarro Oct 07 '20 at 13:56
  • Then I'd try a different user agent. – Carcigenicate Oct 07 '20 at 14:17
  • What's probably happening is that the site is detecting that you're using a webdriver ([Mozilla MDN Doc](https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver)). Here's a great answer that should help you with your scraping issue – [link](https://stackoverflow.com/a/62520191/3613974) and [link](https://stackoverflow.com/a/60403652/3613974) – sm00nie Oct 07 '20 at 14:25

1 Answer


Sometimes you need to pass other parameters in order to avoid being detected as a bot by the webpage.

Let me share a code you can use:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This code helps to simulate a "human being" visiting the website
chrome_options = Options()
chrome_options.add_argument('--start-maximized')
driver = webdriver.Chrome(options=chrome_options, executable_path=r"chromedriver")

# Mask the navigator.webdriver flag before any page script runs
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"}
)

url = 'https://ekoru.org'
driver.get(url)

Yields (note the "Chrome is being controlled by automated test software" bar below the address bar): [screenshot of the site being visited by the automated browser]