
I have working code that fetches details from a URL using Selenium and Python. But I'm facing an issue: after searching 50-plus URLs, Google Chrome shows an "I'm not a robot" prompt and asks me to select the checkbox.

After that I'm unable to get the results, and from then on the results are inconsistent or false.

So is there a way to avoid this "I'm not a robot" CAPTCHA and get consistent results? Or is there anything I need to modify in this code to make it more optimized?

Also, is it possible to open 50 or 100 tabs in the Chrome driver at the same time and search the loaded tabs for the results?

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import psycopg2
import os
import datetime

final_results=[]
positions=[]

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='/users/user_123/downloads/chrome_driver/chromedriver', chrome_options=option)

#def db_connect():
try:
    # Database connection string
    DSN = "dbname='postgres' user='postgres' host='localhost' password='postgres' port='5432'"
    # DWH table to which data is ported
    TABLE_NAME = 'staging.search_url'
    # Connecting to the DB
    conn = psycopg2.connect(DSN)
    print("Database connected...")
    #conn.set_client_encoding('utf-8')
    cur = conn.cursor()
    cur.execute("SET datestyle='German'")
except (Exception, psycopg2.Error) as error:
    print('database connection failed')
    quit()

def get_products(url):
    browser.get(url)
    names = browser.find_elements_by_xpath("//span[@class='pymv4e']")
    upd_product_name_list=list(filter(None, names))
    product_name = [x.text for x in upd_product_name_list]
    product = [x for x in product_name if len(x.strip()) > 2]
    upd_product_name_list.clear()
    product_name.clear()
    return product

##################################
search_url_fetch="""select url_to_be_searched from staging.search_url where id in(65,66,67,68)"""
psql_cursor = conn.cursor()
psql_cursor.execute(search_url_fetch)
search_url_list = psql_cursor.fetchall()
print('Fetched DB values')
##################################

for row in search_url_list:
    # fetchall() returns one-element tuples, so take the URL column directly
    new_url = str(row[0])
    print('Passed URL :' + new_url)
    print("\n")

    filtered = get_products(new_url)
    if not filtered:
        # retry with a modified query and keep the returned result this time
        new_url = new_url + '+kaufen'
        filtered = get_products(new_url)
        print('Modified URL :' + new_url)

    if filtered:
         print(filtered)
         positions.clear()
         for x in range(1, len(filtered)+1):
           positions.append(str(x))
         global_position = len(positions)
         print('global position first: ' + str(global_position))
         print("\n")

         company_name_list = browser.find_elements_by_xpath("//div[@class='LbUacb']")
         # use a list comprehension to get the text values, not the Selenium elements
         company = [x.text for x in company_name_list]
         print('Company Name:')
         print(company, '\n')


         price_list = browser.find_elements_by_xpath("//div[@class='e10twf T4OwTb']")
         # use a list comprehension to get the text values, not the Selenium elements
         price = [x.text for x in price_list]
         print('Price:')
         print(price)
         print("\n")

         find_href = browser.find_elements_by_xpath("//a[@class='plantl pla-unit-single-clickable-target clickable-card']")
         # collect the href attribute of every product link
         urls = [my_href.get_attribute("href") for my_href in find_href]
         print('URLS:')
         print(urls)
         print("\n")

         print('Final Result: ')
         result = zip(positions, filtered, urls, company, price)
         final_results.clear()
         final_results.append(tuple(result))
         print(final_results)
         print("\n")


         print('global position end :' + str(global_position))
         # final_results holds a single tuple of zipped rows; insert each row
         for d in final_results:
             for record in d:
                 cur.execute("""INSERT into staging.pla_crawler_results(position, product_name, url, company, price) VALUES (%s, %s, %s, %s, %s)""", record)
                 print('Inserted successfully')
                 conn.commit()
  • I'm not sure this is an answerable question -- as afarley says, any answer (other than to use an approach endorsed by Google, which at large volumes will require paying them money) is entering into an arms race, and going to eventually break. – Charles Duffy Jan 22 '20 at 20:30
  • probably not. Opening many tabs won't help. Possibly restarting the driver every x number of searches? Getting around captchas seems sort of hostile to me though... Be a Good Internet Citizen™ – pcalkins Jan 22 '20 at 20:30
  • ...keep in mind that SO's goal is to be a giant / long-tail FAQ. Something that works this month and breaks the next is not really a suitable FAQ entry. Indeed, one could argue that as the scope will be constantly growing to encompass new countermeasures and counter-countermeasures, that it's innately "too broad" to be answerable. – Charles Duffy Jan 22 '20 at 20:32

2 Answers


You have two options:

1) Pay for access to Google's search API (a sketch of that route is at the end of this answer). This is the professional way to avoid getting banned.

2) Randomize your script to make it look more human. This approach is an arms race against Google; you can probably get your script working, but it will break periodically.

More optimization (in the performance sense) would probably make this problem worse, not better.
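
If you go with option 1, one sanctioned route is Google's Custom Search JSON API (quotas and billing apply, and it returns regular web results, so it likely won't include the shopping/PLA blocks the question scrapes). A rough sketch, assuming you have created an API key and a Programmable Search Engine ID; both values below are placeholders, not real credentials:

import requests

# Placeholders -- create real values in the Google Cloud console and the
# Programmable Search Engine control panel.
API_KEY = "YOUR_API_KEY"
SEARCH_ENGINE_ID = "YOUR_CX_ID"

def api_search(query):
    # Query the Custom Search JSON API and return (title, link) pairs.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query},
        timeout=30,
    )
    resp.raise_for_status()
    return [(item["title"], item["link"]) for item in resp.json().get("items", [])]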

afarley

Things to try with your code:

  1. Launch Chrome with a head. It doesn't look like you're using headless mode, but it's worth a reminder.
  2. Randomize the time between interactions with the web page; going as fast as possible is the quickest way to trigger a robo-check (see the sketch after this list).
  3. Exclude the enable-automation argument (also sketched below); see https://stackoverflow.com/a/56635123/9642
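
A minimal sketch of points 2 and 3 together, adapted from the question's own driver setup (the delay range is an arbitrary assumption; tune it to your volume):

import random
import time

from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
# Point 3: don't advertise the session as automated.
option.add_experimental_option("excludeSwitches", ["enable-automation"])
option.add_experimental_option("useAutomationExtension", False)
browser = webdriver.Chrome(executable_path='/users/user_123/downloads/chrome_driver/chromedriver',
                           chrome_options=option)

for url in search_urls:  # search_urls stands in for the list fetched from the DB
    browser.get(url)
    # ... scrape the page as in get_products() ...
    # Point 2: pause a randomized, human-looking interval between searches.
    time.sleep(random.uniform(5.0, 15.0))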

Alternatives:

afarley suggested paying for Google's Search API, but I wasn't aware this was even an option. Google used to have a basic search API that was free for the first 10 results (with heavy quota limitations), but I don't see that available anymore.

Neil C. Obremski