
I am trying to scrape the prices of allergy products on Target. For each product, I input every US ZIP code to see the effect of changing the ZIP code on the price, and I use Selenium to enter the ZIP code for each product. However, I have more than 40,000 ZIP codes and 200 products to scrape in total, which is on the order of 8,000,000 lookups. If I run my code as-is, the run time will be far too long (almost 90 days), because Selenium needs about 2 seconds to enter each ZIP code. What should I do to reduce the running time?

from datetime import datetime

import pandas as pd
import pytz
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

while True:
    priceArray = []
    nameArray = []
    zipCodeArray = []
    GMTArray = []

    wait_imp = 10
    CO = webdriver.ChromeOptions()
    CO.add_experimental_option('useAutomationExtension', False)
    CO.add_argument('--ignore-certificate-errors')
    CO.add_argument('--start-maximized')
    wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe', options=CO)

    for url in urlList:
        wd.get(url)
        wd.implicitly_wait(wait_imp)

        for zipcode in zipCodeList:
            try:
                # click the delivery address
                address = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div/div[1]/button[2]")
                address.click()
                # click the Edit location button
                editLocation = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div/div[2]/button")
                editLocation.click()
            except NoSuchElementException:
                # fall back to clicking the Edit location button directly
                editLocation = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div[1]/div/div[1]/button")
                editLocation.click()

            # input the ZIP code
            inputZipCode = wd.find_element(by=By.XPATH, value="//*[@id='enter-zip-or-city-state']")
            inputZipCode.clear()
            inputZipCode.send_keys(zipcode)

            # click submit
            clickSubmit = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div/div[2]/div/div/div[3]/div/button[1]")
            clickSubmit.click()

            # scrape the product name and price for this ZIP code
            name = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[1]/h1/span").text
            nameArray.append(name)
            price = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[1]/span").text
            priceArray.append(price)
            zipCodeArray.append(zipcode)
            # timestamp each observation in London time
            tz = pytz.timezone('Europe/London')
            GMTArray.append(datetime.now(tz))

    data = {'prod-name': nameArray,
            'Price': priceArray,
            'currentZipCode': zipCodeArray,
            'GMT': GMTArray}
    df = pd.DataFrame(data, columns=['prod-name', 'Price', 'currentZipCode', 'GMT'])
    df.to_csv(r'C:\Users\12987\PycharmProjects\Network\priceingAlgoriCoding\export_Target_dataframe.csv', mode='a', index=False, header=True)
    # quit the driver before the next pass so Chrome processes do not accumulate
    wd.quit()
  • Use the threading library to run multiple requests simultaneously, giving each thread a set of ZIP codes to check in parallel. https://timber.io/blog/multiprocessing-vs-multithreading-in-python-what-you-need-to-know/#what-is-threading-why-might-you-want-it- – Captain Caveman May 05 '22 at 18:45
  • Try running in headless mode so the page is not visually rendered (see the sketch after these comments). It might change how the HTML looks, though, so you might have to test that your XPath expressions still work. Or figure out how to do it with plain HTTP requests instead of Selenium. – Diego Cuadros May 05 '22 at 19:19
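
Following up on the headless-mode suggestion above, here is a minimal sketch of the extra ChromeOptions flags, assuming the rest of the setup stays as in the question. Headless rendering can differ slightly from a visible window, so re-test that the XPath expressions still match:

from selenium import webdriver

CO = webdriver.ChromeOptions()
CO.add_argument('--headless')               # render pages without a visible browser window
CO.add_argument('--disable-gpu')            # commonly paired with headless mode on Windows
CO.add_argument('--window-size=1920,1080')  # fix the viewport so responsive layouts stay stable
wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe', options=CO)

Skipping visual rendering saves some CPU per page load, but the bigger win is that headless instances are cheap enough to run several in parallel, which is what the answer below does.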

1 Answer


For Selenium, use Python's concurrent.futures to run several drivers in parallel.

You may check out this answer: link

Here is a snippet using ThreadPoolExecutor:

from selenium import webdriver
from concurrent import futures

def selenium_title(url):
    wdriver = webdriver.Chrome()  # one Chrome webdriver per call
    wdriver.get(url)
    title = wdriver.title
    wdriver.quit()
    return title

links = ["https://www.amazon.com", "https://www.google.com"]

with futures.ThreadPoolExecutor() as executor:  # default number of threads, or pass e.g. max_workers=10
    titles = list(executor.map(selenium_title, links))

You can also use ProcessPoolExecutor:

with futures.ProcessPoolExecutor() as executor:  # default number of processes
    titles = list(executor.map(selenium_title, links))
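
Note that a Selenium worker spends most of its time waiting on the browser, i.e. the work is I/O-bound, so ThreadPoolExecutor is usually sufficient despite the GIL. ProcessPoolExecutor mainly helps if you also do heavy parsing in Python, and it requires the arguments and return values to be picklable.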

Thus you can achieve roughly an x-times speedup, where x is the number of workers.
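
Applied to the workload in the question, a reasonable layout is one driver per worker thread, each responsible for its own slice of the ZIP-code list. This is a minimal sketch, assuming urlList and zipCodeList from the question are available, and using a hypothetical helper set_zip_and_read_price(wd, zipcode) that wraps the click/type/scrape steps shown there:

from concurrent import futures
from selenium import webdriver

NUM_WORKERS = 8  # tune to what your machine's CPU and RAM can sustain

def scrape_chunk(zip_chunk):
    # one driver per thread; a WebDriver instance must never be shared across threads
    wd = webdriver.Chrome()  # pass your chromedriver path here as in the question, if needed
    rows = []
    try:
        for url in urlList:
            wd.get(url)
            for zipcode in zip_chunk:
                # set_zip_and_read_price is a hypothetical helper wrapping the
                # address-click / send_keys / scrape steps from the question
                rows.append(set_zip_and_read_price(wd, zipcode))
    finally:
        wd.quit()
    return rows

# deal the ZIP codes round-robin into one chunk per worker
chunks = [zipCodeList[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

with futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    all_rows = [row for chunk_rows in executor.map(scrape_chunk, chunks)
                for row in chunk_rows]

With 8 workers, the 90-day estimate drops to on the order of 11 days; to go much further you would likely need to spread the work across machines or switch to plain HTTP requests, as suggested in the comments.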

ahmedshahriar