
Python noob here, so I'll try to provide as much detail as I can. I'm experimenting with Python's concurrent.futures module to see if I can speed up some scraping with Selenium. I'm scraping financial data from a site using the following URLs, stored in a csv file titled "inputURLS.csv". I'll keep the list of stocks short and include one fake ticker to exercise the exception handling. The actual URL csv is much longer, which is why I'd like to pull from a csv rather than type out an array in my Python script.

https://www.benzinga.com/quote/TSLA
https://www.benzinga.com/quote/AAPL
https://www.benzinga.com/quote/XXXX
https://www.benzinga.com/quote/SNAP

Here is my Python code to extract three pieces of data: share count, market cap, and P/E ratio. The script works fine outside of concurrent.futures.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import csv
import concurrent.futures
from random import randint
from time import sleep

options = webdriver.ChromeOptions()
#options.add_argument("--headless") #optional headless
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ['enable-automation'])
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=options, executable_path=r'D:\SeleniumDrivers\Chrome\chromedriver.exe')
driver.execute_cdp_cmd('Network.setUserAgentOverride',{"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'})

OutputFile = open('CSVoutput.csv', 'a')
urlList = []

with open('inputURLS.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        urlList.append(row[0])
    print(urlList) # make array visible in viewer

def extract(theURLS):
    for i in urlList:
        driver.get(i)
        sleep(randint(3, 10)) # random pause
        try:
            bz_shares = driver.find_element_by_css_selector('div.flex:nth-child(10) > div:nth-child(2)').text #get shares number
            print(bz_shares) # to see in viewer
            OutputFile.write(bz_shares) # save number to csv output
        except NoSuchElementException:
            print("N/A") # print N/A if stock does not exist
            OutputFile.write("N/A") # save non value to csv output
        try:
            bz_MktCap = driver.find_element_by_css_selector('div.flex:nth-child(5) > div:nth-child(2)').text #get market cap
            print(bz_MktCap) # to see in viewer
            OutputFile.write("," + bz_MktCap) # save market cap to csv output
        except NoSuchElementException:
            print("N/A") # print N/A if no value
            OutputFile.write(",N/A") # save non value to csv output
        try:
            bz_PE = driver.find_element_by_css_selector('div.flex:nth-child(8) > div:nth-child(2)').text #get PE ratio
            print(bz_PE) # to see in viewer
            OutputFile.write("," + bz_PE) # save PE ratio to csv output
        except NoSuchElementException:
            print("N/A") # print N/A if no value
            OutputFile.write(",N/A") # save non value to csv output
        print(driver.current_url) # see URL screen in viewer
        OutputFile.write("," + driver.current_url + "\n") # save URL to csv output

        return theURLS

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(extract, urlList)

When I run the script I get the following results in my output file:

963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA

So the script is looping through my csv file, but it's stuck on the first row. I get 4 rows of data back, which matches the number of URLs I'm starting with, but I only get data back for the first URL. If I had 8 URLs, the same thing happens 8 times, etc. I don't think I'm looping correctly through the `urlList` array in my function. I'd appreciate any assistance fixing this. I put this together from various sites and YouTube videos on concurrent.futures, but I'm totally stuck. Thanks so much!
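My understanding from the tutorials is that `executor.map` calls the function once for each item of the iterable, handing it a single element per call, so I expected `extract` to receive one URL at a time. This toy example (the function name `show` is just for illustration) shows the behavior I expected:

import concurrent.futures

def show(item):
    # map() hands each element of the iterable to the function, one per call
    print("worker got:", item)
    return item

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(show, ["a", "b", "c"]))

print(results) # ['a', 'b', 'c'], one result per input item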

RTaylor
  • Your `extract()` should be a method applied to each item in `urlList`, not one that accepts the list itself. – JonSG May 28 '21 at 15:57
  • You will also need a `threading.Lock()` when doing your write. see: https://stackoverflow.com/questions/33107019/multiple-threads-writing-to-the-same-csv-in-python – JonSG May 28 '21 at 16:02
  • You would need a separate driver/browser for each thread anyway. You could, for instance, break your list into 4 parts and run those concurrently, using a new webdriver instance for each. – pcalkins May 28 '21 at 19:01
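Putting the three comments above together, a minimal sketch of the restructured script might look like the following. It is only a sketch: the CSS selectors and the chromedriver path are copied from the question, the helper names (`get_driver`, `grab`) are illustrative, and for brevity the per-thread Chrome instances are never explicitly quit.

import csv
import threading
import concurrent.futures
from random import randint
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

write_lock = threading.Lock()    # serializes writes to the shared output file
thread_local = threading.local() # holds one driver per worker thread

def get_driver():
    # Lazily create one Chrome instance per thread and reuse it across URLs.
    if not hasattr(thread_local, "driver"):
        options = webdriver.ChromeOptions()
        options.add_argument("start-maximized") # other options from the question omitted for brevity
        thread_local.driver = webdriver.Chrome(options=options,
            executable_path=r'D:\SeleniumDrivers\Chrome\chromedriver.exe') # path from the question
    return thread_local.driver

def grab(driver, selector):
    # Return the element's text, or "N/A" when the element is missing.
    try:
        return driver.find_element_by_css_selector(selector).text
    except NoSuchElementException:
        return "N/A"

def extract(url): # one URL per call; executor.map supplies the items
    driver = get_driver()
    driver.get(url)
    sleep(randint(3, 10)) # random pause, as in the question
    shares = grab(driver, 'div.flex:nth-child(10) > div:nth-child(2)') # shares outstanding
    mktcap = grab(driver, 'div.flex:nth-child(5) > div:nth-child(2)')  # market cap
    pe = grab(driver, 'div.flex:nth-child(8) > div:nth-child(2)')      # P/E ratio
    with write_lock: # only one thread appends to the file at a time
        with open('CSVoutput.csv', 'a', newline='') as out:
            csv.writer(out).writerow([shares, mktcap, pe, url])
    return url

with open('inputURLS.csv') as f:
    urlList = [row[0] for row in csv.reader(f) if row]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(extract, urlList)

The key changes: `extract` now takes a single URL and uses it (the original ignored its argument, looped over the global `urlList`, and hit the `return` sitting inside the loop after the first iteration, which is why each of the four calls produced the same TSLA row); each worker thread lazily builds its own driver via `threading.local()`, since one Chrome instance can't safely serve several threads at once; and a `threading.Lock` serializes the file writes so rows from different threads don't interleave.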
