Python noob here, so I'll try to provide as much detail as I can. I'm experimenting with Python's concurrent.futures module to see if I can speed up some scraping with Selenium. I'm scraping financial data from a site using the following URLs, stored in a CSV file titled "inputURLS.csv". I've kept the list of stocks short and included one fake ticker to exercise the exception handling. The real URL CSV is much longer, which is why I want to pull from a file rather than type out an array in my Python script.
https://www.benzinga.com/quote/TSLA
https://www.benzinga.com/quote/AAPL
https://www.benzinga.com/quote/XXXX
https://www.benzinga.com/quote/SNAP
Here is my Python code, which extracts three pieces of data: share number, market cap, and PE ratio. The script works fine outside of concurrent.futures.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import csv
import concurrent.futures
from random import randint
from time import sleep
options = webdriver.ChromeOptions()
#options.add_argument("--headless") #optional headless
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ['enable-automation'])
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=options, executable_path=r'D:\SeleniumDrivers\Chrome\chromedriver.exe')
driver.execute_cdp_cmd('Network.setUserAgentOverride',{"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'})
OutputFile = open('CSVoutput.csv', 'a')
urlList = []
with open('inputURLS.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        urlList.append(row[0])
print(urlList)  # make array visible in viewer
def extract(theURLS):
    for i in urlList:
        driver.get(i)
        sleep(randint(3, 10))  # random pause
        try:
            bz_shares = driver.find_element_by_css_selector('div.flex:nth-child(10) > div:nth-child(2)').text  # get shares number
            print(bz_shares)  # to see in viewer
            OutputFile.write(bz_shares)  # save number to csv output
        except NoSuchElementException:
            print("N/A")  # print N/A if stock does not exist
            OutputFile.write("N/A")  # save non value to csv output
        try:
            bz_MktCap = driver.find_element_by_css_selector('div.flex:nth-child(5) > div:nth-child(2)').text  # get market cap
            print(bz_MktCap)  # to see in viewer
            OutputFile.write("," + bz_MktCap)  # save market cap to csv output
        except NoSuchElementException:
            print("N/A")  # print N/A if no value
            OutputFile.write(",N/A")  # save non value to csv output
        try:
            bz_PE = driver.find_element_by_css_selector('div.flex:nth-child(8) > div:nth-child(2)').text  # get PE ratio
            print(bz_PE)  # to see in viewer
            OutputFile.write("," + bz_PE)  # save PE ratio to csv output
        except NoSuchElementException:
            print("N/A")  # print N/A if no value
            OutputFile.write(",N/A")  # save non value to csv output
        print(driver.current_url)  # see URL screen in viewer
        OutputFile.write("," + driver.current_url + "\n")  # save URL to csv output
        return theURLS
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(extract, urlList)
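For reference, my understanding from the docs is that executor.map hands each item of urlList to extract one call at a time, so (ignoring the threading) the calls should be roughly equivalent to this sketch:

for url in urlList:
    extract(url)  # each worker call receives a single URL as theURLS

In other words, theURLS should already be a single URL on each call, not the whole list.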
When I run the script, I get the following results in my output file:
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
963.3M,602.9B,624.6,https://www.benzinga.com/quote/TSLA
So the script is looping, but it's stuck on the first row of my CSV file. I get four rows of data back, matching the number of URLs I start with, but all of them are for the first URL. If I have 8 URLs, the same thing happens 8 times, and so on. I don't think I'm looping correctly through the urlList array in my function. I'd appreciate any help fixing this. I put this together from various sites and YouTube videos on concurrent.futures, but I'm totally stuck. Thanks so much!
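Edit: after more reading, I suspect two things: extract should handle just the one URL that executor.map passes in (instead of looping over urlList itself), and each thread probably needs its own driver, since from what I've read a single Chrome session can't safely be shared across threads. Below is my untested attempt at a fix; make_driver, write_lock, and max_workers=4 are just names and numbers I picked, and the lock is there because several threads appending to the same file at once seems risky:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import concurrent.futures
import csv
import threading
from random import randint
from time import sleep

write_lock = threading.Lock()  # my guess: serialize file writes across threads

def make_driver():
    # helper name is mine: give every thread its own Chrome session
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("--disable-extensions")
    return webdriver.Chrome(options=options, executable_path=r'D:\SeleniumDrivers\Chrome\chromedriver.exe')

def extract(url):  # one URL per call, supplied by executor.map
    driver = make_driver()
    try:
        driver.get(url)
        sleep(randint(3, 10))  # random pause
        fields = []
        for selector in ('div.flex:nth-child(10) > div:nth-child(2)',  # shares
                         'div.flex:nth-child(5) > div:nth-child(2)',   # market cap
                         'div.flex:nth-child(8) > div:nth-child(2)'):  # PE ratio
            try:
                fields.append(driver.find_element_by_css_selector(selector).text)
            except NoSuchElementException:
                fields.append("N/A")  # keep column alignment when a value is missing
        with write_lock:
            with open('CSVoutput.csv', 'a') as out:
                out.write(",".join(fields) + "," + url + "\n")
    finally:
        driver.quit()  # close this thread's browser even if something fails

with open('inputURLS.csv') as f:
    urlList = [row[0] for row in csv.reader(f) if row]  # skip blank rows

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(extract, urlList)

If this is roughly the right idea, I'd still like to know whether the lock is the right way to handle the output file, or whether collecting the return values from executor.map and writing them all at the end would be cleaner.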