
I am trying to collect some data from the web programmatically for 6000 stocks, using Python 3.6 and the Selenium Firefox webdriver. [I intended to use BeautifulSoup to parse the HTML, but it seems that every time I update the page the link doesn't change, and BeautifulSoup doesn't cope with JavaScript.]

Anyway, when I run this in a for loop, one specific line in my code, share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)"), goes wrong most of the time (it worked a couple of times though, so I believe my code is good). However, it works fine if I do it manually (copy and paste into Python IDLE and run it). I tried to use time.sleep(4) to allow the page to load before I pull anything out of it, but it seems this is not the solution. Now I'm running out of ideas. Can anyone help me unravel this?

Below is my code:

from selenium import webdriver
import time
import csv
import pyautogui

# output file for the scraped data
filename = "historical_price_marketcap.csv"
f = open(filename, "w")
headers = "stock_ticker, share_price, market_cap\n"
f.write(headers)

driver = webdriver.Firefox()

def get_web():
    driver.get("https://stockrow.com")

# read the ticker symbols, stripping the surrounding quotes
with open("TICKER.csv") as file:
    read = csv.reader(file)
    TICKER = []
    for row in read:
        ticker = row[0][1:-1]
        TICKER.append(ticker)

for Ticker in range(len(TICKER)):
    # search for the ticker and select the 'Stock Price' indicator via screen coordinates
    get_web()
    time.sleep(3)
    pyautogui.click(425, 337)
    pyautogui.typewrite(TICKER[Ticker], 0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(268, 337)
    pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite('Stock Price', 0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)

    # clear and set the start date
    pyautogui.click(702, 427)
    for i in range(10):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-01", 0.25)
    pyautogui.press("enter")
    time.sleep(2)

    # clear and set the end date
    pyautogui.click(882, 425)
    for k in range(10):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-31", 0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(1317, 318)
    for j in range(3):
        pyautogui.press("down")

    # read the share price label off the rendered highcharts SVG
    time.sleep(10)
    share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)")

    # repeat the same steps for 'Market Cap'
    get_web()
    time.sleep(3)
    pyautogui.click(425, 337)
    pyautogui.typewrite(TICKER[Ticker], 0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(268, 337)
    pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite('Market Cap', 0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(702, 427)
    for i in range(10):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-01", 0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(882, 425)
    for k in range(10):
        pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-31", 0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(1317, 318)
    for j in range(3):
        pyautogui.press("down")

    time.sleep(10)
    market_cap = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(28) > text:nth-child(2)")

f.close()

It seems that the two lines bugging me are the find_element_by_css_selector calls, e.g. share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)"). Here is the error message from Python:

 Traceback (most recent call last):
  File "C:\Users\HENGBIN\Desktop\get_historical_data.py", line 65, in <module>
    share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)")
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 457, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 791, in find_element
    'value': value})['value']
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .highcharts-root > g:nth-child(25) > text:nth-child(2)

It doesn't work most of the time in the loop, but works fine if I run it manually in Python IDLE. I don't know what is going on.

1 Answer


There are several things in your script that I'd do differently. First of all, try to get rid of pyautogui. Selenium has built-in functions for clicking (check out this SO question) and for sending all sorts of keys (check out this SO question). Also, when you change the content in the browser with pyautogui, my experience is that selenium will not always be aware of these changes. That could explain your issues with locating, via selenium, elements that only appear after pyautogui's interactions.
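
For example, the ticker search could be done entirely inside selenium along these lines (the CSS selector for the search field is a placeholder - you'd have to find the real one in the page source):

from selenium.webdriver.common.keys import Keys

# hypothetical selector - inspect stockrow.com to find the real search input
search_box = driver.find_element_by_css_selector("input#ticker-search")
search_box.click()
search_box.send_keys("AAPL")
search_box.send_keys(Keys.ENTER)

That way the typing and clicking happen inside the same browser session that selenium is tracking.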

Secondly: your get_web() function could cause problems. Generally speaking, content inside a function has to be returned - or declared global - to be accessible outside the function. The driver that opens your webpage is global (you instantiate it outside the function), but the url inside the function is local, meaning you could have problems accessing the content outside the function. I'd recommend that you get rid of the function (as it really doesn't do anything besides opening the url) and simply replace the function call in your code like so:

for Ticker in range(len(TICKER)):
    driver.get("https://stockrow.com")
    time.sleep(3)
    # insert keys, click and so on...

This should make it possible for you to use selenium's driver.find_element...-methods.
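
For the timing problem you describe, an explicit wait is usually more robust than time.sleep - a rough sketch using your own selector (the 15 second timeout is just an example):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the chart label shows up instead of sleeping for a fixed time
wait = WebDriverWait(driver, 15)
share_price = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, ".highcharts-root > g:nth-child(25) > text:nth-child(2)")))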

Thirdly: I assume that you'd like to extract some data from the site as well. If so, do the parsing with something other than selenium - selenium is a slow parser. You could try BeautifulSoup instead.

Once the site is loaded, you feed the html to BeautifulSoup and extract whatever you want (there's an SO question here that'll show you how to go about that):

from bs4 import BeautifulSoup
.....
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
element_you_want_to_retrieve = soup.find('tag_name', attrs={'key': 'value'})

But with this site, what you really should do is tap into the API calls the site makes on its own. Use Chrome's inspector tool: you'll see that it queries three APIs that you can call directly, avoiding the whole selenium thing.

The url for Apple looks like this:

url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=AAPL'

So with the requests library you could retrieve the content as json like so:

import requests
from pprint import pprint
url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=AAPL'
response = requests.get(url).json()
pprint(response)

This is a much faster solution than selenium.
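
If you go down that road, your whole loop over the tickers could be reduced to something like this (assuming the indicators[]=0 parameter and the JSON layout are what you need - check the responses in the inspector first):

import csv
import requests

# reuse the ticker list from your TICKER.csv (quotes stripped as in your script)
with open("TICKER.csv") as file:
    TICKER = [row[0][1:-1] for row in csv.reader(file)]

for ticker in TICKER:
    url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=' + ticker
    data = requests.get(url).json()
    # inspect `data` here and pull out the fields you want to write to your csv

No browser, no sleeps, and much faster for 6000 stocks.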

jlaur