I am trying to scrape some data from Yahoo. I have written a script which works - some of the time. Sometimes when I run the script, I am able to download the complete page - at other times, the page loads only partially, with the data portion missing.

What is even more perplexing is that when I navigate to that page in my browser, the entire page is shown.

Here is the gist of my code:

import dryscrape
from bs4 import BeautifulSoup

url = 'http://finance.yahoo.com/quote/SPY/options?p=SPY&straddle=false'

sess = dryscrape.Session()

sess.set_header('user-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0')

sess.set_attribute('auto_load_images', False)          
sess.set_timeout(360)

sess.visit(url)

soup = BeautifulSoup(sess.body(), 'lxml')

# Related to memory leak issue in webkit
sess.reset()

# Barfs (sometimes!) at the line below
sel_list = soup.find('select', class_='Fz(s)')

if sel_list is None or len(sel_list) == 0:
    print('element not found on page!')

I have attached images of the pages fetched below. Here is the web page, when viewed over the internet, via a web browser:

Page showing data

Now, here is the page I pulled down via a script similar to the one shown above - and it has no data!:

Page downloaded by scraping - no data!

Can anyone work out why the element is sometimes missing when the data is fetched by my script? Equally (more?) importantly, how may I fix this?

  • It may be downloading a bunch of data using Javascript, and your script doesn't run Javascript. Try disabling Javascript in your browser and see if the browser still gets the data. – LarsH Jan 25 '17 at 20:25
  • Have you tried adding a small delay between pulling down the url and loading the source into bs4? –  Jan 25 '17 at 20:34

1 Answer

You may need to wait for the data to load before parsing it with BeautifulSoup. In dryscrape, waiting can be done via the wait_for() method:

sess.visit(url)

# waiting for the first data row in a table to be present
sess.wait_for(lambda: sess.at_css("tr.data-row0"))

soup = BeautifulSoup(sess.body(), 'lxml')

Or, a shot in the dark: it might also be a temporary (network?) issue, and you may work around it by refreshing the page in a loop until you see the result, something along these lines:

from dryscrape.mixins import WaitTimeoutError 

ATTEMPTS_COUNT = 5
attempts = 0

while attempts < ATTEMPTS_COUNT:
    sess.visit(url)

    try:
        # waiting for the first data row in a table to be present
        sess.wait_for(lambda: sess.at_css("tr.data-row0"))
        break
    except WaitTimeoutError:
        print("Data row has not appeared, retrying...")
        attempts += 1

soup = BeautifulSoup(sess.body(), 'lxml')
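
As a side note, the visit-then-wait retry pattern above can be factored into a small, reusable helper that is independent of dryscrape. This is just a sketch of the idea - the retry() name and its parameters are my own, not part of any library:

```python
import time

def retry(action, check, attempts=5, delay=1.0):
    """Run `action`, then `check`; retry up to `attempts` times.

    `action` re-fetches the page; `check` returns True once the
    expected element is present. Returns True on success, False
    if the element never appeared.
    """
    for _ in range(attempts):
        action()
        if check():
            return True
        time.sleep(delay)  # brief pause before re-fetching
    return False
```

With dryscrape it could then be wired up roughly like this (untested):

    ok = retry(lambda: sess.visit(url),
               lambda: sess.at_css("tr.data-row0") is not None)
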