I'm trying to scrape data from an HTML table that has interactive controls for stepping through different time periods. An example table is located at this URL: http://quotes.freerealtime.com/dl/frt/M?IM=quotes&type=Time%26Sales&SA=quotes&symbol=IBM&qm_page=45750.

I'd like to start at 9:30 and then step the table forward 1 minute at a time, exporting all of the data to a DataFrame. I've tried pandas.read_html() and BeautifulSoup, but neither works for me, though I am inexperienced with BeautifulSoup. Is my request possible, or has the website protected this information from web scraping? Any help would be appreciated!

Evy555

2 Answers

1

The page is quite dynamic (and terribly slow, at least on my side): it relies on JavaScript and multiple asynchronous requests to get the data. Approaching that with requests would not be easy, and you might need to fall back to browser automation via, for example, selenium.

Here is something for you to get started. Note the use of Explicit Waits here and there:

import io
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.maximize_window()
driver.get("http://quotes.freerealtime.com/dl/frt/M?IM=quotes&type=Time%26Sales&SA=quotes&symbol=IBM&qm_page=45750")

wait = WebDriverWait(driver, 400)  # 400 seconds timeout

# wait for select element to be visible
time_select = Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[name=time]"))))

# select 9:30 and go (the button is clicked through JavaScript)
time_select.select_by_visible_text("09:30")
# find_element(By.ID, ...) replaces find_element_by_id, which was removed in Selenium 4
driver.execute_script("arguments[0].click();", driver.find_element(By.ID, "go"))
time.sleep(2)

while True:
    # wait for the table to appear and load it into pandas DataFrames
    table = wait.until(EC.presence_of_element_located((By.ID, "qmmt-time-and-sales-data-table")))
    # read_html returns a list of DataFrames; StringIO avoids the literal-HTML deprecation warning
    tables = pd.read_html(io.StringIO(table.get_attribute("outerHTML")))
    print(tables[0])

    # wait for the offset select to be present and forward it 1 min
    offset_select = Select(wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "select[name=timeOffset]"))))
    offset_select.select_by_value("1")

    time.sleep(2)

    # TODO: think of a break condition

Note that this runs very slowly on my machine, and I am not sure how well it would run on yours. It keeps advancing 1 minute at a time in an endless loop, so you will need to add a stopping condition at some point.
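One possible break condition, as a rough sketch rather than a tested solution for this site: stop once the table stops changing between steps, with a hard cap as a safety net. The helper name `collect_until_repeat`, its `fetch_html` callback, and the `max_pages` cap are all hypothetical, and the assumption that the table stays unchanged once the data runs out has not been checked against this page. A fake fetcher stands in for the Selenium step so the snippet runs without a browser.

```python
from typing import Callable, List


def collect_until_repeat(fetch_html: Callable[[], str], max_pages: int = 400) -> List[str]:
    """Call fetch_html() repeatedly and collect each snapshot.

    Stops when a snapshot equals the previous one (the table stopped
    changing) or when max_pages snapshots have been collected.
    max_pages is an arbitrary safety cap, not a value taken from the site.
    """
    pages: List[str] = []
    for _ in range(max_pages):
        html = fetch_html()
        if pages and html == pages[-1]:
            break  # table did not change: assume there is no more data
        pages.append(html)
    return pages


# Stand-in for the Selenium step (hypothetical data, not real page output).
_fake_pages = iter([
    "<table>09:30</table>",
    "<table>09:31</table>",
    "<table>09:31</table>",
])
snapshots = collect_until_repeat(lambda: next(_fake_pages))
print(len(snapshots))  # 2
```

In the real loop, `fetch_html` would advance the offset select, wait, and return `table.get_attribute("outerHTML")`; each snapshot could then be passed to `pd.read_html` and the results combined with `pd.concat`.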

alecxe
  • Thank you! I am having an error while running this. Message: 'geckodriver' executable needs to be in PATH – Evy555 Jan 11 '17 at 22:21
  • @Evy555 yeah, that's a [common problem with current selenium/firefox](http://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path). – alecxe Jan 11 '17 at 22:52
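For the geckodriver error in the comments above, a quick way to check whether the executable is visible before starting the browser is sketched below. It only uses the standard library's `shutil.which`; the helper name `find_driver` is hypothetical, and the printed advice is a general suggestion rather than the fix from the linked question.

```python
import shutil
from typing import Optional


def find_driver(name: str = "geckodriver") -> Optional[str]:
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH; download it and add its folder to PATH")
else:
    print("geckodriver found at", path)
```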
0

This page is rendered by JavaScript. If you disable JavaScript in your browser, the page looks like this:

[Image: the page as rendered with JavaScript disabled]

requests and pandas only handle the raw HTML; they do not execute JavaScript.
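To illustrate the point, here is a small sketch using only the standard library's `html.parser`. The HTML string is a made-up stand-in (not the real page source) in which a script would fill the table in a browser. A static parser, like the one behind `pd.read_html`, never runs that script, so it sees an empty table. The names `RowCounter`, `count_rows`, and `RAW_HTML` are hypothetical.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html: str) -> int:
    parser = RowCounter()
    parser.feed(html)
    return parser.rows


# Hypothetical page: the table is empty until JavaScript runs in a browser.
RAW_HTML = """
<html><body>
<table id="data"></table>
<script>
  // A browser would execute this and add a row; an HTML parser never does.
  document.getElementById("data").insertRow().insertCell().textContent = "09:30";
</script>
</body></html>
"""

print(count_rows(RAW_HTML))  # 0
```

This is why a browser-driven tool such as selenium is needed here: it lets the scripts run before the table HTML is read.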

宏杰李