pandas.read_html returns incomplete table data (rows loaded on scroll are missing)

Question

pandas.read_html only returns the table rows that are present in the HTML page before any scrolling. Rows that the page loads only after scrolling down are missing from the returned list of DataFrames. How can I get it to return the list of DataFrames only after the following steps have run?

1. Scroll to the bottom.
2. Wait for the content to load.
3. If no more content is loading, return.
4. Go to step 1.

My Code:

import pandas as pd

url = 'https://finance.yahoo.com/quote/GOOG/history?period1=1566844200&period2=1598466600&interval=1d&filter=history&frequency=1d'

dfs = pd.read_html(url)

print(dfs[0])

Actual Result:

    Date            Open    High    Low    Close*  Adj Close**  Volume
0   Aug 26, 2020    1608.00 1659.22 1603.60 1652.38 1652.38 3993400
1   Aug 25, 2020    1582.07 1611.62 1582.07 1608.22 1608.22 2247100
2   Aug 24, 2020    1593.98 1614.17 1580.57 1588.20 1588.20 1409900
3   Aug 21, 2020    1577.03 1597.72 1568.01 1580.42 1580.42 1446500
4   Aug 20, 2020    1543.45 1585.87 1538.20 1581.75 1581.75 1706900
...          ...        ...     ...     ...     ...     ...     ...
96  Apr 09, 2020    1224.08 1225.57 1196.73 1211.45 1211.45 2175400
97  Apr 08, 2020    1206.50 1219.07 1188.16 1210.28 1210.28 1975100
98  Apr 07, 2020    1221.00 1225.00 1182.23 1186.51 1186.51 2387300
99  Apr 06, 2020    1138.00 1194.66 1130.94 1186.92 1186.92 2664700
100         *CPA       *CPA    *CPA    *CPA    *CPA    *CPA    *CPA

[101 rows × 7 columns]

Expected Result:

    Date            Open    High    Low    Close*  Adj Close**  Volume
0   Aug 26, 2020    1608.00 1659.22 1603.60 1652.38 1652.38 3993400
1   Aug 25, 2020    1582.07 1611.62 1582.07 1608.22 1608.22 2247100
2   Aug 24, 2020    1593.98 1614.17 1580.57 1588.20 1588.20 1409900
3   Aug 21, 2020    1577.03 1597.72 1568.01 1580.42 1580.42 1446500
4   Aug 20, 2020    1543.45 1585.87 1538.20 1581.75 1581.75 1706900
...          ...        ...     ...     ...     ...     ...     ...
249 Apr 30, 2019    1224.08 1225.57 1196.73 1211.45 1211.45 2175400
250 Apr 29, 2019    1206.50 1219.07 1188.16 1210.28 1210.28 1975100
251 Apr 27, 2019    1221.00 1225.00 1182.23 1186.51 1186.51 2387300
252 Aug 26, 2019    1138.00 1194.66 1130.94 1186.92 1186.92 2664700
253         *CPA       *CPA    *CPA    *CPA    *CPA    *CPA    *CPA

[253 rows × 7 columns]

The following should solve your issue: https://stackoverflow.com/questions/39218742/using-beautifulsoup-to-search-through-yahoo-finance (comment by BStadlbauer, Oct 26 '20 at 19:59)

score 1 · Answer 1 · answered Oct 26 '20 at 20:10

pandas.read_html only parses the HTML that the server sends initially; it does not run the JavaScript that loads more rows when you scroll. You will have to use something like Selenium, which opens the page in a real browser, and then make that browser instance scroll down until all the data has loaded.

Something like this should help:

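from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get("urlhere")
# Scroll to the bottom to trigger loading of further rows.
browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)  # give the newly requested content time to load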

Each scroll makes the page load more rows; once the whole table has loaded, you can fetch entries from it using basic Selenium code like

from selenium.webdriver.common.by import By
elems = browser.find_elements(By.CLASS_NAME, "thelementsyouwant")
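The scroll, wait, check loop from the question can be sketched roughly as below. This is a minimal sketch, not a tested scraper for Yahoo Finance: `scroll_to_end`, `pause` and `max_rounds` are names made up for illustration, and it assumes the page grows `document.body.scrollHeight` whenever it loads more rows, which may not hold for every site.

```python
import time


def scroll_to_end(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        # Step 1: scroll to the bottom.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        # Step 2: wait for the content to load (fixed pause, an assumption).
        time.sleep(pause)
        # Step 3: if the height did not change, nothing more was loaded.
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        # Step 4: otherwise go around again.
        last_height = new_height
    return max_rounds  # safety cap so an endless feed cannot loop forever


# Usage with a real browser (needs selenium, pandas and a Firefox driver):
#   from io import StringIO
#   import pandas as pd
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   browser.get(url)
#   scroll_to_end(browser)
#   dfs = pd.read_html(StringIO(browser.page_source))
```

Passing `browser.page_source` to `pd.read_html` reuses the parsing from the question, so the table does not have to be rebuilt element by element. A fixed `pause` is the simplest way to wait; on a slow connection it may need to be longer.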

score 0 · Answer 2 · answered Oct 26 '20 at 20:02

If all you're trying to pull is price data, I'd recommend using yfinance.

import yfinance as yf

goog = yf.Ticker("GOOG")

hist = goog.history(period="max")
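To get the same date range as the URL in the question instead of the full history, the `period1` and `period2` query parameters can be turned into `start` and `end` dates. The sketch below assumes those parameters are Unix timestamps in seconds and converts them to UTC dates; `period_bounds` and `fetch_history` are hypothetical helper names, not part of yfinance.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def period_bounds(url):
    """Return (start, end) as 'YYYY-MM-DD' UTC dates from period1/period2."""
    query = parse_qs(urlparse(url).query)

    def to_date(key):
        seconds = int(query[key][0])  # assumed to be Unix seconds
        stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
        return stamp.strftime("%Y-%m-%d")

    return to_date("period1"), to_date("period2")


def fetch_history(url, symbol="GOOG"):
    """Hypothetical helper: daily prices for the date range in the URL."""
    import yfinance as yf  # imported here so period_bounds works without it

    start, end = period_bounds(url)
    return yf.Ticker(symbol).history(start=start, end=end, interval="1d")
```

For the URL in the question, `period_bounds` gives `("2019-08-26", "2020-08-26")`. As far as I know, yfinance treats `end` as exclusive, so the last day may need to be moved forward by one to include Aug 26, 2020.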

apw-ub
