Trying to scrape table using Pandas from Selenium's result

Question

I am trying to scrape a table from a JavaScript-rendered website into Pandas. I use Selenium to reach the desired page first. I can print the table as text (see the commented-out lines in the script), but I also want the table as a Pandas DataFrame. My script is below; I hope someone can help me figure this out.

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'

page = driver.get(url)
time.sleep(2)


driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()


driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)

driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)

target = driver.find_elements_by_id('bm_equities_prices_table')
##for data in target:
##    print (data.text)

for data in target:
    dfs = pd.read_html(target,match = '+')
for df in dfs:
    print (df)

Running the above script, I get the error below:

Traceback (most recent call last):
  File "E:\Coding\Python\BS_Bursa Properties\Selenium_Pandas_Bursa Properties.py", line 29, in <module>
    dfs = pd.read_html(target,match = '+')
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 728, in _parse
    compiled_match = re.compile(match)  # you can pass a compiled regex here
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 233, in compile
    return _compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0

I've also tried calling pd.read_html on the URL directly, but it returned a "No Table Found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1.

ksai · Answer 1 · 2017-08-01T04:26:12.317

You can get the table using the following code:

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'

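driver.get(url)  # get() returns None, so there is nothing useful to assign
time.sleep(2)    # crude wait for the JavaScript-rendered table to load

# read_html returns a list of DataFrames; the price table is the first one
df = pd.read_html(driver.page_source)[0]
print(df.head())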

This is the output

   No    Code         Name  Rem  Last Done   LACP     Chg   % Chg  Vol ('00)  Buy Vol ('00)    Buy   Sell  Sell Vol ('00)   High    Low
0   1  5284CB   LCTITAN-CB    s      0.025  0.020   0.005  +25.00     406550          19878  0.020  0.025          106630  0.025  0.015
1   2    1201  SUMATEC [S]    s      0.050  0.050       -       -     389354          43815  0.050  0.055          187301  0.055  0.050
2   3    5284  LCTITAN [S]    s      4.470  4.700  -0.230   -4.89     367335            430  4.470  4.480              34  4.780  4.140
3   4    0176    KRONO [S]    -      0.875  0.805   0.070   +8.70     300473           3770  0.870  0.875             797  0.900  0.775
4   5  5284CE   LCTITAN-CE    s      0.130  0.135  -0.005   -3.70     292379           7214  0.125  0.130              50  0.155  0.100

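To get data from all pages, crawl the remaining pages and combine the per-page frames with df.append (or pd.concat on pandas 2.0 and later, where DataFrame.append no longer exists).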
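
A minimal sketch of that combining step using pd.concat, which is available in both older and current pandas. The two small frames stand in for scraped pages and are built by hand purely for illustration (a few values borrowed from the output above); page_1, page_2 and all_pages are hypothetical names, not part of the original script.

```python
import pandas as pd

# Stand-ins for the DataFrames scraped from two result pages (illustrative data only).
page_1 = pd.DataFrame({"Code": ["5284CB", "1201"], "Last Done": [0.025, 0.050]})
page_2 = pd.DataFrame({"Code": ["5284", "0176"], "Last Done": [4.470, 0.875]})

# Collect the per-page frames in a list, then concatenate once at the end.
frames = [page_1, page_2]
all_pages = pd.concat(frames, ignore_index=True)  # ignore_index renumbers rows 0..n-1

print(all_pages)
```

Collecting frames in a list and concatenating once is usually cheaper than growing a DataFrame inside the loop.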

Thank you very much for pointing out the solution. Your suggestion works great! Would you mind explaining what the [0] is for in read_html? I searched the read_html documentation but couldn't find an explanation. - Eric Choi, Aug 01 '17 at 08:17
Because two tables are returned and the one you want is the first. You can see the two tables by indexing the list that `pd.read_html(driver.page_source)` returns with `[0]` and `[1]`. - ksai, Aug 01 '17 at 08:20
@EricChoi I recommend reading about `pd.read_html()`; it returns a list of DataFrames. - ksai, Aug 01 '17 at 12:51
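
To illustrate the point in the comments, here is a small sketch with a made-up HTML snippet containing two tables; it is not the Bursa Malaysia page, and the column names are only placeholders. It assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page source with two <table> elements (illustrative only).
html = """
<table>
  <thead><tr><th>Code</th><th>Name</th></tr></thead>
  <tbody><tr><td>A1</td><td>FIRST</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Page</th></tr></thead>
  <tbody><tr><td>1</td></tr></tbody>
</table>
"""

# read_html returns one DataFrame per <table> it finds, in document order.
tables = pd.read_html(StringIO(html))
first = tables[0]   # the table of interest
second = tables[1]  # the other table on the page

print(len(tables))
print(first)
```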

score 5 · Answer 2 · edited Jan 24 '20 at 14:32

Answer:

df = pd.read_html(target[0].get_attribute('outerHTML'))

Reason for target[0]:

driver.find_elements_by_id('bm_equities_prices_table') returns a list of Selenium WebElements. In your case the list holds only one element, hence [0].

Reason for get_attribute('outerHTML'):

We want the HTML of the element. get_attribute can return two such properties: 'innerHTML' and 'outerHTML'. We chose 'outerHTML' because it includes the element's own tag, presumably the <table> tag that pd.read_html looks for, rather than only the element's inner contents.

Reason for df[0]:

pd.read_html() returns a list of DataFrames; the first one is the result we want, hence df[0].
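
A note on the traceback in the question: it shows pandas calling re.compile(match), so match='+' is treated as a regular expression, and a bare + has nothing to repeat. The sketch below reproduces that error with the standard library alone and shows one possible fix, escaping the pattern; whether a match filter is needed at all for this page is a separate question.

```python
import re

# A bare "+" is a quantifier with nothing before it, so compiling it fails.
try:
    re.compile("+")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Escaping turns it into a pattern that matches a literal plus sign.
escaped = re.escape("+")
pattern = re.compile(escaped)
found = pattern.search("+25.00") is not None

print(escaped, found)
```

With that in mind, passing match=re.escape('+') (or simply omitting match, as in the line above) would avoid the "nothing to repeat" error.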
