
I am getting the header values of the HTML table below using lxml, but when I try to parse the contents of the td's inside the tr's in tbody using XPath, I get an empty result because the data is generated dynamically. Below is my Python code with the output I am getting. How can I get the values?

<table id="datatabl" class="display compact cell-border dataTable no-footer" role="grid" aria-describedby="datatabl_info">
     <thead>
        <tr role="row">
              <th class="dweek sorting_desc" tabindex="0" aria-controls="datatabl" rowspan="1" colspan="1" style="width: 106px;" aria-label="Week: activate to sort column ascending" aria-sort="descending">Week</th>
   <th class="dnone sorting" tabindex="0" aria-controls="datatabl" rowspan="1" colspan="1" style="width: 100px;" aria-label="None: activate to sort column ascending">None</th>
        </tr>
     </thead>


     <tbody>
        <tr class="odd" role="row">
            <td class="sorting_1">2016-05-03</td>
            <td>4.27</td>
            <td>21.04</td>
        </tr>
       <tr class="even" role="row">
            <td class="sorting_1">2016-04-26</td>
            <td>4.24</td>
            <td>95.76</td>
           <td>21.04</td>
       </tr>
       </tbody>
</table>

My Python code

   from lxml import etree
   import urllib

   web = urllib.urlopen("http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx")
   s = web.read()

   html = etree.HTML(s)

   ## Get all 'tr'
   tr_nodes = html.xpath('//table[@id="datatabl"]/thead')
   print tr_nodes

   ## 'th' is inside first 'tr'
   header = [i[0].text for i in tr_nodes[0].xpath("tr")]
   print header

   ## tbody
   tr_nodes_content = html.xpath('//table[@id="datatabl"]/tbody')
   print tr_nodes_content

   td_content = [[td[0].text for td in tr.xpath('td')] for tr in tr_nodes_content[0]]
   print td_content

output in terminal:

    [<Element thead at 0xb6b250ac>]
    ['Week']
    [<Element tbody at 0xb6ad20cc>]
    []
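Note that the XPath itself works: run against a static copy of the rows (with the tbody actually present in the source, unlike in the downloaded live page), the same expressions do extract the values, which confirms the empty result comes from the rows being injected by JavaScript after the page is fetched. A minimal demonstration:

```python
from lxml import etree

# Static copy of the rows from the snippet above -- here the tbody
# really is in the parsed source.
snippet = """
<table id="datatabl">
  <tbody>
    <tr><td class="sorting_1">2016-05-03</td><td>4.27</td><td>21.04</td></tr>
    <tr><td class="sorting_1">2016-04-26</td><td>4.24</td><td>95.76</td><td>21.04</td></tr>
  </tbody>
</table>
"""

html = etree.HTML(snippet)
rows = html.xpath('//table[@id="datatabl"]/tbody/tr')
cells = [[td.text for td in row.xpath('td')] for row in rows]
print(cells)
# [['2016-05-03', '4.27', '21.04'], ['2016-04-26', '4.24', '95.76', '21.04']]
```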
  • Can you separate the HTML code from the Python code (and also format it correctly please) – Adib May 06 '16 at 20:34
  • The data is dynamically generated, also tbody is usually added by the browser so not actually in the source, `.xpath('//table[@id="datatabl"]/thead//text()')` would get the header – Padraic Cunningham May 06 '16 at 20:44
  • yes @PadraicCunningham – shanky May 06 '16 at 20:44
  • The issue appears to be that the table is generated via javascript. Instead you need to do the ajax call that is grabbing the data. It appears to be a POST call to `http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM`. Here's a question about how one might do it: http://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax – Ringil May 06 '16 at 20:51

2 Answers


The data is dynamically loaded from the http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM endpoint. One option would be to try to mimic that request and get the data from the JSON response.

Or, you can stay on a high-level and solve it via selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.maximize_window()

wait = WebDriverWait(driver, 10)

url = 'http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx'
driver.get(url)

# wait for the table to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#datatabl tr[role=row]")))
rows = driver.find_elements_by_css_selector("table#datatabl tr[role=row]")[1:]

for row in rows:
    cells = row.find_elements_by_tag_name("td")
    print(cells[2].text)

driver.close()

Prints the contents of the D0-D4 column:

33.89
39.64
39.28
39.20
...
36.74
38.45
43.61

This will get the data from the ajax request in json format:

import json

import requests
from pprint import pprint as pp

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36',
    'Content-Type': 'application/json',
    'Referer': 'http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx',
    'X-Requested-With': 'XMLHttpRequest',
}

data = json.dumps({'area': 'conus', 'type': 'conus', 'statstype': '1'})

ajax = requests.post("http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM",
                     data=data,
                     headers=headers)

pp(ajax.json())

A snippet of the output:

{u'd': [{u'D0': 33.89,
         u'D1': 14.56,
         u'D2': 5.46,
         u'D3': 3.44,
         u'D4': 1.11,
         u'Date': u'2016-05-03',
         u'FileDate': u'20160503',
         u'None': 66.11,
         u'ReleaseID': 890,
         u'__type': u'DroughtMonitorData.DmData'},
        {u'D0': 39.64,
         u'D1': 15.38,
         u'D2': 5.89,
         u'D3': 3.44,
         u'D4': 1.11,
         u'Date': u'2016-04-26',
         u'FileDate': u'20160426',
         u'None': 60.36,
         u'ReleaseID': 889,
         u'__type': u'DroughtMonitorData.DmData'},
        {u'D0': 39.28,
         u'D1': 15.44,
         u'D2': 5.94,
         u'D3': 3.44,
         u'D4': 1.11,
         u'Date': u'2016-04-19',
         u'FileDate': u'20160419',
         u'None': 60.72,
         u'ReleaseID': 888,
         u'__type': u'DroughtMonitorData.DmData'},
        {u'D0': 39.2,
         u'D1': 17.75,
         u'D2': 6.1,
         u'D3': 3.76,
         u'D4': 1.71,
         u'Date': u'2016-04-12',
         u'FileDate': u'20160412',
         u'None': 60.8,
         u'ReleaseID': 887,
         u'__type': u'DroughtMonitorData.DmData'},
        {u'D0': 37.86,
         u'D1': 16.71,
         u'D2': 5.95,
         u'D3': 3.76,
         u'D4': 1.71,
         u'Date': u'2016-04-05',
         u'FileDate': u'20160405',
         u'None': 62.14,
         u'ReleaseID': 886,
         u'__type': u'DroughtMonitorData.DmData'},

You can get all the data you want from the returned JSON. If you print(len(ajax.json()["d"])) you will see you get 853 rows back, so you actually seem to get the data from all 35 pages in one go. Even if you did parse the page you would still have to do it 34 more times; getting the JSON from the AJAX request makes it easy to parse, and all from a single POST.
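Since each element of the "d" list is a plain dict, slicing out particular columns is a one-liner. A small sketch, using two rows of the output above as sample data:

```python
# Sample rows in the same shape as ajax.json()["d"] above.
rows = [
    {"Date": "2016-05-03", "None": 66.11, "D0": 33.89, "D1": 14.56,
     "D2": 5.46, "D3": 3.44, "D4": 1.11, "ReleaseID": 890},
    {"Date": "2016-04-26", "None": 60.36, "D0": 39.64, "D1": 15.38,
     "D2": 5.89, "D3": 3.44, "D4": 1.11, "ReleaseID": 889},
]

# Keep just the date and the D0-D4 percentages for each week.
table = [[r["Date"]] + [r["D%d" % i] for i in range(5)] for r in rows]
print(table[0])
# ['2016-05-03', 33.89, 14.56, 5.46, 3.44, 1.11]
```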

To filter by state, we need to set the type to state and the area to CA:

data = json.dumps({'type':'state', 'statstype':'1','area':'CA'})

ajax = requests.post("http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM",
                  data=data,
                  headers=headers)
from pprint import pprint as pp

pp(ajax.json())

Again a short snippet:

{u'd': [{u'D0': 95.73,
         u'D1': 89.68,
         u'D2': 74.37,
         u'D3': 49.15,
         u'D4': 21.04,
         u'Date': u'2016-05-03',
         u'FileDate': u'20160503',
         u'None': 4.27,
         u'ReleaseID': 890,
         u'__type': u'DroughtMonitorData.DmData'},
        {u'D0': 95.76,
         u'D1': 90.09,
         u'D2': 74.37,
         u'D3': 49.15,
         u'D4': 21.04,
         u'Date': u'2016-04-26',
         u'FileDate': u'20160426',
         u'None': 4.24,
         u'ReleaseID': 889,
         u'__type': u'DroughtMonitorData.DmData'},

Which you will see matches what is displayed on the page.
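If you want to query several areas, the request body construction can be factored into a small helper. The function below is just an illustration (the name `dm_payload` is mine, not part of the site's API); the keys mirror the payloads used above:

```python
import json

def dm_payload(area, area_type="state", statstype="1"):
    """Build the JSON body for the ReturnTabularDM endpoint.

    Use area_type='conus' with area='conus' for national data, or
    area_type='state' with a two-letter state code in area.
    """
    return json.dumps({"type": area_type, "statstype": statstype, "area": area})

print(dm_payload("CA"))
```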

  • Hi @PadraicCunningham, if you look at the web page it can also filter the data by state, and in my case I want only California data. Can I get the data based on state? – shanky May 06 '16 at 21:38
  • Thanks just now I saw in my firebug. – shanky May 06 '16 at 21:50
  • No worries, what is most important is `'Content-Type': 'application/json'` and passing the data as JSON. You will get the form HTML back if you don't pass `'Content-Type': 'application/json'`, and an error if you do pass it but don't pass the data as JSON – Padraic Cunningham May 06 '16 at 21:54
  • One more thing I am able to change unicode string to regular however how can I change keys inside curly bracket to regular one like u'D0' to 'D0'? – shanky May 06 '16 at 21:57
  • @Shanky, that is just a unicode u, if you really want str then just encode to utf-8 but `d["D0"]` will work just the same as `d[u"D0"]` – Padraic Cunningham May 06 '16 at 21:58
  • I wanted to change because I will be importing it to excel or csv so I hope it won't bother me while exporting. – shanky May 06 '16 at 22:01
  • 1
    @Shanky, `csv.writer(open("foo.csv", "w")).writerows(map(itemgetter("Date","D0", "D1", "D2", "D3"), ajax.json()["d"]))` will write it no problem – Padraic Cunningham May 06 '16 at 22:10
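Fleshing out that one-liner into something a bit more explicit (the column selection and file name are just examples, and sample rows stand in for ajax.json()["d"]):

```python
import csv
from operator import itemgetter

# Rows in the same shape as ajax.json()["d"] (sample data copied
# from the California output above).
rows = [
    {"Date": "2016-05-03", "None": 4.27, "D0": 95.73, "D1": 89.68,
     "D2": 74.37, "D3": 49.15, "D4": 21.04},
    {"Date": "2016-04-26", "None": 4.24, "D0": 95.76, "D1": 90.09,
     "D2": 74.37, "D3": 49.15, "D4": 21.04},
]

columns = ("Date", "D0", "D1", "D2", "D3", "D4")
with open("drought.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)                           # header row
    writer.writerows(map(itemgetter(*columns), rows))  # one line per week
```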