Can't scrape nested html using BeautifulSoup

Question

I am interested in scraping "0.449" from the following source code at http://hdsc.nws.noaa.gov/hdsc/pfds/pfds_map_cont.html?Lat=33.146425&Lon=-87.5805543.

<td class="tblInner" id="0-0">
    <div style="font-size:110%">
        <b>0.449</b>
    </div>
    "(0.364-0.545)"
</td>

Using BeautifulSoup, I have written:

storm=soup.find("td",{"class":"tblInner","id":"0-0"})

which results in:

<td class="tblInner" id="0-0">-</td>

I am unsure why everything nested within the td is not showing up. When I inspect the contents of the td, the result is simply "-". How can I scrape the value that I want from this page?

You may refer to this answer http://stackoverflow.com/questions/8960288/get-page-generated-with-javascript-in-python — Mani, Apr 26 '16 at 13:19

score 0 · Accepted Answer · edited Jun 08 '21 at 22:46

You are likely scraping a website that uses JavaScript to update the DOM after the initial load. The HTML that requests downloads is the page before any script has run, which is why the td contains only the "-" placeholder.

You have three options:

Find out where the JavaScript that fills the page gets its data, and request that source instead. The data most likely comes from an API that you can call directly, for example with curl. This is the best approach in most cases.
Use a headless browser (zombie.js, ...) to retrieve the HTML after the JavaScript has changed it. This is convenient and fast, but there are few tools for it in Python (search for "python headless browser").
Use selenium or splinter to remote-control a real browser (Chrome, Firefox, ...). This is convenient and works in Python, but it is very slow.

Edit:

I did not see that you posted the url you wanted to scrape.

In your particular case, the data you want comes from an AJAX call to this URL:

http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds

You now only need to understand what each parameter does, and parse the output of that instead of writing an HTML scraper.
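As a rough sketch of that idea, the query string can be built from named parameters instead of being hard-coded. The parameter names and default values below are copied from the URL above. The helper name build_pfds_url is made up for illustration, and rounding the coordinates to four decimals is an assumption based on how they appear in that URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the AJAX call shown above.
BASE_URL = "http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py"


def build_pfds_url(lat, lon, data="depth", units="english", series="pds"):
    """Build the request URL (hypothetical helper, not part of any library)."""
    params = {
        # Assumption: four decimals, matching the coordinates in the URL above.
        "lat": f"{lat:.4f}",
        "lon": f"{lon:.4f}",
        "type": "pf",
        "data": data,
        "units": units,
        "series": series,
    }
    # urlencode keeps the insertion order of the dict.
    return BASE_URL + "?" + urlencode(params)


print(build_pfds_url(33.146425, -87.5805543))
```

Changing one argument at a time and comparing the responses is a simple way to work out what each parameter controls.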

score 0 · Answer 2 · answered Apr 26 '16 at 13:25

Please excuse the lack of error checking and modularity, but this should get you what you need, based on @Eloims's observation:

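import ast
import re

import requests

url = 'http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds'

r = requests.get(url)
response = r.text

# The response is JavaScript-like text; grab the list assigned to `quantiles`.
coord_list_text = re.search(r'quantiles = (.*);', response)
# literal_eval parses the list literal without executing arbitrary code, unlike eval.
coord_list = ast.literal_eval(coord_list_text.group(1))

print(coord_list[0][0])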
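
Since the snippet above skips error checking, here is a minimal sketch of the parsing step as a function that fails with a clear message when the expected assignment is missing. The function name extract_quantiles is made up, and the sample string is invented to mimic the assumed "quantiles = [...];" format; it is not real NOAA output. It uses ast.literal_eval on the assumption that the payload is a plain list literal.

```python
import ast
import re


def extract_quantiles(text):
    """Return the nested list assigned to `quantiles` in the response text."""
    match = re.search(r"quantiles = (.*);", text)
    if match is None:
        raise ValueError("no 'quantiles = ...;' assignment found in response")
    # literal_eval only accepts Python literals, so it cannot run arbitrary code.
    return ast.literal_eval(match.group(1))


# Invented sample in the assumed response format, not real NOAA output.
sample = "other = 'x';\nquantiles = [['0.449', '1.0'], ['2.0', '3.0']];\n"
print(extract_quantiles(sample)[0][0])
```

With real data you would pass r.text from the request instead of the sample string.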
