I'm having problems parsing table data with BeautifulSoup, though I've tried many solutions found here, here, and here. I hate to re-ask but maybe my issue is unique and that is why the above solutions haven't worked, or I'm just an idiot.

So I'm trying to retrieve the flood triggers for any given river from water.weather.gov. I'm using the Mississippi River data because it has the most active measuring stations. Each station has 4 stage triggers that I am trying to obtain: Action, Flood, Moderate, and Major. I have been able to extract the table data for those categories when there are numerical values. However, in cases where the table data is "Not Available" the row is skipped, so when I put the values into their stages they are no longer aligned with the appropriate station trigger.

The table data that I'm trying to extract looks like this:

<div class="box_square">        <b><b>Flood Categories (in feet)</b><br>
</b>
        <table width="150" cellspacing="0" cellpadding="0" border="0">
        <tbody>
            <tr><td nowrap="">Not Available</td></tr>
        </tbody>
        </table><br></div>

<div class="box_square">        <b><b>Flood Categories (in feet)</b><br>
</b>
        <table width="150" cellspacing="0" cellpadding="0" border="0">
        <tbody>
            <tr style="display:'';line-height:20px;background-color:#CC33FF;color:black">
                <td scope="col" nowrap="">Major Flood Stage:</td>
                <td scope="col">18</td>
            </tr>
            <tr style="display:'';line-height:20px;background-color:#FF0000;color:white">
                <td scope="col" nowrap="">Moderate Flood Stage:</td>
                <td scope="col">15</td>
            </tr>
            <tr style="display:'';line-height:20px;background-color:#FF9900;color:black">
                <td scope="col" nowrap="">Flood Stage:</td>
                <td scope="col">13</td>
            </tr>
            <tr style="display:'';line-height:20px;background-color:#FFFF00;color:black">
                <td scope="col" nowrap="">Action Stage:</td>
                <td scope="col">12</td>
            </tr>
            <tr style="display:none;line-height:20px;background-color:#906320;color:white">
                <td scope="col" nowrap="">Low Stage (in feet):</td>
                <td scope="col">-9999</td>
            </tr>
        </tbody>
        </table><br></div>

The last Low Stage isn't necessary and I have filtered it out. Here is the code that I have that will populate alert_list with the appropriate values, but without the necessary Not Available:

alert_list = []
# Collect the text of every <td scope="col"> cell (labels and values alternate).
alerts = soup.find_all('td', attrs={'scope': 'col'})
for alert in alerts:
    alert_list.append(alert.text.strip())

# Every second item is a value; each reporting station contributes five values.
a_values = alert_list[1::2]
alert_list.clear()
major_lvl = a_values[::5]
moderate_lvl = a_values[1::5]
flood_lvl = a_values[2::5]
action_lvl = a_values[3::5]

and the results:

>>> major_lvl
['18', '26', '0', '11', '0', '17', '17', '18', '0', '683', '16', '0', '20', '16', '18', '665', '661', '18', '651', '645', '15.5', '636', '20', '631', '22', '21', '20.5', '21.5', '20', '20', '20.5', '13.5', '18', '18', '20', '18.5', '17', '14', '18', '19', '25', '25', '25', '26', '25', '24', '22', '25', '33', '34', '29', '34', '40', '40', '0', '0', '0', '42', '42', '0', '0', '0', '0', '0', '44', '47', '43', '35', '46', '52', '55', '0', '44', '57', '50', '57', '64', '40', '34', '26', '20']

I just noticed that the reason the Not Available cell isn't getting scraped is that its td tag has no scope="col" attribute, so my filter skips it. How do I add this so that my values line up?

Brian Tompsett - 汤莱恩
Aaron Nelson

2 Answers

If you are only interested in those columns where scope=col, you can use a CSS selector to do this beautifully.

In [23]: from bs4 import BeautifulSoup as BS

In [24]: soup = BS(html, "html.parser")

In [25]: major_list = [td.get_text(strip=True) for td in soup.select("tr > td:nth-of-type(2)[scope=col]")[:-1]]

In [26]: major_list
Out[26]: ['18', '15', '13', '12']

To get every row together with its cells, select the rows first and then retrieve the cell data for each row.

for tr in soup.select("div[class=box_square] tr"):
    print([td.get_text(strip=True) for td in tr.find_all("td")])
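
The row-by-row idea can be taken one step further by grouping rows per station, so a station without data still produces an entry. This is only a sketch: it assumes each station's table sits in its own, properly closed div with class box_square, as in the question's excerpt. The sample HTML and the helper name stages_by_station are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modelled on the question's excerpt (hypothetical data).
SAMPLE_HTML = """
<div class="box_square"><b>Flood Categories (in feet)</b>
  <table><tbody>
    <tr><td nowrap="">Not Available</td></tr>
  </tbody></table></div>
<div class="box_square"><b>Flood Categories (in feet)</b>
  <table><tbody>
    <tr style="line-height:20px"><td scope="col">Major Flood Stage:</td><td scope="col">18</td></tr>
    <tr style="line-height:20px"><td scope="col">Action Stage:</td><td scope="col">12</td></tr>
  </tbody></table></div>
"""


def stages_by_station(html):
    """Return one entry per box_square div: a dict of stages, or 'Not Available'."""
    soup = BeautifulSoup(html, "html.parser")
    stations = []
    for box in soup.select("div.box_square"):
        stages = {}
        for tr in box.select("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) == 2:
                # Label/value row, e.g. ['Major Flood Stage:', '18'].
                stages[cells[0].rstrip(":")] = cells[1]
        # A station with no label/value rows is recorded as a placeholder.
        stations.append(stages or "Not Available")
    return stations


result = stages_by_station(SAMPLE_HTML)
print(result)
```

Because the list has exactly one entry per div, the position in the list identifies the station whether or not it reports data.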
styvane
  • I like your solution for its simplicity, but for some reason I could not produce the same results: `['', 'Home', '18', '15', '13', '12']` and also I'm still unsure how to include the **Not Available** tag from the stations that aren't reporting data. Thanks! – Aaron Nelson May 31 '17 at 14:45
  • Thanks for the edit. That worked perfectly too so I up-voted. I gave Bill the solution vote since it will be easier to assign the **not available** tag to the correct station with the way my code is currently structured. Or at least it appears that way to me. I still appreciate your help! – Aaron Nelson May 31 '17 at 21:01

You can also do it with a function. In your case, only the rows that you want have the style attribute. You can spin through all of the tags and accept only those that are tr and that have style.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('weather.htm'), 'lxml')
>>> def acceptable(tag):
...     return tag.name=='tr' and tag.has_attr('style')
... 
>>> for tag in soup.find_all(acceptable):
...     tag.text.replace('\n', '').split(':')
...     
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']

Edit, in response to comment:

Omit acceptable and use this.

>>> for tag in soup.find_all('tr'):
...     if tag.has_attr('style'):
...         tag.text.replace('\n', '').split(':')
...     elif 'not available' in tag.text.lower():
...         tag.text
...     else:
...         pass
...     
'Not Available'
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']
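
To tackle the alignment problem from the question, the same checks can be applied table by table instead of over all rows at once, so a station without data still contributes a placeholder. This is a minimal sketch that assumes one table per station. The sample HTML, the None placeholder, and the helper name aligned_levels are illustrative choices, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first station has no data, the second reports stages.
SAMPLE_HTML = """
<table><tbody><tr><td nowrap="">Not Available</td></tr></tbody></table>
<table><tbody>
  <tr style="line-height:20px"><td scope="col">Major Flood Stage:</td><td scope="col">18</td></tr>
  <tr style="line-height:20px"><td scope="col">Moderate Flood Stage:</td><td scope="col">15</td></tr>
  <tr style="line-height:20px"><td scope="col">Flood Stage:</td><td scope="col">13</td></tr>
  <tr style="line-height:20px"><td scope="col">Action Stage:</td><td scope="col">12</td></tr>
  <tr style="display:none"><td scope="col">Low Stage (in feet):</td><td scope="col">-9999</td></tr>
</tbody></table>
"""

STAGES = ("Major Flood Stage", "Moderate Flood Stage", "Flood Stage", "Action Stage")


def aligned_levels(html):
    """Return {stage: [value per table]}, using None where a table has no value."""
    soup = BeautifulSoup(html, "html.parser")
    levels = {stage: [] for stage in STAGES}
    for table in soup.find_all("table"):
        found = {}
        for tr in table.find_all("tr"):
            cells = tr.find_all("td")
            # Only styled rows with a label cell and a value cell carry stage data.
            if tr.has_attr("style") and len(cells) == 2:
                label = cells[0].get_text(strip=True).rstrip(":")
                found[label] = cells[1].get_text(strip=True)
        # Append exactly one entry per stage per table so the lists stay aligned.
        for stage in STAGES:
            levels[stage].append(found.get(stage))
    return levels


levels = aligned_levels(SAMPLE_HTML)
print(levels["Major Flood Stage"])
```

Each list ends up with one entry per table, so index i in every list refers to the same station. The Low Stage row is dropped simply because it is not listed in STAGES.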
Bill Bell
  • I really like this solution since I will be working with multiple rivers, which each have their own stages and alerts. However your solution still doesn't return the **Not Available** value in stations that aren't reporting data. I included the link for the Mississippi River in the OP that I am scraping to help visualize the Stage alerts better. Not all stations are reporting data, and this is where I'm having trouble. – Aaron Nelson May 31 '17 at 14:48
  • Excellent! Thanks Bill for your help! So I'm actually teaching myself Python and I'm doing this as a project. I was wondering if the `has_attr` and `text.replace` syntax are part of the **BeautifulSoup** library or built-ins. Could you point me to some documentation that I could look over to learn more about how and when to use them in the future? Thanks again for the help! – Aaron Nelson May 31 '17 at 15:34
  • We're all 'learning Python'. (Trust me.) Anyway, `has_attr` is part of BeautifulSoup and `text.replace` is pure Python. To work with BeautifulSoup you should be fully conversant with Python strings and its regex module. I simply read (and re-read and re-read) the accompanying documentation for these. For BS there's https://www.crummy.com/software/BeautifulSoup/bs4/doc/ which I find pretty bad, I must say, and the innumerable examples here on SO. – Bill Bell May 31 '17 at 16:03
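
That distinction can be seen in a small sketch (the sample row is made up for illustration): has_attr, spelled with an underscore, is a method of a BeautifulSoup Tag, whereas .text returns a plain string, so replace and split are ordinary str methods.

```python
from bs4 import BeautifulSoup

# Hypothetical single row, parsed on its own for demonstration.
row_html = '<tr style="color:black"><td>Flood Stage:</td>\n<td>13</td></tr>'
soup = BeautifulSoup(row_html, "html.parser")
row = soup.find("tr")

# has_attr belongs to the BeautifulSoup Tag object.
has_style = row.has_attr("style")

# .text hands back a regular Python str, so str methods apply from here on.
text = row.text
parts = text.replace("\n", "").split(":")

print(type(row).__name__, has_style)
print(type(text).__name__, parts)
```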