
I'd like to create a script that grabs two values from this awful HTML published on a city website:

558.35

and

66.0

These are water reservoir details and change weekly.

I'm unsure what the best tool for this is. Would grep work?

Thanks for any suggestions or ideas!

<table>
    <tbody>
        <tr>
            <td>&nbsp;Currently:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 558.35</td>
        </tr>
        <tr>
            <td>&nbsp;Percent of capacity:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;66.0%</td>
        </tr>
    </tbody>
</table>
Dan
  • If you use PHP then you could use DOMDocument. – StackSlave Dec 23 '15 at 03:38
  • Is this a skill you hope to improve on? Then learn Scrapy, BeautifulSoup and others. Python has a healthy ecosystem for web scraping, but as HTML gets more baroque, you'll have to keep that skill up to date for it to stay useful. If you just want to grab those 2 values and won't be doing anything else for years, then post a bounty for an `xmllint` or `xmlstarlet` solution. If it's really this simple, you might also find an `awk` solution, but once the data proves more complex than what you've indicated here, all bets are off ;-) Good luck. – shellter Dec 23 '15 at 03:43
  • Thanks, these are solutions I'll explore! – Dan Dec 23 '15 at 03:49
  • Regular expressions are the worst tool to parse/scrape HTML. You may be interested in [this link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – bansi Dec 23 '15 at 03:56

1 Answer


If you want to use a regex, you can use sed:

sed -nr 's#^[ ]*<td>.*;[ ]?([0-9]+[.][0-9]+)[%]?</td>[ ]*$#\1#p' my_html_file

An HTML parser, such as Python's BeautifulSoup module, or a JavaScript approach is a safer choice.
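To give an idea of what the parser route could look like, here is a minimal sketch. It uses Python's standard-library `html.parser` rather than BeautifulSoup, so nothing needs to be installed. The names `CellCollector`, `reservoir_values` and `SAMPLE_HTML` are made up for this example, and it assumes the table keeps its current shape: alternating label and value cells, one pair per row.

```python
from html.parser import HTMLParser

# Hypothetical sample: the table from the question, pasted in as a string.
SAMPLE_HTML = """
<table>
    <tbody>
        <tr>
            <td>&nbsp;Currently:</td>
            <td>&nbsp; &nbsp; &nbsp; 558.35</td>
        </tr>
        <tr>
            <td>&nbsp;Percent of capacity:</td>
            <td>&nbsp; &nbsp; &nbsp;66.0%</td>
        </tr>
    </tbody>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as "\xa0".
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def reservoir_values(html):
    """Return {label: number}, assuming cells alternate label, value."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    # Turn non-breaking spaces into plain spaces, then trim.
    cells = [c.replace("\xa0", " ").strip() for c in parser.cells]
    labels, values = cells[0::2], cells[1::2]
    return {
        label.rstrip(":"): float(value.rstrip("%"))
        for label, value in zip(labels, values)
    }


print(reservoir_values(SAMPLE_HTML))
# {'Currently': 558.35, 'Percent of capacity': 66.0}
```

In a real script you would read the page into a string first (for example with `urllib.request`) and pass that to `reservoir_values`. If the page contains other tables, the cell pairing above would need to be narrowed down to the right one.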

EDIT:

Here is a snippet using JavaScript. The results are logged to the console, and an alert box pops up to show them.

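// Grab every <td> once instead of querying the DOM on each iteration.
var cells = document.getElementsByTagName('td');
var values = "";
// Start at 1 to skip the "Currently:" label cell; the second label is stripped by the regex.
for (var i = 1; i < cells.length; ++i) {
    // Remove &nbsp; entities, the remaining label text, spaces and the percent sign.
    values += " " + cells[i].innerHTML.replace(/&nbsp;|Percent of capacity:|[ %]/g, "");
}
alert(values);
console.log(values);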
<table>
    <tbody>
        <tr>
            <td>&nbsp;Currently:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 558.35</td>
        </tr>
        <tr>
            <td>&nbsp;Percent of capacity:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;66.0%</td>
        </tr>
    </tbody>
</table>
repzero