Because HTML is not a flat-text format, handling it with flat-text tools such as grep, sed or awk is not advisable. If the format of the HTML changes slightly (for example, if the span node gets another attribute or newlines are inserted somewhere), anything you build this way will have a tendency to break.
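To illustrate the point, here is a sketch (with a made-up markup snippet) of the kind of pattern-based extraction such tools end up doing, and how little it takes to break it:

import re

# made-up snippet of the sort of markup in question
html = '<span id="wob_hm">58%</span>'
print(re.search(r'<span id="wob_hm">([^<]*)</span>', html).group(1))
# Add another attribute ('<span class="c" id="wob_hm">58%</span>') or put a
# newline inside the tag, and the pattern silently stops matching.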
It is more robust (if more laborious) to use something that is built to parse HTML. In this case, I'd consider using Python because it has a (rudimentary) HTML parser in its standard library. It could look roughly like this:
#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting and assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but it makes this easier to adapt for other things and
    # is no more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're inside the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so in
# binary mode to get bytes from f.read() instead of a string (which would
# require the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()
    # then feed it the file. Since we read bytes, the contents have to be
    # decoded to a string first. I'm assuming latin1-encoded data here;
    # since the example looks German, "latin9" might also be appropriate.
    # Use the encoding in which your data is actually encoded.
    p.feed(f.read().decode("latin1"))

# trim (in case of newlines/spaces around the data), remove the % at the
# end, then print
print(re.compile('%$').sub('', p.data.strip()))
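For a quick sanity check, you can also run the class definition in an interactive session and feed the parser a snippet directly instead of a file (the snippet and its 58% value are made up):

p = my_parser()
p.feed('<div><span id="wob_hm">58%</span></div>')
print(p.data)   # prints 58%, before the trailing % is stripped by the regex above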
Addendum: Here's a backport to Python 2 that bulldozes right over encoding problems. For this case, that is arguably nicer because encoding doesn't matter for the data we want to extract and you don't have to know the encoding of the input file in advance. The changes are minor, and the way it works is exactly the same:
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())

print(re.compile('%$').sub('', p.data.strip()))