Because HTML is not a flat-text format, handling it with flat-text tools such as grep, sed or awk is not advisable. If the format of the HTML changes slightly (for example, if the span node gets another attribute or newlines are inserted somewhere), anything you build this way will have a tendency to break.
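To illustrate the point, here is a sketch (with a made-up markup snippet) of the kind of pattern-based extraction such tools end up doing, and how little it takes to break it:

import re

# made-up snippet of the sort of markup in question
html = '<span id="wob_hm">58%</span>'
print(re.search(r'<span id="wob_hm">([^<]*)</span>', html).group(1))
# Add another attribute ('<span class="c" id="wob_hm">58%</span>') or put a
# newline inside the tag, and the pattern silently stops matching.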
It is more robust (if more laborious) to use something that is built to parse HTML. In this case, I'd consider using Python because it has a (rudimentary) HTML parser in its standard library. It could look roughly like this:
#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting and assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but it makes this easier to adapt for other things and
    # is no more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're inside the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so in
# binary mode to get bytes from f.read() instead of a string (which would
# require the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()
    # then feed it the file. Since we read bytes, the contents have to be
    # decoded to a string first. I'm assuming latin1-encoded data here;
    # since the example looks German, "latin9" might also be appropriate.
    # Use the encoding in which your data is actually encoded.
    p.feed(f.read().decode("latin1"))

# trim (in case of newlines/spaces around the data), remove the % at the
# end, then print
print(re.compile('%$').sub('', p.data.strip()))
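For a quick sanity check, you can also run the class definition in an interactive session and feed the parser a snippet directly instead of a file (the snippet and its 58% value are made up):

p = my_parser()
p.feed('<div><span id="wob_hm">58%</span></div>')
print(p.data)   # prints 58%, before the trailing % is stripped by the regex above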
Addendum: Here's a backport to Python 2 that bulldozes right over encoding problems. For this case, that is arguably nicer because encoding doesn't matter for the data we want to extract and you don't have to know the encoding of the input file in advance. The changes are minor, and the way it works is exactly the same:
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())

print(re.compile('%$').sub('', p.data.strip()))