Don't use regular expressions for this at all. Convert from HTML to XML, and use XPath -- a query language that works on document semantics, as opposed to mere text-matching:
url="http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint="
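# Fetch the page, normalize it to XHTML, then emit one tab-separated line per table row.
# The XPath expressions are quoted so the shell never treats [1], [2], ... as glob patterns.
curl "$url" | \
  tidy -asxml -n -c -b -q --show-warnings no | \
  xmlstarlet sel -N h=http://www.w3.org/1999/xhtml \
    -t -m '//h:tr[h:td]' \
    -v './h:td[1]' -o $'\t' \
    -v './h:td[2]' -o $'\t' \
    -v './h:td[4]' -o $'\t' \
    -v './h:td[5]' -n | \
  column -s $'\t' -t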
For the given input HTML, as of today, the output is:
5  Vestli via Majorstuen          nå      1
4  Vestli via Storo               2 min   2
5  Ringen via Majorstuen          4 min   1
5  Sognsvann                      7 min   2
4  Bergkrystallen via Majorstuen  10 min  1
5  Ringen via Storo               12 min  2
The tools used here are:
- HTML Tidy (for converting messy HTML into compliant XHTML)
- XMLStarlet (for performing XPath queries)
- column (for formatting the output into aligned columns)
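Since the pipeline depends on all of these being installed, it can help to check for them up front. The following is only a sketch: the helper name missing_tools is made up for illustration, and the package that provides each tool varies by system.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The four commands used by the pipeline above.
missing=$(missing_tools curl tidy xmlstarlet column)
if [ -n "$missing" ]; then
  # Unquoted on purpose: one "Not installed" line per missing tool.
  printf 'Not installed: %s\n' $missing >&2
fi
```

If nothing is printed, every tool was found and the pipeline should at least start.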
Note also that the $'\t' syntax requires that the shell in use really be bash (not /bin/sh).
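If the script does have to run under a plain POSIX shell, one possible workaround is to let printf produce the tab character and keep it in a variable. This is a sketch under that assumption; the variable names and the sample row are invented for illustration and do not come from the real page.

```shell
#!/bin/sh
# Portable replacement for $'\t': command substitution keeps the tab,
# because only trailing newlines are stripped.
tab=$(printf '\t')

# Hypothetical sample row standing in for one line of xmlstarlet output.
row="5${tab}Vestli via Majorstuen${tab}1"

# Use "$tab" wherever the pipeline used $'\t'; here it is the field delimiter.
destination=$(printf '%s\n' "$row" | cut -d "$tab" -f 2)
printf '%s\n' "$destination"
```

In the pipeline itself, the same variable would be passed as -o "$tab" to xmlstarlet and as -s "$tab" to column.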