Extracting data from HTML in a bash script

Question

I want to get the data (line, destination, time and pos) displayed on this subway schedule page.

The code I have so far is:

#!/bin/bash
curl "http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint=">ruter.html
awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][0-9]/ {print $3, $5, $8, $10 }' ruter.html

I changed the awk part a little, to: awk -F "*td>|*tr>" '/<\/*t[rd]>.*/ {print $1, $2, $4, $5 }' ruter1.html, and got output like: Bergkrystallen via Majorstuen Ringen via Storo 1 min 2 ... (and so on for the six rows). It is still messy and still contains HTML tags. I know little about awk, so I can't improve it; I would rather use a loop or something similar that is easier to understand. – Lu Liu, Sep 07 '16 at 15:15
Another question: I also want to print only one row (the first subway time) where Pos=1 or Pos=2, depending on a command-line flag. For example, ./subway.sh -W should give the next subway time for platform 1, and ./subway.sh -E the next subway time for platform 2. – Lu Liu, Sep 07 '16 at 15:16

score 3 · Answer 1 · edited May 23 '17 at 12:08

Don't use regular expressions for this at all. Convert from HTML to XML, and use XPath -- a query language that works on document semantics, as opposed to mere text-matching:

url="http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint="

curl "$url" | \
  tidy -asxml -n -c -b -q --show-warnings no | \
  xmlstarlet sel -N h=http://www.w3.org/1999/xhtml \
    -t -m '//h:tr[h:td]' \
    -v './h:td[1]' -o $'\t' \
    -v './h:td[2]' -o $'\t' \
    -v './h:td[4]' -o $'\t' \
    -v './h:td[5]' -n | \
  column -s $'\t' -t

For the given input HTML, as of today, the output is:

5  Vestli via Majorstuen          nå      1
4  Vestli via Storo               2 min   2
5  Ringen via Majorstuen          4 min   1
5  Sognsvann                      7 min   2
4  Bergkrystallen via Majorstuen  10 min  1
5  Ringen via Storo               12 min  2
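
If you would rather handle the rows in a shell loop than in awk, the tab-separated output of xmlstarlet (before it is piped to column) can be read field by field. This is only a minimal sketch: it assumes the four fields arrive in the order line, destination, time, pos, and the function name print_departures and the wording of the output are made up for illustration.

```bash
#!/bin/bash
# print_departures: read tab-separated "line<TAB>destination<TAB>time<TAB>pos"
# rows on stdin and print one readable sentence per row.
# (Hypothetical helper; the field order is an assumption taken from the
# pipeline above.)
print_departures() {
  local line dest time pos
  # IFS is set to a tab only, so destinations containing spaces stay intact.
  while IFS=$'\t' read -r line dest time pos; do
    printf 'Line %s to %s: %s (pos %s)\n' "$line" "$dest" "$time" "$pos"
  done
}
```

To use it, drop the final column stage from the pipeline and pipe the xmlstarlet output into print_departures instead.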

The tools used here are:

HTML Tidy (for converting messy HTML into compliant XHTML)
XMLStarlet (for performing XPath queries)
column (for formatting the output into aligned columns)

Note also that $'\t' syntax requires that the shell in use really be bash (not /bin/sh).
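
Because the pipeline silently depends on several external programs, it can help to check for them up front. The helper below is a sketch, not something any of these tools provide; the name missing_tools is invented for this example.

```bash
#!/bin/bash
# missing_tools NAME...: print every NAME that cannot be found on PATH.
# Returns 0 when all tools are present, 1 when at least one is missing.
# (Hypothetical helper for illustration.)
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      status=1
    fi
  done
  return "$status"
}
```

A script could start with something like: missing_tools curl tidy xmlstarlet column || exit 1, which lists the missing programs and stops before any download is attempted.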

Thank you so much! I tried it as you wrote it but didn't get that result... do I also need to install something? Another problem: I need to print only one row (the next one) where Pos=1 or Pos=2, depending on a command-line flag (for example ./subway.sh -W or ./subway.sh -E). – Lu Liu, Sep 07 '16 at 14:36
You do need tidy and xmlstarlet installed -- if either isn't, there should be a self-explanatory error on stderr. As for filtering for a specific platform, you can make it `-m "//h:tr[h:td[5] = '1']"` or `'2'` as appropriate. — Charles Duffy, Sep 07 '16 at 16:34
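
Putting the flag handling and the platform filter together might look roughly like the sketch below. Several things here are assumptions: the mapping of -W to platform 1 and -E to platform 2 is taken from the question, the function names are invented, and wrapping the XPath in (...)[1] to keep only the first match relies on the page listing the soonest departure first.

```bash
#!/bin/bash
# Sketch of subway.sh: -W shows the next departure from platform 1,
# -E the next departure from platform 2 (mapping assumed from the question).
url="http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint="

# xpath_for_platform N: XPath matching only the first row whose 5th cell is N.
xpath_for_platform() {
  printf "(//h:tr[h:td[5] = '%s'])[1]" "$1"
}

main() {
  local platform="" opt OPTIND=1
  while getopts "WE" opt; do
    case $opt in
      W) platform=1 ;;
      E) platform=2 ;;
      *) echo "usage: $0 -W | -E" >&2; return 2 ;;
    esac
  done
  [[ -n $platform ]] || { echo "usage: $0 -W | -E" >&2; return 2; }

  # Same pipeline as in the answer, with the row filter swapped in.
  curl -s "$url" |
    tidy -asxml -n -c -b -q --show-warnings no |
    xmlstarlet sel -N h=http://www.w3.org/1999/xhtml \
      -t -m "$(xpath_for_platform "$platform")" \
      -v './h:td[1]' -o $'\t' \
      -v './h:td[2]' -o $'\t' \
      -v './h:td[4]' -o $'\t' \
      -v './h:td[5]' -n
}

# Run only when a flag was passed, so the file can also be sourced.
if (( $# > 0 )); then
  main "$@"
fi
```

Saved as subway.sh, it would be called as ./subway.sh -W or ./subway.sh -E; any other flag prints the usage line and returns status 2.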

score 3 · Answer 2 · answered Sep 06 '16 at 17:52

With links:

links -dump 'http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint='

Output:

   Linje Destinasjon                     Tid    Pos 
   Line  Destination                     Time   Pos 
   4     Vestli via Storo                3 min  2   
   5     Vestli via Majorstuen           3 min  1   
   5     Ringen via Majorstuen           5 min  1   
   5     Sognsvann                       11 min 2   
   4     Bergkrystallen via Majorstuen   12 min 1   
   5     Ringen via Storo                13 min 2
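
The text dump can also be narrowed to a single platform with a short awk filter. This is a sketch under two assumptions: the dump looks like the sample above (data rows start with a numeric line number and end with the Pos column), and rows are listed soonest first. The function name first_for_platform is made up for this example.

```bash
#!/bin/bash
# first_for_platform POS: read a links -dump style table on stdin and print
# the first departure row whose last column equals POS.
# (Hypothetical helper; relies on the layout shown in the sample output.)
first_for_platform() {
  awk -v pos="$1" '
    # Header rows start with text, so a numeric first field marks a data row.
    $1 ~ /^[0-9]+$/ && $NF == pos {
      gsub(/^[ \t]+|[ \t]+$/, "")   # trim the padding around the row
      print
      exit                          # stop at the first matching row
    }'
}
```

It would be used as: links -dump "$url" | first_for_platform 1, with url holding the address from the command above.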

Cyrus

Niiiice. This is a tool I'd forgotten about. – Charles Duffy Sep 06 '16 at 17:53
Or `lynx -dump`. – Cyrus Sep 06 '16 at 17:57
Isn't links much more actively maintained? My impression of lynx was that it was basically a dead project more than a decade ago. – Charles Duffy Sep 06 '16 at 17:58
That is possible. – Cyrus Sep 06 '16 at 18:03
No, the last update to lynx was on 2016/04/26. Did you know that lynx urine is a cure against kidney stones? – Casimir et Hippolyte Sep 06 '16 at 19:47
Thank you very much! But is it really just this one line? I tried the same as you, but it's not working... – Lu Liu Sep 07 '16 at 14:30
