
I want to get the data (line, destination, time and pos) displayed on this subway schedule page.

The code I have so far is:

#!/bin/bash
curl "http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint=">ruter.html
awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][0-9]/ {print $3, $5, $8, $10 }' ruter.html
Alison R.
Lu Liu
  • I changed the awk part of this code a little, to: awk -F "*td>|*tr>" '/<\/*t[rd]>.*/ {print $1, $2, $4, $5 }' ruter1.html, and it produced output like: Bergkrystallen via Majorstuen Ringen via Storo 1 min 2 ... (and so on for the 6 groups). It still looks messy and contains HTML tags. I know little about awk, so I can't improve it. I would rather use a loop or something that is easier to understand. – Lu Liu Sep 07 '16 at 15:15
  • Another question: I also want to print only one row (the first subway time) where Pos=1 or Pos=2, depending on a command-line flag. For example, ./subway.sh -W should give the first subway time for platform 1, and ./subway.sh -E the first subway time for platform 2. – Lu Liu Sep 07 '16 at 15:16

2 Answers


Don't use regular expressions for this at all. Convert from HTML to XML, and use XPath -- a query language that works on document semantics, as opposed to mere text-matching:

url="http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint="

# XPath expressions are quoted so the shell never treats [1], [2], ... as glob patterns.
curl "$url" | \
  tidy -asxml -n -c -b -q --show-warnings no | \
  xmlstarlet sel -N h=http://www.w3.org/1999/xhtml \
    -t -m '//h:tr[h:td]' \
    -v './h:td[1]' -o $'\t' \
    -v './h:td[2]' -o $'\t' \
    -v './h:td[4]' -o $'\t' \
    -v './h:td[5]' -n | \
  column -s $'\t' -t

For the given input HTML, as of today, the output is:

5  Vestli via Majorstuen          nå      1
4  Vestli via Storo               2 min   2
5  Ringen via Majorstuen          4 min   1
5  Sognsvann                      7 min   2
4  Bergkrystallen via Majorstuen  10 min  1
5  Ringen via Storo               12 min  2

The tools used here are:

  • HTML Tidy (for converting messy HTML into compliant XHTML)
  • XMLStarlet (for performing XPath queries)
  • column (for formatting the output into aligned columns)

Note also that $'\t' syntax requires that the shell in use really be bash (not /bin/sh).
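
Since the pipeline depends on several external tools, it can help to check for them before running it. The following is a minimal sketch, not part of the original answer; missing_tools is a hypothetical helper name, and it relies only on the shell's command -v builtin.

```shell
#!/bin/bash
# Print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper introduced for illustration.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example use at the top of a script:
#   missing=$(missing_tools curl tidy xmlstarlet column)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```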

Charles Duffy
  • Thank you so much! But I tried it as you wrote it and didn't get the result... should I also install something? Another problem is that I need to print only one row (the first one) where Pos=1 or Pos=2, depending on a command-line flag, for example ./subway.sh -W or ./subway.sh -E. – Lu Liu Sep 07 '16 at 14:36
  • You do need tidy and xmlstarlet installed -- if either isn't, there should be a self-explanatory error on stderr. As for filtering for a specific platform, you can make it `-m "//h:tr[h:td[5] = '1']"` or `'2'` as appropriate. – Charles Duffy Sep 07 '16 at 16:34
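
As an alternative to changing the XPath match, the -W / -E behaviour asked for in the comments could be sketched as a filter on the tab-separated rows that the xmlstarlet step emits. This is only an illustration: platform_for_flag and first_departure are hypothetical names, and the mapping of -W to platform 1 and -E to platform 2 is taken from the question's comments, not from the site.

```shell
#!/bin/bash
# Hypothetical helpers; the -W/-E to platform mapping is assumed from the question.

# Map a command-line flag to a platform number (the "Pos" column).
platform_for_flag() {
  case "$1" in
    -W) echo 1 ;;
    -E) echo 2 ;;
    *)  echo "usage: subway.sh -W|-E" >&2; return 1 ;;
  esac
}

# Read tab-separated rows (line, destination, time, pos) on stdin and
# print only the first row whose fourth field equals the given platform.
first_departure() {
  awk -F '\t' -v pos="$1" '$4 == pos { print; exit }'
}
```

In a script, the last stage of the pipeline above (column -s $'\t' -t) could be replaced by first_departure "$(platform_for_flag "$1")", so that only the first matching row is printed.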

With links:

links -dump 'http://mon.ruter.no/SisMonitor/Refresh?stopid=3010370&computerid=acba4167-b79f-4f8f-98a6-55340b1cddb3&isOnLeftSide=true&blocks=&rows=6&test=&stopPoint='

Output:

   Linje Destinasjon                     Tid    Pos 
   Line  Destination                     Time   Pos 
   4     Vestli via Storo                3 min  2   
   5     Vestli via Majorstuen           3 min  1   
   5     Ringen via Majorstuen           5 min  1   
   5     Sognsvann                       11 min 2   
   4     Bergkrystallen via Majorstuen   12 min 1   
   5     Ringen via Storo                13 min 2 
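
The dump can also be filtered by platform. The sketch below assumes the layout shown above stays stable: the line number is the first field, Pos is the last field, and the two header rows do not start with a number. first_from_dump is a hypothetical helper name.

```shell
#!/bin/bash
# Read a `links -dump` style table on stdin and print the first data row
# for the given platform, with runs of spaces squeezed to single spaces.
# Header rows are skipped because their first field is not a number.
first_from_dump() {
  awk -v pos="$1" '$1 ~ /^[0-9]+$/ && $NF == pos { $1 = $1; print; exit }'
}

# Example use (URL as in the command above):
#   links -dump "$url" | first_from_dump 1
```

Because the destination column contains spaces, the sketch never splits on a fixed column count. It only inspects the first and last fields of each row.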
Cyrus