Extract text from HTML Table

Question

I want to extract the text from the table at http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm into a plain-text file, without HTML tags, from the Mac OS X command line.

I tried a lot of sed commands, but sed will only print the whole file again. What am I doing wrong?

Example of what I tried

sed -n '/<tr>/,/<\/tr>/p' scoretable.htm (will just print table contents with html tags :( )

Have you looked at related questions ([1](http://stackoverflow.com/questions/6854586/extract-data-from-html-table-with-bash-script), [2](http://stackoverflow.com/questions/10053793/how-can-i-extract-td-from-html-in-bash), etc.)? — Lev Levitsky, Apr 07 '12 at 14:56

Kaz · Accepted Answer · 2012-04-07T15:53:09.113

A little TXR web scraping, with the help of wget to grab the page:

@(deffilter nobr ("<br />" ""))
@(deffilter brsp ("<br />" " "))
@(deffilter nosp (" " ""))
@(next "!wget 2>/dev/null -O - http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm")
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
@(skip)
<div class="scoreTableArea">
@(collect)
<h2 class="unify">@year - @event</h2>
@  (filter brsp event)
@  (collect)
<tr>
<td class="center">@pos</td>
<td>@player</td>
<td>@company</td>
<td>@date</td>
<td class="center">@points</td>
</tr>
@  (filter nobr player company date points)
@  (filter nosp pos points)
@  (until)
</tbody>
@  (end)
@(end)
@(output :filter :from_html)
@  (repeat)

Event: @event
Year: @year

DATE       POS  PT  PLAYER           COMPANY
@    (repeat)
@{date -10}  @{pos -2}  @{points 2}  @{player 16} @company
@    (end)
@  (end)

@(end)

Sample run:

$ txr  scoretable.txr

Event: Teeing off to Clobber Ken
Year: 2011

DATE       POS  PT  PLAYER           COMPANY
 Sept 2011   1  40  John Durrant     King Sumners Partnership
 Sept 2011   2  34  Grahame Pettit   Amiri Construction
  Oct 2011   3  31  Tony Deacon      Gleeds
  Oct 2011   4  29  Tony Boyle       Lacey Hickey Caley 
  Oct 2011   5  29  Richard Hemming  Scott White and Hookins
 Sept 2011   6  29  Ian McCoy        Selway Joyce
 June 2011   7  27  Julian Larkin    C&G Properties
 Sept 2011   8  25  Roque Menezes    Capita Symonds
 June 2011   9  22  Shawn Lambert    PWP Architects
 Sept 2011  10  22  Kevin Lendon     Amiri Construction

Event: Ken Watson (HNW Architects) Undisputed Amiri Golf Demon of the Downs
Year: 2010

DATE       POS  PT  PLAYER           COMPANY
      2010   1  40  Ken Watson       HNW Architects
      2010   2  37  David Heda       London Clancy
      2010   3  34  Gordon Brown     Currie & Brown
      2010   4  32  Alistair Taylor  Wildbrook Properties
             5  30  Andy Goodridge   City Estates
             6  25  Russ Pitman      Henderson Green
             7  24  Phil Piper       Piper Whitlock 
             8  23  Kevin Miller     Urban Pulse Architects
             9  19  Simon Asquith    Godsall Arnold Partnership
            10  19  Shawn Lambert    PWP Architects
            11  18  Martin Judd      Davis Langdon

Note that `&nbsp;` in the HTML is being converted to the `U+00A0` space. — Kaz, Apr 07 '12 at 15:52
best to include a link to your TXR download. Good luck to all. — shellter, Apr 07 '12 at 22:18

score 2 · Answer 2 · answered Jun 06 '12 at 17:00

sed -n 's;</\?td>;;gp' scoretable.htm | \
sed -e 's;<td class="center">;;' \
    -e 's;<.*>;;'

Note that I use ; instead of / as my delimiter; I find it a bit easier to read. sed will use whatever character you put after the s as the delimiter.
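As a small illustration of the delimiter point, here is a sketch that runs the same substitution with both delimiters. The sample line is made up for the example and is not taken from the real page.

```shell
# Hypothetical sample line; not taken from the real page.
line='<td>Gleeds</td>'

# With / as the delimiter, the / inside the pattern has to be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/<\/td>//')

# With ; as the delimiter, the same pattern needs no escaping.
with_semicolon=$(printf '%s\n' "$line" | sed 's;</td>;;')

printf '%s\n%s\n' "$with_slash" "$with_semicolon"
```

Both commands remove the closing tag and leave the same text behind.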

Okay, now the explanation. The first line:

-n suppresses the default output, but the p at the end of the command tells sed to print every line on which the substitution succeeds. This gets us only the lines that contain <td> or </td> tags. At the same time, I'm finding anything that matches </\?td> and substituting it with nothing. /\? means the / may appear zero times or once, so this matches both the opening and closing tags. The g at the end, for global, means sed won't stop trying to match the pattern after it succeeds for the first time in a line. Without g it would only substitute the opening tag.
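The behaviour described above can be sketched on a few made-up lines. One assumption worth flagging: as far as I know, \? is a GNU sed extension and may not work with the sed that ships with Mac OS X, so this sketch uses -E (extended regular expressions), where a plain ? means "zero or one".

```shell
# Hypothetical sample input resembling one row of the score table.
sample='<tr>
<td class="center">1</td>
<td>John Durrant</td>
</tr>'

# -n suppresses default output; the p flag prints only the lines
# where at least one substitution happened. -E lets us write ? unescaped.
stripped=$(printf '%s\n' "$sample" | sed -n -E 's;</?td>;;gp')

printf '%s\n' "$stripped"
```

The <tr> and </tr> lines are dropped because nothing was substituted on them. The <td class="center"> tag survives this first pass, since it does not match a bare <td>, which is why a second pass is needed.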

The output from this is piped into sed again on the second line:

-e just specifies that there is an editing command to run. If you're just running one command it's implied, but here I run two (the next one is on the third line).

This removes <td class="center">, and the next line removes any other tags (in this case the <br> tags).
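A minimal sketch of the second pass, run on two made-up lines of the kind the first pass would leave behind. The two -e expressions are applied in order to each line.

```shell
# Hypothetical output of the first pass: one line with a leftover
# opening tag, one line with a trailing <br /> tag.
lines='<td class="center">1
Lacey Hickey Caley<br />'

# First expression strips the centered-cell tag, second strips whatever
# single tag is left on the line.
cleaned=$(printf '%s\n' "$lines" | sed -e 's;<td class="center">;;' -e 's;<.*>;;')

printf '%s\n' "$cleaned"
```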

The last command is only safe if you're sure there's at most one tag on a line. Otherwise, the .* will be greedy and match too much, so in:

<td class="center">24 <br />

it would match the entire line, and remove everything.
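One common way around the greedy match, offered here as a sketch rather than as part of the original answer, is to replace .* with [^>]*, so a match can never run past the first closing angle bracket. The sample line is the same kind of two-tag line as in the example above.

```shell
# Sample line with two tags on it.
line='<td class="center">24 <br />'

# Greedy: <.*> spans from the first < to the last >, so everything goes.
greedy=$(printf '%s\n' "$line" | sed -e 's;<.*>;;')

# Bounded: <[^>]*> stops at the first >, so each tag is removed on its own.
bounded=$(printf '%s\n' "$line" | sed -e 's;<[^>]*>;;g')

printf 'greedy=[%s]\nbounded=[%s]\n' "$greedy" "$bounded"
```

With the bounded pattern and the g flag, the separate expression for <td class="center"> is no longer strictly needed, though any whitespace between tags (such as the space after 24 here) is left in place.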
