
I seek a tool that can be run on the command line like so:

tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill.
  • An option to asciify any unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.
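For concreteness, here's a sketch of what I mean by those last two options (the helper names `asciify` and `fixup` are just illustrative):

```python
# Sketch of the asciify and regex-substitution options (names are illustrative).
import re
import unicodedata

def asciify(text):
    # Decompose accented characters (NFKD), then drop anything non-ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def fixup(text, pattern, replacement):
    # Apply an arbitrary user-supplied regex substitution to a cell.
    return re.sub(pattern, replacement, text)
```

Note that NFKD only helps for characters with decompositions (`é` becomes `e`); letters like `ø` have none and just get dropped.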

What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)
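To illustrate the BeautifulSoup idea, here is a minimal sketch of the two modes (summarize all tables, or dump the nth as CSV); the function names are mine, and a real version would need more robustness (nested tables, rowspans, etc.):

```python
# Minimal sketch of the BeautifulSoup approach (illustrative names, not a finished tool).
import csv
import sys
from bs4 import BeautifulSoup

def list_tables(html):
    """Yield (index, header cells, row count) for each table on the page."""
    soup = BeautifulSoup(html, "html.parser")
    for i, table in enumerate(soup.find_all("table"), start=1):
        rows = table.find_all("tr")
        header = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])] if rows else []
        yield i, header, len(rows)

def dump_table(html, n, out=sys.stdout):
    """Write the nth table (1-based) to out as CSV."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[n - 1]
    writer = csv.writer(out)
    for row in table.find_all("tr"):
        writer.writerow(c.get_text(strip=True) for c in row.find_all(["th", "td"]))
```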

dreeves
  • Sorry @dreeves, but Stack Overflow isn't a "free programmers for hire" service. – Frank Krueger Apr 09 '10 at 22:42
  • But where's the **Question**? – Frank Krueger Apr 09 '10 at 22:49
  • 1
    Edit the question if you would like to reformulate it. – Frank Krueger Apr 09 '10 at 23:48
  • (done; thanks. note that i ask this in case a tool like this exists. and in case not, i'm writing it myself and will share it here) – dreeves Apr 10 '10 at 14:28
  • If you already have [`jq`](https://jqlang.github.io/jq/) and [`pup`](https://github.com/ericchiang/pup) installed (both useful tools in their own right), then see [this answer](https://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv/72418856#72418856) for a one-liner that is very likely what you're looking for. If you want that _exact_ syntax, wrap it in a shell script that passes the first argument to an `:nth-of-type()` CSS selector. – TheDudeAbides Sep 03 '23 at 04:06

3 Answers


This is my first attempt:

http://yootles.com/outbox/tablescrape.py

It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics

it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:

1: [  1 cols,   1 rows] Contents 1 Men's rec
2: [  7 cols,  25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [  7 cols,  24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]

Then if you run it again, asking for the 2nd table,

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2

you get a reasonable plaintext data table:

100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]
dreeves

Using TestPlan, I produced a rough script. Given the complexity of web tables, it will likely need to be tailored for each site.

This first script lists the tables on the page:

# A simple table scraping example. It lists the tables on a page
#
# Cmds.Site = the URL to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
GotoURL %Cmds.Site%

set %Count% 1
foreach %Table% in (response //table)
    Notice Table #%Count%
    # find a suitable name, look back for a header
    set %Check% ./preceding::*[name()='h1' or name()='h2' or name()='h3'][1]
    if checkIn %Table% %Check%
        Notice (selectIn %Table% %Check%)
    end

    set %Count% as binOp %Count% + 1
end

The second script then extracts the data of one table into a CSV file.

# Generic extract of contents of a table in a webpage
# Use list_tables to get the list of tables and indexes
#
# Cmds.Site = the URL to scan
# Cmds.Index = Table index to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
default %Cmds.Index% 2

GotoURL %Cmds.Site%

set %Headers% //table[%Cmds.Index%]/tbody/tr[1]
set %Rows% //table[%Cmds.Index%]/tbody/tr[position()>1]

# Get and clean up the header fields
set %Fields% withvector
end
foreach %Header% in (response %Headers%/*)
    putin %Fields% (trim %Header%)
end
Notice %Fields%

# Create an output CSV
call unit.file.CreateDataFile with
    %Name% %This:Dir%/extract_table.csv
    %Format% csv
    %Fields% %Fields%
end
set %DataFile% %Return:Value%

# Now extract each row
foreach %Row% in (response %Rows%)
    set %Record% withvector
    end
    foreach %Cell% in (selectIn %Row% ./td)
        putin %Record% (trim %Cell%)
    end

    call unit.file.WriteDataFile with
        %DataFile% %DataFile%
        %Record% %Record%
    end
end

call unit.file.CloseDataFile with
    %DataFile% %DataFile%
end

My CSV file looks like the below. Note that wikipedia embeds extra information in each cell (hidden sort keys). There are many ways to get rid of it, but not in a generic fashion.

Shot put,22.47 m,"Timmermann, UlfUlf Timmermann",East Germany (GDR),1988 1988 Seoul,"01988-09-23 September 23, 1988",[25]
Discus throw,69.89 m,"Alekna, VirgilijusVirgilijus Alekna",Lithuania (LTU),2004 2004 Athens,"02004-08-23 August 23, 2004",[26]
Hammer throw,84.80 m,"Litvinov, SergeySergey Litvinov",Soviet Union (URS),1988 1988 Seoul,"01988-09-26 September 26, 1988",[27]
Javelin throw,90.57 m,"Thorkildsen, AndreasAndreas Thorkildsen",Norway (NOR),2008 2008 Beijing,"02008-08-23 August 23, 2008",[28]
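For what it's worth, a few heuristic regexes tuned to the patterns visible in this output (not a generic solution, and `clean_cell` is just an illustrative name) can strip those sort keys:

```python
# Heuristic cleanup of wikipedia's hidden sort keys, tuned to this page's patterns.
import re

def clean_cell(cell):
    # "01988-09-23 September 23, 1988" -> "September 23, 1988"
    cell = re.sub(r"^\d{5}-\d{2}-\d{2}\s*", "", cell)
    # "1988 1988 Seoul" -> "1988 Seoul"
    cell = re.sub(r"^(\d{4}) \1", r"\1", cell)
    # "Timmermann, UlfUlf Timmermann" -> "Ulf Timmermann"
    cell = re.sub(r"^(.+), (.+)\2 \1$", r"\2 \1", cell)
    return cell
```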
edA-qa mort-ora-y
  • Thanks! This looks nice. In the version I wrote I ended up putting in special cases to remove what you're calling wikipedia's extra information. – dreeves Apr 24 '10 at 15:14

Using jq and pup, and a tip of the hat to this SO answer:

#!/bin/bash
# tablescrape - convert nth HTML table on a page to CSV or tab-delimited
# author: https://stackoverflow.com/users/785213
# source: https://stackoverflow.com/a/77031218
set -u

input=${1:?"Expected a file, URL, or '-' as the first argument."}
nth=${2:-1}
mode=${3:-csv}

(
    if [[ -r $input || $input == - ]]; then
        cat "$input"
    else
        # '--location' means "follow redirects"
        curl --silent --show-error --location "$input"
    fi
) \
  | pup "table.wikitable:nth-of-type($nth) tr json{}" \
  | jq --raw-output '.[]
      | [
          .children[]                            # all .children of <tr>s
            | select(.tag=="td" or .tag=="th")   # that are <td>s or <th>s
            | [ .. | .text? ]                    # recurse, looking for .text
            | map(select(.))                     # filter out empty nodes
            | join(" ")                          # concatenate .texts together
        ]                                        # all <td>s/<th>s as an array
      | @'$mode                                  # join together as CSV or TSV

Both jq and pup are super-useful on their own. It seemed like one of those tools (or else xidel) should be able to extract HTML tables directly to a delimited text file, but I guess it isn't so. Fortunately, pipes, man. They're so good!

Usage

RECORDS='https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics'

# read from a URL
./tablescrape $RECORDS 2

# read from a pipe or redirection
curl -sS $RECORDS | ./tablescrape - 1 tsv

curl -sS $RECORDS > records.html
< records.html ./tablescrape - 1 tsv

# read from a file
./tablescrape records.html 1 tsv

Update: Oops, I spoke too soon. Xidel can do it:

# NB: uses Bash's "ANSI-C quoting" feature for a literal tab character
xidel -s https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics \
  -e $'//table[2]//tr/string-join(td|th,"\t")' \
  | column -t -s$'\t'

Output:

Event                 Record      Athlete(s)                                                   Nation                Games                Date                Ref(s)
100 metres            10.61       Elaine Thompson Herah                                         Jamaica (JAM)        2020 Tokyo           July 31, 2021       [32]
200 metres            ♦21.34      Florence Griffith Joyner                                      United States (USA)  1988 Seoul           September 29, 1988  [33][34]
400 metres            48.25       Marie-José Pérec                                              France (FRA)         1996 Atlanta         July 29, 1996       [35]
⋮
3,000 m steeplechase  8:58.81     Gulnara Galkina-Samitova                                      Russia (RUS)         2008 Beijing         August 17, 2008     [41]
4×100 m relay         ♦40.82      Tianna MadisonAllyson FelixBianca KnightCarmelita Jeter       United States (USA)  2012 London          August 10, 2012     [42]
⋮

That smooshes the names together in the relay events, and there are some pesky leading spaces on the "Nation" column, but it gets you pretty close with not a lot of effort.

TheDudeAbides