
I seek a tool that can be run on the command line like so:

tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill.
  • An option to asciify any unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.
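For concreteness, here's a sketch of what I mean by those last two options (the helper names `asciify` and `fixup` are just illustrative):

```python
# Sketch of the asciify and regex-substitution options (names are illustrative).
import re
import unicodedata

def asciify(text):
    # Decompose accented characters (NFKD), then drop anything non-ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def fixup(text, pattern, replacement):
    # Apply an arbitrary user-supplied regex substitution to a cell.
    return re.sub(pattern, replacement, text)
```

Note that NFKD only helps for characters with decompositions (`é` becomes `e`); letters like `ø` have none and just get dropped.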

What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)
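To illustrate the BeautifulSoup idea, here is a minimal sketch of the two modes (summarize all tables, or dump the nth as CSV); the function names are mine, and a real version would need more robustness (nested tables, rowspans, etc.):

```python
# Minimal sketch of the BeautifulSoup approach (illustrative names, not a finished tool).
import csv
import sys
from bs4 import BeautifulSoup

def list_tables(html):
    """Yield (index, header cells, row count) for each table on the page."""
    soup = BeautifulSoup(html, "html.parser")
    for i, table in enumerate(soup.find_all("table"), start=1):
        rows = table.find_all("tr")
        header = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])] if rows else []
        yield i, header, len(rows)

def dump_table(html, n, out=sys.stdout):
    """Write the nth table (1-based) to out as CSV."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[n - 1]
    writer = csv.writer(out)
    for row in table.find_all("tr"):
        writer.writerow(c.get_text(strip=True) for c in row.find_all(["th", "td"]))
```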

dreeves
  • Sorry @dreeves, but Stack Overflow isn't a "free programmers for hire" service. – Frank Krueger Apr 09 '10 at 22:42
  • But where's the **Question**? – Frank Krueger Apr 09 '10 at 22:49
  • 1
    Edit the question if you would like to reformulate it. – Frank Krueger Apr 09 '10 at 23:48
  • (done; thanks. note that i ask this in case a tool like this exists. and in case not, i'm writing it myself and will share it here) – dreeves Apr 10 '10 at 14:28
  • If you already have [`jq`](https://jqlang.github.io/jq/) and [`pup`](https://github.com/ericchiang/pup) installed (both useful tools in their own right), then see [this answer](https://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv/72418856#72418856) for a one-liner that is very likely what you're looking for. If you want that _exact_ syntax, wrap it in a shell script that passes the first argument to an `:nth-of-type()` CSS selector. – TheDudeAbides Sep 03 '23 at 04:06

3 Answers


This is my first attempt:

http://yootles.com/outbox/tablescrape.py

It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics

it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:

1: [  1 cols,   1 rows] Contents 1 Men's rec
2: [  7 cols,  25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [  7 cols,  24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]

Then if you run it again, asking for the 2nd table,

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2

you get a reasonable plaintext data table:

100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]
dreeves

Using TestPlan, I produced a rough script. Given the complexity of web tables, it will likely need to be tailored for each site.

This first script lists the tables on the page:

# A simple table scraping example. It lists the tables on a page
#
# Cmds.Site = the URL to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
GotoURL %Cmds.Site%

set %Count% 1
foreach %Table% in (response //table)
    Notice Table #%Count%
    # find a suitable name, look back for a header
    set %Check% ./preceding::*[name()='h1' or name()='h2' or name()='h3'][1]
    if checkIn %Table% %Check%
        Notice (selectIn %Table% %Check%)
    end

    set %Count% as binOp %Count% + 1
end

The second script then extracts the data of one table into a CSV file.

# Generic extract of contents of a table in a webpage
# Use list_tables to get the list of tables and indexes
#
# Cmds.Site = the URL to scan
# Cmds.Index = Table index to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
default %Cmds.Index% 2

GotoURL %Cmds.Site%

set %Headers% //table[%Cmds.Index%]/tbody/tr[1]
set %Rows% //table[%Cmds.Index%]/tbody/tr[position()>1]

# Get and clean up the header fields
set %Fields% withvector
end
foreach %Header% in (response %Headers%/*)
    putin %Fields% (trim %Header%)
end
Notice %Fields%

# Create an output CSV
call unit.file.CreateDataFile with
    %Name% %This:Dir%/extract_table.csv
    %Format% csv
    %Fields% %Fields%
end
set %DataFile% %Return:Value%

# Now extract each row
foreach %Row% in (response %Rows%)
    set %Record% withvector
    end
    foreach %Cell% in (selectIn %Row% ./td)
        putin %Record% (trim %Cell%)
    end

    call unit.file.WriteDataFile with
        %DataFile% %DataFile%
        %Record% %Record%
    end
end

call unit.file.CloseDataFile with
    %DataFile% %DataFile%
end

My CSV file looks like the below. Note that wikipedia embeds extra information in each cell (hidden sort keys). There are many ways to get rid of it, but not in a generic fashion.

Shot put,22.47 m,"Timmermann, UlfUlf Timmermann",East Germany (GDR),1988 1988 Seoul,"01988-09-23 September 23, 1988",[25]
Discus throw,69.89 m,"Alekna, VirgilijusVirgilijus Alekna",Lithuania (LTU),2004 2004 Athens,"02004-08-23 August 23, 2004",[26]
Hammer throw,84.80 m,"Litvinov, SergeySergey Litvinov",Soviet Union (URS),1988 1988 Seoul,"01988-09-26 September 26, 1988",[27]
Javelin throw,90.57 m,"Thorkildsen, AndreasAndreas Thorkildsen",Norway (NOR),2008 2008 Beijing,"02008-08-23 August 23, 2008",[28]
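For what it's worth, a few heuristic regexes tuned to the patterns visible in this output (not a generic solution, and `clean_cell` is just an illustrative name) can strip those sort keys:

```python
# Heuristic cleanup of wikipedia's hidden sort keys, tuned to this page's patterns.
import re

def clean_cell(cell):
    # "01988-09-23 September 23, 1988" -> "September 23, 1988"
    cell = re.sub(r"^\d{5}-\d{2}-\d{2}\s*", "", cell)
    # "1988 1988 Seoul" -> "1988 Seoul"
    cell = re.sub(r"^(\d{4}) \1", r"\1", cell)
    # "Timmermann, UlfUlf Timmermann" -> "Ulf Timmermann"
    cell = re.sub(r"^(.+), (.+)\2 \1$", r"\2 \1", cell)
    return cell
```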
edA-qa mort-ora-y
  • Thanks! This looks nice. In the version I wrote I ended up putting in special cases to remove what you're calling wikipedia's extra information. – dreeves Apr 24 '10 at 15:14

Using jq and pup, and a tip of the hat to this SO answer:

#!/bin/bash
# tablescrape - convert nth HTML table on a page to CSV or tab-delimited
# author: https://stackoverflow.com/users/785213
# source: https://stackoverflow.com/a/77031218
set -u

input=${1:?"Expected a file, URL, or '-' as the first argument."}
nth=${2:-1}
mode=${3:-csv}

(
    if [[ -r $input || $input == - ]]; then
        cat "$input"
    else
        # '--location' means "follow redirects"
        curl --silent --show-error --location "$input"
    fi
) \
  | pup "table.wikitable:nth-of-type($nth) tr json{}" \
  | jq --raw-output '.[]
      | [
          .children[]                            # all .children of <tr>s
            | select(.tag=="td" or .tag=="th")   # that are <td>s or <th>s
            | [ .. | .text? ]                    # recurse, looking for .text
            | map(select(.))                     # filter out empty nodes
            | join(" ")                          # concatenate .texts together
        ]                                        # all <td>s/<th>s as an array
      | @'$mode                                  # join together as CSV or TSV

Both jq and pup are super-useful on their own. It seemed like one of those tools (or else xidel) should be able to extract HTML tables directly to a delimited text file, but I guess it isn't so. Fortunately, pipes, man. They're so good!

Usage

RECORDS='https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics'

# read from a URL
./tablescrape $RECORDS 2

# read from a pipe or redirection
curl -sS $RECORDS | ./tablescrape - 1 tsv

curl -sS $RECORDS > records.html
< records.html ./tablescrape - 1 tsv

# read from a file
./tablescrape records.html 1 tsv

Update: Oops, I spoke too soon. Xidel can do it:

# NB: uses Bash's "ANSI-C quoting" feature for a literal tab character
xidel -s https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics \
  -e $'//table[2]//tr/string-join(td|th,"\t")' \
  | column -t -s$'\t'

Output:

Event                 Record      Athlete(s)                                                   Nation                Games                Date                Ref(s)
100 metres            10.61       Elaine Thompson Herah                                         Jamaica (JAM)        2020 Tokyo           July 31, 2021       [32]
200 metres            ♦21.34      Florence Griffith Joyner                                      United States (USA)  1988 Seoul           September 29, 1988  [33][34]
400 metres            48.25       Marie-José Pérec                                              France (FRA)         1996 Atlanta         July 29, 1996       [35]
⋮
3,000 m steeplechase  8:58.81     Gulnara Galkina-Samitova                                      Russia (RUS)         2008 Beijing         August 17, 2008     [41]
4×100 m relay         ♦40.82      Tianna MadisonAllyson FelixBianca KnightCarmelita Jeter       United States (USA)  2012 London          August 10, 2012     [42]
⋮

That smooshes the names together in the relay events, and there are some pesky leading spaces on the "Nation" column, but it gets you pretty close with not a lot of effort.

TheDudeAbides