How can I convert an HTML table to CSV?

Question

How do I convert the contents of an HTML table (<table>) to CSV format? Is there a library or linux program that does this? This is similar to copy tables in Internet Explorer, and pasting them into Excel.

Possible duplicate of [Export to CSV using jQuery and html](https://stackoverflow.com/questions/16078544/export-to-csv-using-jquery-and-html) — Dave Jarvis, Apr 21 '18 at 00:54

score 76 · Answer 1 · answered Sep 10 '09 at 02:33

76

This method is not really a library OR a program, but for ad hoc conversions you can

put the HTML for a table in a text file called something.xls
open it with a spreadsheet
save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.

But you probably would prefer a Perl or Ruby script...

answered Sep 10 '09 at 02:33

pavium

14,808
4
33
50

1

LibreOffice Calc will do this, but I've found it to be really slow at reading in big (a few MB) html table files in this way. So for small files, or one off use this is the easiest way, but if you have a load of files to convert a script is nice. – atomicules Sep 06 '11 at 11:10
Worked for me in LibreOffice Calc 4.4.3.2 – David Bullock Dec 09 '15 at 04:33
This doesn't work for tables with cells that spawn multiple lines inside a cell\ – Lime Sep 10 '18 at 21:46
2

Related: select the table data in your web browser and **Copy**. Open an editor that only handles plain text such as Notepad.exe or BBEdit.app and **Paste**. Save the document as a *.csv*. It may be tab-separated, so you can open it in a spreadsheet application and export as a properly formatted CSV. – Mat Gessel Aug 06 '19 at 23:31
Worked for me, even though I did not read the instructions carefully, and saved as **something.html** instead, then opened it in excel on a mac. (In the browser, I used the contextual popup menu of the inspector to copy the OuterHTML of the table element. It is much quicker than using the source.) – Harald Hanche-Olsen Dec 10 '20 at 08:45
If you are finding Excel can't open up the file with the HTML table in it because it's too large, consider splitting the HTML into multiple tables, opening each, and combining them in a single spreadsheet once you've been able to open them. My example today: table with 53 columns, 17421 rows, as an HTML table in a file, 58.8 MB. Split the table into 5 separate table files manually, opened and copied into a single .xlsx , 3.4 MB. – David Thompson Aug 10 '22 at 17:07
You can simply download the html file (like with `wget`) and then just `localc --convert-to csv `. – pattivacek Oct 06 '22 at 20:08

score 33 · Answer 2 · edited Oct 29 '15 at 15:56

Sorry for resurrecting an ancient thread, but I recently wanted to do this, but I wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.

The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig'  | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.

Here's the explanation:

Get the Contents of the URL using cURL, dump stderr to null (no progress meter)

curl "http://www.webpagewithtableinit.com/" 2>/dev/null

.

I only want Table elements (return only lines with TABLE,TR,TH,TD tags)

| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

.

Remove any Whitespace at the beginning of the line.

| sed 's/^[\ \t]*//g'

.

Remove newlines

| tr -d '\n\r'

.

Replace </TR> with newline

| sed 's/<\/TR[^>]*>/\n/Ig'

.

Remove TABLE and TR tags

| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'

.

Remove ^<TD>, ^<TH>, </TD>$, </TH>$

| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'

.

Replace </TD><TD> with comma

| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

.

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.

Hope this helps someone!

Cool idea, but doesn't work for a lot of HTML files. Mine have the data inside the table on its own line (i.e. `\ncell value\n`), and this script ends up stripping out that data because it removes every line that doesn't have a table tag in it. — Hayden Schiff, Dec 31 '15 at 01:01
cool stuff. it worked for me even with some new lines in data fields. great job! — VK Kashyap, Jan 03 '16 at 06:52
Worked great for me with minor tweaks, literally saved me hours! — user1747036, Oct 23 '17 at 09:35
@DRendar I like your answer, but I can't use this for http://www.dsebd.org/market_summary.php this link. — alhelal, Nov 29 '17 at 03:42

score 26 · Answer 3 · edited May 28 '22 at 19:58

26

Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/

require 'nokogiri'

doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
  end
  print "\n"
end

Worked for my basic test case.

edited May 28 '22 at 19:58

Benjamin W.

46,058
19
106
116

answered Sep 10 '09 at 04:16

audiodude

1,865
16
22

Thanks, this worked great for a 32M file which had 220k rows dumped from a database into a php table. (Clearly not an option for most spreadsheets!) – mm2001 Dec 08 '14 at 02:30
this is the reason why I love ruby so much. – vincentleest May 20 '15 at 19:47

score 13 · Answer 4 · edited May 28 '22 at 19:58

Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan or rowspan. Enjoy.

from HTMLParser import HTMLParser
import sys
import re


class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False


parser = HTMLTableParser() 
parser.feed(sys.stdin.read())

Thanks, this was useful! I've used the delimiter "," instead of "\t" though. — botismarius, Mar 02 '16 at 12:22
Thanks @botismarius. I added the option to specify delimiters as arguments. — Yuval, Sep 21 '16 at 08:59

score 13 · Answer 5 · answered Dec 08 '16 at 22:51

Just to add to these answers (as i've recently been attempting a similar thing) - if Google spreadsheets is your spreadsheeting program of choice. Simply do these two things.

1. Strip everything out of your html file around the Table opening/closing tags and resave it as another html file.

2. Import that html file directly into google spreadsheets and you'll have your information beautifully imported (Top tip: if you used inline styles in your table, they will be imported as well!)

Saved me loads of time and figuring out different conversions.

Thanks. Yes, Google does a lot of nifty clean ups before the final import. It works well. Do try https://docs.google.com/spreadsheets/u/0/ — Rahul, Apr 16 '17 at 12:38

score 6 · Answer 6 · edited Sep 12 '09 at 04:30

6

I'm not sure if there is pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.

edited Sep 12 '09 at 04:30

Brad Gilbert

33,846
11
78
129

answered Sep 10 '09 at 02:27

Chris Simmons

1,843
1
12
18

score 6 · Answer 7 · edited Feb 12 '20 at 03:10

Assuming that you've designed an HTML page containing a table, I would recommend this solution. Worked like charm for me:

$(document).ready(() => {
  $("#buttonExport").click(e => {
    // Getting values of current time for generating the file name
    const dateTime = new Date();
    const day      = dateTime.getDate();
    const month    = dateTime.getMonth() + 1;
    const year     = dateTime.getFullYear();
    const hour     = dateTime.getHours();
    const minute   = dateTime.getMinutes();
    const postfix  = `${day}.${month}.${year}_${hour}.${minute}`;

    // Creating a temporary HTML link element (they support setting file names)
    const downloadElement = document.createElement('a');

    // Getting data from our `div` that contains the HTML table
    const dataType  = 'data:application/vnd.ms-excel';
    const tableDiv  = document.getElementById('divData');
    const tableHTML = tableDiv.outerHTML.replace(/ /g, '%20');

    // Setting the download source
    downloadElement.href = `${dataType},${tableHTML}`;

    // Setting the file name
    downloadElement.download = `exported_table_${postfix}.xls`;

    // Trigger the download
    downloadElement.click();

    // Just in case, prevent default behaviour
    e.preventDefault();
  });
});

Courtesy: http://www.kubilayerdogan.net/?p=218

You can edit the file format to .csv here:

downloadElement.download = `exported_table_${postfix}.csv`;

Does this work in IE? When I click the export button, nothing happens. — Arcadian, Apr 20 '16 at 17:45

jmcnamara · Answer 8 · 2012-01-31T09:05:07.217

5

With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.

edited Jan 31 '12 at 09:05

answered Sep 10 '09 at 07:27

jmcnamara

38,196
6
90
108

score 5 · Answer 9 · answered Aug 03 '17 at 07:09

5

Here a simple solution without any external lib:

https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/

It works for me without any issue

answered Aug 03 '17 at 07:09

Met Kiani

834
1
9
19

score 4 · Answer 10 · edited May 28 '22 at 20:00

Here's the approach I took using only tr and sed:

< table.txt tr -d '\n' | 
sed -e 's/<tr[^>]*>/\n/g' -e 's/<[^>]*t[dh]>/,/g' -e 's/<[^>]*>//g'

Explanation

tr -d '\n' delete newlines
's/<tr[^>]*>/\n/g' convert tr tags into newlines to break data into table rows
's/<[^>]*t[dh]>/,/g' convert closing td/th tags into commas
's/<[^>]*>//g' delete all other html tags

Sample input
(from an Outlook email that attempted to render an HTML table using MsoNormal):

<table class=3D"MsoNormalTable" border=3D"0" cellspacing=3D"0" cellpadding=3D"0" width=3D"420" style=3D"width:315.0pt;border-collapse:collapse">
<tbody>
<tr style=3D"height:15.0pt">
<td width=3D"107" nowrap=3D"" style=3D"width:80.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"107" nowrap=3D"" valign=3D"bottom" style=3D"width:80.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"64" nowrap=3D"" valign=3D"bottom" style=3D"width:48.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"79" nowrap=3D"" valign=3D"bottom" style=3D"width:59.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"64" nowrap=3D"" valign=3D"bottom" style=3D"width:48.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
</tr>
<tr style=3D"height:6.75pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><b><span style=3D"color:black">ID</span></b><b><span style=3D"color:black"><o:p></o:p></span></b></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-left:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><b><span style=3D"color:black">Price<o:p></o:p></span></b></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-top:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">064159Q</span><span style=3D"color:black"><o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">121.85<o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-top:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">2420128</span><span style=3D"color:black"><o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">10.00<o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
</tbody>
</table>

Sample output


,,,,,
,,,,,
ID,Price,,,,
064159Q,121.85,,,,
2420128,10.00,,,,

See Non-greedy regex matching in sed for a discussion of the approach.

By far the coolest, most elegant solution; thank you! :-) As a single `sed` call, and considering headers too: `tr -d '\n' < table.csv | sed -e 's/]*>/\n/g' -e 's/<[^>]*t[dh]>/,/g' -e 's/<[^>]*>//g'` — hoijui, Feb 20 '22 at 14:41

score 4 · Answer 11 · edited May 28 '22 at 19:59

Based on audiodude's answer, but simplified by using the built-in CSV library

require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')

doc.xpath('//table//tr').each do |row|
    tarray = [] #temporary array
    row.xpath('td').each do |cell|
        tarray << cell.text #Build array of that row of data.
    end
    csv << tarray #Write that row out to csv file
end

csv.close

I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write this out as an array to the csv file in one step. But I could only figure out doing it by iterating over each cell and building the temporary array of each cell's content.

score 3 · Answer 12 · answered May 28 '22 at 19:54

Here's a method that uses pup and jq.

Assuming that infile.html contains one <table> element, we can select its rows using pup, and convert to JSON:

pup 'table tr json{}' --file infile.html

This returns an array of objects, with a children array for each row. For an example with a header row, two data rows, and three columns:

[
 {
  "children": [
   { "tag": "th", "text": "ID" },
   { "tag": "th", "text": "First name" },
   { "tag": "th", "text": "Last name" }
  ],
  "tag": "tr"
 },
 {
  "children": [
   { "tag": "td", "text": "123" },
   { "tag": "td", "text": "Anna" },
   { "tag": "td", "text": "Alphabet" }
  ],
  "tag": "tr"
 },
 {
  "children": [
   { "tag": "td", "text": "456" },
   { "tag": "td", "text": "Brandon" },
   { "tag": "td", "text": "Betazoid" }
  ],
  "tag": "tr"
 }
]

To convert that to CSV, we can use jq (see snippet):

pup 'table tr json{}' --file infile.html \
    | jq --raw-output 'map(.children | map(.text))[] | @csv'

resulting in

"ID","First name","Last name"
"123","Anna","Alphabet"
"456","Brandon","Betazoid"

Exactly what I was looking for. There was zero percent chance I was going to write code for this, because I was 95% sure `xidel` or `pup` was capable. So thanks! — TheDudeAbides, Sep 03 '23 at 03:53
Side note: on pages with multiple (or nested) tables like [this one](https://www.loc.gov/standards/iso639-2/php/English_list.php), you can use the browser's dev tools to get the exact CSS path to the target element and give that to `pup`. In Firefox, it's _Copy_ → _CSS Path_ when you right-click on the element in the Inspector pane. For that particular page, you'd use `pup 'table tbody tr td table tr json{}`. — TheDudeAbides, Sep 03 '23 at 03:54

score 2 · Answer 13 · edited May 23 '17 at 11:54

2

here's a few options

http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn

http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/

How can I scrape an HTML table to CSV?

https://addons.mozilla.org/en-US/firefox/addon/1852

edited May 23 '17 at 11:54

Community

1
1

answered Sep 10 '09 at 09:34

Gene T

5,156
1
24
24

score 2 · Answer 14 · edited May 28 '22 at 19:59

This is a very old thread, but may be someone like me will bump into it. I have made some additions for the audiodude's script to read the html from file instead adding it to the code, and another parameter that controls printing of the header lines.

the script should be run like that

ruby <script_name> <file_name> [<print_headers>]

the code is:

require 'nokogiri'

print_header_lines = ARGV[1]

File.open(ARGV[0]) do |f|

  table_string=f
  doc = Nokogiri::HTML(table_string)

  doc.xpath('//table//tr').each do |row|
    if print_header_lines
      row.xpath('th').each do |cell|
        print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
      end
    end
    row.xpath('td').each do |cell|
      print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
    end
    print "\n"
  end
end

score 1 · Answer 15 · edited May 28 '22 at 20:01

Here is an example using pQuery and Spreadsheet::WriteExcel:

use strict;
use warnings;

use Spreadsheet::WriteExcel;
use pQuery;

my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet    = $workbook->add_worksheet;
my $row = 0;

pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
    my $col = 0;
    pQuery( $_ )->find( 'td' )->each( sub{
        $sheet->write( $row, $col++, $_->innerHTML );
    });
    $row++;
});

$workbook->close;

The example simply extracts all tr tags that it finds into an excel file. You can easily tailor it to pick up specific table or even trigger a new excel file per table tag.

Further things to consider:

You may want to pick up td tags to create excel header(s).
And you may have issues with rowspan & colspan.

To see if rowspan or colspan is being used you can:

pQuery( $data )->find( 'td' )->each( sub{ 
    my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});

score 1 · Answer 16 · answered Jan 22 '10 at 00:41

OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then Paste Special in OpenOffice.org. It will query you for the file type, one of which should be HTML. Select that and voila!

score 1 · Answer 17 · edited May 28 '22 at 20:00

This is based on atomicules' answer but more succinct and also processes th (header) cells as well as td cells. I also added the strip method to get rid of the extra whitespaces.

CSV.open("output.csv", 'w') do |csv|
  doc.xpath('//table//tr').each do |row|
    csv << row.xpath('th|td').map {|cell| cell.text.strip}
  end
end

Wrapping the code inside the CSV block ensures that the file will be closed properly.

If you just want the text and don't need to write it to a file, you can use this:

doc.xpath('//table//tr').inject('') do |result, row|
  result << row.xpath('th|td').map {|cell| cell.text.strip}.to_csv
end

score 1 · Answer 18 · edited May 28 '22 at 19:59

Here's an updated version of Yuvai's answer, which properly handles fields that require quoting (i.e. fields that contain commas in the data, double quotes, or span multiple lines)

#!/usr/bin/env python3
from html.parser import HTMLParser
import sys
import re

class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim=","):
        HTMLParser.__init__(self)
        self.despace_re = re.compile("\s+")
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim
        self.quote_buffer = False
        self.buffer = None

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True
        elif tag == "br":
            self.quote_buffer = True
            self.buffer += self.row_delim

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False
        if self.buffer != None:
            # Quote if needed...
            if self.quote_buffer or self.cell_delim in self.buffer or "\"" in self.buffer:
                # Need to quote! First, replace all double-quotes with quad-quotes
                self.buffer = self.buffer.replace("\"", "\"\"")
                self.buffer = "\"{0}\"".format(self.buffer)
            sys.stdout.write(self.buffer)
            self.quote_buffer = False
            self.buffer = None

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            if self.buffer == None:
                self.buffer = ""
            self.buffer += self.despace_re.sub(" ", data).strip()
            self.data_interrupt = False

parser = HTMLTableParser() 
parser.feed(sys.stdin.read())

One enhancement for this script could be to add support for specifying a different line delimiter (or auto-calculate the platform-correct one), and a different column delimiter.

score 1 · Answer 19 · answered Mar 07 '22 at 00:40

you can convert html to csv using libreoffice or sed

libreoffice:

mkdir in out
cp -v *.html in
rename 's/([^.]+).html/$1.xls/g' in/*.html
## 59 is ;
## 44 is ,
libreoffice --convert-to 'csv:Text - txt - csv (StarCalc):59,,0,3' in/*.xls --outdir out

see https://developer-core.blogspot.com/2022/03/preobrazovanie-html-v-csv-ili-obrabotka-html-tablic-v-bash.html

or sed:

mkdir out
cp -v *.html out
sed -i ':a;N;$!ba
     s/<html.\+<table[^>]\+>//Ig
     s#\s*</td>\s*</tr>\s*<tr>\s*<td>\s*#\n#Ig
     s#\s*</td>\s*<td>\s*#;#Ig
     s/<[^>]\+>//g;s/\s\{2,\}//g' out/*.html
rename 's/([^.]+).html/$1.csv/g' out/*.html

see https://developer-core.blogspot.com/2022/03/preobrazovanie-html-v-csv-ili-obrabotka-html-tablic-v-bash-v2.html

An example can be found in the online bash "sandbox": https://onlinegdb.com/1oivp0uGm

score 0 · Answer 20 · answered Feb 15 '21 at 17:38

Read HTML File and Use Ruby's `CSV` and `nokogiri` to Output to `.csv`.

Based on @audiodude's answer but modified in the following ways:

Reads from a file to get the HTML. This is handy for long HTML tables, but easily modified to just use a static String if your HTML table is small.
Uses CSV's built-in library for converting an Array into a CSV row.
Outputs to a .csv file instead of just printing to STDOUT.
Gets both the table headers (th) and the table body (td).

# Convert HTML table to CSV format.

require "nokogiri"

html_file_path = ""

html_string = File.read( html_file_path )

doc = Nokogiri::HTML( html_string )

CSV.open( Rails.root.join( Time.zone.now.to_s( :file ) + ".csv" ), "wb" ) do |csv|
  doc.xpath( "//table//tr" ).each do |row|
    csv << row.xpath( "th|td" ).collect( &:text ).collect( &:strip )
  end
end

Osogtustack · Answer 21 · 2021-08-22T16:07:27.480

Depending on what you need you can simply:

var table ='';var selector='#customers';
document.querySelectorAll(`${selector} tr th`).forEach(h=>table+=`${h.innerText.trim()};`);table=table.trim();table+='\r\n';
document.querySelectorAll(`${selector} tr`).forEach(tr=>{tr.querySelectorAll('td').forEach(td=>table+=`${td.innerText.trim()};`);table+='\r\n';});

change "selector" to target your table and after executing "table" will have the contents of your csv

aditionally you can:

var a = document.createElement('a');a.href=`data:text/csv;base64,${btoa(table)}`;a.download="table.csv";a.click();

to download the contents of "table"

lezambranof · Answer 22 · 2023-05-01T20:53:40.457

Read HTML from the Web or a local file, and convert each HTML table to CSV

Considering that:

Usually the HTML source is on the Web
Frequently, a Web page contains many tables

We can require a script that accepts a URI and converts each HTML table to CSV. For this purpose I advice to use XPath and whatever programming language of your choice that implements an API for it.

The script

The following is a Bash script that calls a PHP script (note #!/usr/bin/env php on first line).

#!/usr/bin/env php
<?php
$input_file = $argv[1];
$html_string = file_get_contents($input_file) 
    or exit("Ops! unable to open file: ($input_file)");

$dom = new DOMDocument();
//@$dom->loadHTMLFile( $input_file ); 
@$dom->loadHTML($html_string);
$xpath = new DOMXPath($dom);

$base_name = basename($input_file, ".html");
$tables = $dom->getElementsByTagName('table');
for ($i=0, $n = $tables->length; $i < $n; ++$i) { 
    
    $file_csv = fopen("{$base_name}_table_".($i+1).".csv", 'w'); 
    $rows = $xpath->query(".//tr", $tables->item($i));
    foreach ($rows as $row) {
        
        $arr = array();
        $cells = $xpath->query("td|th", $row);
        foreach ($cells as $cell) {
            $arr[] = trim($cell->textContent);
        }
        fputcsv($file_csv, $arr);
        unset($arr);
    } //rows
    fclose($file_csv);
} //tables
?>

Save the script as html2csv and grant execute permissions:

$ chmod +x html2csv

Usage

Type ./html2csv URL or ./html2csv file.html

How can I convert an HTML table to CSV?

22 Answers22

Read HTML File and Use Ruby's `CSV` and `nokogiri` to Output to `.csv`.

Read HTML from the Web or a local file, and convert each HTML table to CSV

The script

Usage

Linked

How can I convert an HTML table to CSV?

22 Answers22

Read HTML File and Use Ruby's CSV and nokogiri to Output to .csv.

Read HTML from the Web or a local file, and convert each HTML table to CSV

The script

Usage

Linked

Read HTML File and Use Ruby's `CSV` and `nokogiri` to Output to `.csv`.