1

im trying to convert a file with an HTML table to CSV format. An excerpt from this file follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
Untitled Page
    </title></head>
    <body>
        <form name="form1" method="post" action="mypricelist.aspx" id="form1">
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />

<div>
    <table id="price_list" border="0">
<tr>
    <td>ProdCode</td><td>Description</td><td>Your Price</td>
</tr><tr>
    <td>ab101</td><td>loruem</td><td>1.1</td>
</tr><tr>
    <td>ab102</td><td>ipsum</td><td>0.1</td>
</tr><tr>

i tried using

    xls2csv -x -c\; evprice.xls > evprice.csv

but that gives me an error saying

    evprice.xls is not OLE file or Error

I googled. it said that is was because the file wasn't proper xls but just html.

When i try

    file evprice.xls

its says its html so found a 'solution', using libreoffice.

    libreoffice --headless -convert-to csv ./evprice.xls 

well this does not give an error but the csv output file is all weird, like opening an exe file in notepad.

it contains a lot of strange characters like these

    —¬ž­þ9ü~ÆóXþK¢

anyone know why this is happening, and got a working solution?

Charles Duffy
  • 280,126
  • 43
  • 390
  • 441
R. Leroi
  • 19
  • 1
  • 1
  • 6
  • Is the sample data you're working with publicly available? I don't know how anyone would be able to provide something which would work with a file in an uncertain format whose content we've never seen. – Charles Duffy Feb 25 '14 at 15:05
  • sorry, its not public. i can place a part of the file – R. Leroi Feb 25 '14 at 15:07
  • 1
    I wouldn't really describe that as an "XLS" file at all -- it's an HTML table, nothing Excel or XLS-related to it at all. – Charles Duffy Feb 25 '14 at 15:13
  • ...so, given that, this looks like a duplicate of http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv (though the accepted answer there isn't automated at all, there are others which are). – Charles Duffy Feb 25 '14 at 15:14
  • well, i don't know what kind of file it is. they say its a generated xls file but whatever. libreoffice can open it when i do it manually. why is it giving these weird characters when using the command line version? – R. Leroi Feb 25 '14 at 15:17
  • *shrug*. That's a question about LibreOffice, not a question about programming. StackOverflow probably isn't the right place for it. – Charles Duffy Feb 25 '14 at 15:22
  • You might try http://ask.libreoffice.org/ – Charles Duffy Feb 25 '14 at 15:24
  • [XSLT XML to CSV](http://stackoverflow.com/questions/952365/how-do-i-write-an-xslt-to-transform-xml-to-csv) – bobah Feb 25 '14 at 15:38

1 Answers1

1

I have built a Python utility which converts all the tables in an HTML file into separate CSV files.

You can find it here.

The crux of the script is this:

from BeautifulSoup import BeautifulSoup
import csv

filename = "MY_HTML_FILE"
fin      = open(filename,'r')

print "Opening file"
fin  = fin.read()

print "Parsing file"
soup = BeautifulSoup(fin,convertEntities=BeautifulSoup.HTML_ENTITIES)

print "Preemptively removing unnecessary tags"
[s.extract() for s in soup('script')]

print "CSVing file"
tablecount = -1
for table in soup.findAll("table"):
  tablecount += 1
  print "Processing Table #%d" % (tablecount)
  with open(sys.argv[1]+str(tablecount)+'.csv', 'wb') as csvfile:
    fout = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in table.findAll('tr'):
      cols = row.findAll(['td','th'])
      if cols:
        cols = [x.text for x in cols]
        fout.writerow(cols)
Richard
  • 56,349
  • 34
  • 180
  • 251
  • Very useful indeed. I encountered some errors as in **"UnicodeEncodeError: 'ascii' codec can't encode character at special name..."**; however, I could fix it by adding these lines at the top of the file: `import sys; reload sys; sys.setdefaultencoding('utf-8')`. The accepted answer to this [question](https://stackoverflow.com/questions/31137552/unicodeencodeerror-ascii-codec-cant-encode-character-at-special-name/31137935#31137935) is what I actually did and suggest in this comment. – Daniel Urencio May 30 '17 at 13:43