I'm puzzled by the format of these .xls files, because they are not really .xls files. I've put the first few lines of one file below for reference; the full file is here.

Converting normal .xls files is no problem with p.save_book_as(file_name=fname, dest_file_name=fname+'x').

I would like to convert these files to .xlsx in bulk with Python. Is this even possible with the format below?

MIME-Version: 1.0
X-Document-Type: Workbook
Content-Type: multipart/related; boundary="----=_NextPart_86ab7b61_9054_45ca_a3a6_49bc8ebc61db"

This document is a Single File Web Page, also known as a Web Archive file.  If you are seeing this message, your browser or editor doesn't support Web Archive files.  Please download a browser that supports Web Archive, such as Microsoft Internet Explorer.

------=_NextPart_86ab7b61_9054_45ca_a3a6_49bc8ebc61db
Content-Location: file:///C:/86ab7b61_9054_45ca_a3a6_49bc8ebc61db/Workbook.html
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="us-ascii"

<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-microsoft-com:office:office" xmlns:x=3D"urn:schemas-microsoft-com:office:excel" xmlns=3D"http://www.w3.org/TR/REC-html40">
<head>
<meta name=3D"Excel Workbook Frameset">

<meta name=3DProgId content=3DExcel.Sheet>
<link rel=3DFile-List href=3D"Worksheets/filelist.xml">

<!--[if gte mso 9]><xml>
 <x:ExcelWorkbook>
  <x:ExcelWorksheets>
   <x:ExcelWorksheet>
snakecharmerb
CalH
  • How did you obtain this file? – snakecharmerb Oct 06 '20 at 06:47
  • The link to the full file is in the question above. It is a financial statement for Amazon, available on the SEC's website. For some reason the SEC used the normal XLS format for a few years, then this HTML/XLS format for a couple of years, and then moved to XLSX for all their financial statements. – CalH Oct 06 '20 at 07:29

1 Answer

This seems to be "Excel-compatible HTML". I do not know of a pure-Python converter, but you could use Excel itself as an external converter: open each file and save it as .xlsx, as described here and copied below. This requires the pywin32 package, which lets Python control Excel through COM.

import win32com.client as win32

fname = "full+path+to+xls_file"  # placeholder: use the absolute path of the .xls file
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(fname)

# FileFormat=51 is xlOpenXMLWorkbook (.xlsx); FileFormat=56 would be xlExcel8 (.xls)
wb.SaveAs(fname + "x", FileFormat=51)
wb.Close()
excel.Application.Quit()
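
Since the question asks about bulk conversion, the single-file snippet above could be extended to a whole folder roughly as follows. This is only a sketch under a few assumptions: it needs Windows with Excel and pywin32 installed, the helper names find_xls_files and convert_folder are made up for illustration, and it assumes all files sit directly in one folder. Setting DisplayAlerts to False is meant to keep Excel from stopping on dialog boxes; whether that covers every prompt for these HTML-based files is not something I have verified.

```python
from pathlib import Path


def find_xls_files(folder):
    """Return the absolute paths of all *.xls files directly inside folder."""
    # Excel's COM interface expects absolute paths, hence resolve().
    return sorted(p.resolve() for p in Path(folder).glob("*.xls"))


def convert_folder(folder):
    """Open every .xls file in folder with Excel and save it as .xlsx.

    Hypothetical helper: requires Windows, an installed Excel, and pywin32.
    """
    # Imported here so the module can still be loaded on machines without pywin32.
    import win32com.client as win32

    excel = win32.gencache.EnsureDispatch("Excel.Application")
    excel.DisplayAlerts = False  # try to avoid blocking dialog boxes
    converted = []
    try:
        for src in find_xls_files(folder):
            dest = src.with_suffix(".xlsx")
            wb = excel.Workbooks.Open(str(src))
            try:
                # 51 is xlOpenXMLWorkbook, i.e. the .xlsx format
                wb.SaveAs(str(dest), FileFormat=51)
            finally:
                wb.Close(SaveChanges=False)
            converted.append(dest)
    finally:
        # Always shut Excel down, even if one file fails to convert.
        excel.Quit()
    return converted
```

Calling convert_folder(r"C:\path\to\folder") (a placeholder path) would then write one .xlsx file next to each .xls file and return the list of new paths. Reusing a single Excel instance for all files avoids starting Excel once per workbook.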
Christian Karcher