12

I need to export 24 pandas data frames ( 140 columns x 400 rows) to Excel, each into a different sheet.

I am using pandas’ built-in ExcelWriter. Running 24 scenarios, it takes:

51 seconds to write to an .xls file (using xlwt)

86 seconds to write to an .xlsx file (using XlsxWriter)

141 seconds to write to an .xlsm file (using openpyxl)

21 seconds to just run the program (no Excel output)

The problem with writing to .xls is that the spreadsheet contains no formatting styles, so if I open it in Excel, select a column, and click on the ‘comma’ button to format the numbers, it tells me: ‘style comma not found’. I don’t get this problem writing to an .xlsx, but that’s even slower.

Any suggestions on how to make the exporting faster? I can’t be the first one to have this problem, yet after hours of searching forums and websites I haven’t found any definite solution.

The only thing I can think of is to use Python to export to csv files, and then write an Excel macro to merge all the CSVs into a single spreadsheet.

The .xls file is 10 MB, and the .xlsx 5.2 MB

Thanks!

Awtszs
  • 323
  • 1
  • 8
  • 33
Pythonista anonymous
  • 8,140
  • 20
  • 70
  • 112
  • I think that the difference in size stems from excel and the different file formats and not from the writing method. In addition - you don't need an excel macro to merge csvs into a spreadsheet, writing a script with xlwt is fairly straight forward. – Korem Sep 16 '14 at 08:38
  • Yes, of course, the difference in size depends on the compression embedded in the xlsx format. That's not the issue. The issue is that taking 60 seconds just to write 5 MB of data seems excessive. Given that Python packages are so slow writig pandas data frames to Excel, would they be quicker writing csv to Excel? I haven't had time to test it yet but I doubt it. My hope is that merging the CSVs using Excel and VBA would have been faster. It's a bit frustrating that something as banal as exporting a few MBs of data takes so long... – Pythonista anonymous Sep 16 '14 at 08:43
  • So, I have run some more tests: loading the 24 CSVs and putting them onto separate sheets of an Excel file takes: Excel VBA macro: 5 seconds. Python script with xlsxWriter: 24 seconds (34 with {'constant_memory': True}, weird). Maybe the best option is to run the VBA macro from Python, as in these examples: http://stackoverflow.com/questions/2141967/using-python-to-program-ms-office-macros http://stackoverflow.com/questions/441758/driving-excel-from-python-in-windows – Pythonista anonymous Sep 16 '14 at 09:43
  • I don't have the code with me now, but I've written styles/formats with xlwt. I can look it up if you're interested. Also, try here for more info on xlwt: https://groups.google.com/forum/#!forum/python-excel – JohnE Sep 16 '14 at 12:26
  • Here is a [benchmark for different Python to Excel modules](https://gist.github.com/jmcnamara/ba25c2bf4ba0777065eb). Make sure to run it on your target machine since it will be affected by Python/module versions, CPU, RAM and Disk I/O. – jmcnamara Sep 16 '14 at 12:50
  • @jmcnamara , thanks - very useful. However, I understand PyExcelerate doesn't support Pandas, and most likely won't, at least not in the near future (https://github.com/pydata/pandas/pull/undefined). Maybe I could use it to merge the CSVs into an Excel, but at this point I'm inclined to do that via a VBA macro invoked from Python, also because formatting the sheets will presumably be easier that way. – Pythonista anonymous Sep 16 '14 at 13:07
  • Correct, PyExcelerate isn't supported in Pandas. I made an [initial attempt](https://github.com/pydata/pandas/pull/5128#issuecomment-31095852) at porting it in but it didn't work out. I'm going to post the benchmark as an answer. If nothing else it will help someone else see the differences so they can make an informed decision. – jmcnamara Sep 16 '14 at 13:16

2 Answers2

11

Here is a benchmark for different Python to Excel modules.

And here is the output for 140 columns x (400 x 24) rows using the latest version of the modules at the time of posting:

Versions:
    python      : 2.7.7
    openpyxl    : 2.0.5
    pyexcelerate: 0.6.3
    xlsxwriter  : 0.5.7
    xlwt        : 0.7.5

Dimensions:
    Rows = 9600 (400 x 24)
    Cols = 140

Times:
    pyexcelerate          :  11.85
    xlwt                  :  17.64
    xlsxwriter (optimised):  21.63
    xlsxwriter            :  26.76
    openpyxl   (optimised):  95.18
    openpyxl              : 119.29

As with any benchmark the results will depend on Python/module versions, CPU, RAM and Disk I/O and on the benchmark itself. So make sure to verify these results for your own setup.

Also, since you asked specifically about Pandas, please note that PyExcelerate isn't supported.

jmcnamara
  • 38,196
  • 6
  • 90
  • 108
  • I thought pyexcelerate was obsolete and had been subsumed into xlwt many years ago. Is this a new version and/or unrelated to old pyexcelerate? – JohnE Sep 16 '14 at 14:19
  • 1
    That was PyExcelerator. PyExcelerate is a different unrelated module. It is confusing. – jmcnamara Sep 16 '14 at 17:14
1

For what it's worth, this is how I format the output in xlwt. The documentation is (or at least was) pretty spotty so I had to guess most of this!

import xlwt

style = xlwt.XFStyle()
style.font.name = 'Courier'
style.font.height = 180
style.num_format_str = '#,##0'

# ws0 is a worksheet
ws0.write( row, col, value, style )

Also, I believe I duplicated your error message when attempting to format the resulting spreadsheet in excel (office 2010 version). It's weird, but some of the drop down tool bar format options work and some don't. But it looks like they all work fine if I go to "format cells" via a right click.

JohnE
  • 29,156
  • 8
  • 79
  • 109