3

can't copy xlsx and xlsm files with xlrd as it says "formatting info= True" not yet implemented and openpyxl runs out of memory when doing the following:

import xlrd
from xlutils.copy import copy
from openpyxl import load_workbook

if file_format == 'xls':
        input_file=xlrd.open_workbook(input_file_location,formatting_info=True)
        wb = copy(input_file)
        wb.save(destination_file)
if file_format == 'xlsx' or file_format == 'xlsm' :
        input_file  =  load_workbook(input_file_location,data_only=True)
        input_file.save(destination_file)
jmcnamara
  • 38,196
  • 6
  • 90
  • 108
gaurav gurnani
  • 2,789
  • 3
  • 19
  • 18
  • Try opening the file with `read_only=True` as well – Anand S Kumar Jul 07 '15 at 12:10
  • I did, but then i can't save the file ( it says, file cannot be saved as it is in read only mode). The ultimate goal is to create a copy of the existing file having all the formulas removed. – gaurav gurnani Jul 07 '15 at 12:15
  • yep, i cannot use write_only=True inside load_workbook(). it is an invalid argument here. I believe, it can be used only when we create new workbooks like wb = Workbook(write_only=True) – gaurav gurnani Jul 07 '15 at 12:27

1 Answers1

3

If you just want to copy the file then just do that using shutil. There are still several things that openpyxl doesn't support such as images and charts that will be lost. And, as you're seeing, memory is an issue. Each cell uses about 45 kB of memory.

The openpyxl documentation is pretty clear about the different options used when opening workbooks: data_only only read the results of any formulae and ignore the formulae.

See https://bitbucket.org/openpyxl/openpyxl/issue/171/copy-worksheet-function if you want to copy worksheets.

Otherwise you can use two workbooks, one in read-only mode and the other in write-only mode. But if you want to copy, this is best done in the file system.

If you only want to copy the values from one workbook to another then you can combine read-only and write-only modes to reduce the memory footprint. The following pseudo-code should give you a basis.

wb1 = load_workbook("file.xlsx", read_only=True, data_only=True)
wb2 = Workbook(write_only=True)
for ws1 in wb1:
    ws2 = wb2.create_sheet(ws1.title)
    for r1 in ws1:
        ws2.append(r1) # this might need unpacking to use values
wb2.save("copy.xlsx")
Charlie Clark
  • 18,477
  • 4
  • 49
  • 55
  • I can't use shutil here because I wan't all the formula's removed. i don't mind losing images and charts so openpyxl was a good option only if it didn't run out of memory for large files. The code above works fine for files of size 200MB but anything above that, it goes out of memory. The data only mode gets rid of the formulas but i don't know how i should solve the memory issue. Can you tell me what exactly you mean by having one workbook open in read only and the other in write only mode? – gaurav gurnani Jul 07 '15 at 12:40
  • Ah, it wasn't clear from your question that you only want the data. – Charlie Clark Jul 07 '15 at 13:17
  • Your suggested workaround did work for me, but it takes forever to load a 100 MB file.So, can't use it as I have to work on much larger files. I wish it was much faster :). – gaurav gurnani Jul 10 '15 at 18:24
  • A 100 MB Excel is a huge file with probably millions of cells. openpyxl is pretty fast at reading but converting individual bits of XML to Python objects and back to XML is bound to take some time. For such large files it might indeed be faster to extract the relevant XML files and strip the formulae out of them using something like `sed` and updating the archive. You could also try using a streaming XML parser but I suspect it wouldn't be that much faster than openpyxl. The truth is that Excel is not really the best file format for exchanging large amounts of data but it is ubiquitous. – Charlie Clark Jul 11 '15 at 09:16
  • Pandas.read_excel and then Dataframe.to_excel worked for me :) but it requires me to read sheet by sheet as it breaks on sheets containing charts and graphs. OpenPyxl is excellent for smaller files :) so using a mix of both – gaurav gurnani Jul 12 '15 at 12:19
  • BTW, Thanks for all the suggestions. – gaurav gurnani Jul 12 '15 at 12:33
  • Pandas is doing the same thing as I suggested but using different libraries. xlrd is faster, and uses less memory than openpyxl when reading files but support of the file format is limited. openpyxl works fine for larger files as noted above. – Charlie Clark Jul 12 '15 at 17:07