
I am trying to parse several kinds of very large spreadsheet files (.csv, .xlsx, .xls).

Working (.csv/.xlsx) flows

.csv is chunkable by using pandas.read_csv(file, chunksize=chunksize)
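For reference, the .csv flow can be sketched roughly as follows. The helper name `iter_csv_rows` and the tiny in-memory sample are illustrative assumptions, not part of my real code; only `pandas.read_csv(..., chunksize=...)` is the call mentioned above.

```python
import io

import pandas as pd


def iter_csv_rows(file_obj, chunksize=2):
    """Yield rows as dicts while holding at most `chunksize` rows in memory."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of one large DataFrame.
    for chunk in pd.read_csv(file_obj, chunksize=chunksize):
        for record in chunk.to_dict(orient="records"):
            yield record


# Hypothetical sample data standing in for a large uploaded file.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
csv_rows = list(iter_csv_rows(sample, chunksize=2))
```

In real use the rows would be processed one at a time rather than collected into a list.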

.xlsx is chunkable by unzipping it and parsing the inner .xml files with lxml.etree.iterparse(zip_file.open('xl/worksheets/sheet1.xml')) and lxml.etree.iterparse(zip_file.open('xl/sharedStrings.xml')), then performing additional operations on the parsed results.
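A simplified sketch of that .xlsx flow, assuming the sheet lives at xl/worksheets/sheet1.xml and ignoring dates, formulas, inline strings and empty cells. The function names and the in-memory demo workbook are hypothetical; the fallback to xml.etree.ElementTree is only there so the sketch runs where lxml is not installed.

```python
import io
import zipfile

try:
    from lxml import etree
except ImportError:  # assumption: the stdlib parser is an acceptable fallback
    import xml.etree.ElementTree as etree

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_shared_strings(zf):
    """Stream xl/sharedStrings.xml into a list indexed by string id."""
    strings = []
    with zf.open("xl/sharedStrings.xml") as fh:
        for _, elem in etree.iterparse(fh, events=("end",)):
            if elem.tag == NS + "si":
                strings.append("".join(t.text or "" for t in elem.iter(NS + "t")))
                elem.clear()  # release the children of the finished element
    return strings


def iter_xlsx_rows(zf, sheet="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of cell values, one row at a time."""
    strings = read_shared_strings(zf)
    with zf.open(sheet) as fh:
        for _, elem in etree.iterparse(fh, events=("end",)):
            if elem.tag != NS + "row":
                continue
            values = []
            for cell in elem.iter(NS + "c"):
                v = cell.find(NS + "v")
                text = v.text if v is not None else None
                if cell.get("t") == "s" and text is not None:
                    text = strings[int(text)]  # resolve shared-string index
                values.append(text)
            yield values
            elem.clear()


def _demo_workbook():
    """Build a tiny two-row workbook skeleton in memory (illustration only)."""
    ns = 'xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/sharedStrings.xml",
                    f"<sst {ns}><si><t>name</t></si><si><t>alice</t></si></sst>")
        zf.writestr("xl/worksheets/sheet1.xml",
                    f'<worksheet {ns}><sheetData>'
                    '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1"><v>42</v></c></row>'
                    '<row r="2"><c r="A2" t="s"><v>1</v></c><c r="B2"><v>7</v></c></row>'
                    "</sheetData></worksheet>")
    buf.seek(0)
    return buf


with zipfile.ZipFile(_demo_workbook()) as zf:
    xlsx_rows = list(iter_xlsx_rows(zf))
```

The shared strings table is still held in memory as a whole; only the sheet rows are streamed.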

Not working (.xls) flow

.xls: I can't find any information on how to split this kind of file into chunks!

Details: My file is a Django TemporaryUploadedFile. I get it from request.data['file'] on a PUT request.

I get the path of the file with request.data['file'].temporary_file_path(), which returns '/tmp/tmpu73gux4m.upload'. (I'm not sure what the *****.upload file is. I guess it's some kind of HTTP file encoding.)

When I try to read it:

file.open()
content = file.read()

the content is a bytes string that looks like b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00...etc.
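As far as I can tell, the first eight bytes shown here (d0 cf 11 e0 a1 b1 1a e1) are the signature of the OLE2 compound file container that legacy Office formats such as .xls use, so the content is binary rather than text in some encoding. A small sketch for sniffing the container type from the head of the upload; the function name and return labels are hypothetical, and the signature alone does not prove the file is a spreadsheet, since other legacy Office formats share the same container.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # legacy Office container (.xls and others)
ZIP_MAGIC = b"PK\x03\x04"                       # zip archive (.xlsx is a zip)


def sniff_container(head: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


# The prefix from the question, used here as sample input.
head = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00"
kind = sniff_container(head)
```

Only the first few bytes are needed, so something like file.read(8) is enough for this check and the whole upload never has to be loaded.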

Question

  1. Is there a way to decode and parse this bytes string?

  2. Ideally, I want to read the .xls file row by row without loading the whole file into RAM at once. Is there a way to do that?

Eugene Kovalev
  • Can't you convert the file into CSV and then use pandas? – Piyush S. Wanare Apr 11 '18 at 09:51
  • Refer to this: https://stackoverflow.com/questions/47455562/loading-excel-file-chunk-by-chunk-with-python-instead-of-loading-full-file-into. – Piyush S. Wanare Apr 11 '18 at 09:53
  • @Piyush S. Wanare no, I can't convert it. Thousands of customers send me `.xls` files. There is no option to inform them about converting their files to `.csv`. Of course, it would be the best option for me but it's impossible. – Eugene Kovalev Apr 11 '18 at 10:07
  • @Piyush S. Wanare the link you posted covers `.xlsx` only. I need information about `.xls`. – Eugene Kovalev Apr 11 '18 at 10:08
