I am trying to parse different kinds of very large Excel files (.csv, .xlsx, .xls).
Working (.csv/.xlsx) flows
.csv is chunkable by using pandas.read_csv(file, chunksize=chunksize)
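A minimal sketch of that flow (the chunk size and the process() helper are just illustrative placeholders):

import pandas as pd

def process(chunk):
    ...  # placeholder for the actual per-chunk work

# read 10,000 rows at a time instead of loading the whole file at once
for chunk in pd.read_csv(file, chunksize=10_000):
    process(chunk)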
.xlsx is chunkable by unzipping it and parsing the inner .xml files with lxml.etree.iterparse(zip_file.open('xl/worksheets/sheet1.xml')) and lxml.etree.iterparse(zip_file.open('xl/sharedStrings.xml')), then performing additional operations afterwards.
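Roughly, a sketch of that flow (namespace handling and cell decoding are simplified here):

import zipfile
from lxml import etree

zip_file = zipfile.ZipFile(file)

# collect the shared strings first; sheet cells reference them by index
shared_strings = []
for _, elem in etree.iterparse(zip_file.open('xl/sharedStrings.xml')):
    if elem.tag.endswith('}t'):
        shared_strings.append(elem.text)
        elem.clear()

# stream the worksheet row by row instead of building the whole XML tree
for _, elem in etree.iterparse(zip_file.open('xl/worksheets/sheet1.xml')):
    if elem.tag.endswith('}row'):
        ...  # resolve cell values via shared_strings, then process the row
        elem.clear()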
Not working (.xls) flow
For .xls I can't find any info on how to split the file into chunks!
Details: My file is a Django TemporaryUploadedFile. I get it from request.data['file'] on a PUT request. I can get the file's path via request.data['file'].temporary_file_path(); it is '/tmp/tmpu73gux4m.upload'. (I'm not sure what the *****.upload file is. I guess it's some kind of HTTP file encoding.)
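For context, the view looks roughly like this (the class name and response are just illustrative, not my exact code):

from rest_framework.views import APIView
from rest_framework.response import Response

class FileUploadView(APIView):  # illustrative name
    def put(self, request, *args, **kwargs):
        # large uploads get spooled to disk as a TemporaryUploadedFile
        file = request.data['file']
        path = file.temporary_file_path()  # e.g. '/tmp/tmpu73gux4m.upload'
        # ... parse the file at `path` here ...
        return Response(status=204)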
When I try to read it:
file.open()
content = file.read()
the content looks like a byte string: b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00'...etc.
Question
Is there any way to decode and parse this byte string?
Ideally, I would like to read the .xls file row by row without loading the whole file into RAM at once. Is there any way to do that?