
I have a large .xlsx file with 1 million rows. I don't want to open the whole file in one go. I was wondering if I can read a chunk of the file, process it and then read the next chunk? (I prefer to use pandas for it.)

Adel

2 Answers


UPDATE: 2019-09-05

The chunksize parameter has been deprecated because it wasn't actually used by pd.read_excel(): due to the nature of the XLSX file format, the whole file is read into memory during parsing.

There are more details about that in this great SO answer...


OLD answer:

You can use the read_excel() method with the chunksize parameter:

import pandas as pd

chunksize = 10**5
for chunk in pd.read_excel(filename, chunksize=chunksize):
    # process the `chunk` DataFrame here

If your Excel file has multiple sheets, take a look at bpachev's solution.

MaxU - stand with Ukraine
    I get with pandas v0.19.2: `NotImplementedError: chunksize keyword of read_excel is not implemented` – mikal94305 Jun 01 '17 at 04:31
  • Same for pandas `0.20.2` – iipr Nov 14 '17 at 16:47
  • 1
    @iipr, [it was removed in Pandas 0.17.0](https://stackoverflow.com/a/45346340/5741205) – MaxU - stand with Ukraine Nov 14 '17 at 16:53
  • 1
    Oh I see. @MaxU Is there another alternative for reading an Excel using chunking? – iipr Nov 14 '17 at 16:59
  • @iipr, I'll try to implement it, but I'm not sure it will be fast. Why do you need it? Does your data set fit in memory? – MaxU - stand with Ukraine Nov 14 '17 at 17:04
  • @MaxU thank you for your help. Ideally it should be an argument of `pd.read_excel` like for `pd.read_csv` (unless I'm missing something, which could be the case). My current dataset does fit in memory, but because I'm currently using a pc with 32GB, which doesn't always happen. I want to use it to move data contained in Excels to HDFs. – iipr Nov 14 '17 at 17:17
  • 1
    @iipr, it's not that easy, as both standard engines Pandas is using - `xlrd` and `openpyxl` are reading the whole file in memory first. So I think you would need to convert it to CSV file first and then you can read it in chunks... – MaxU - stand with Ukraine Nov 14 '17 at 17:55
  • 1
    @MaxU that is exactly what I tried yesterday: save the Excel as CSV and then read it. In general, Excels will not be too big as for to not fit in memory, so its fine I guess. The only thing is to be careful with the conversion to avoid data leaks/mistakes. Thanks! – iipr Nov 15 '17 at 14:52

Yes. Pandas supports chunked reading. You would go about reading an Excel file like so:

import pandas as pd

xl = pd.ExcelFile("myfile.xlsx")
for sheet_name in xl.sheet_names:
    reader = xl.parse(sheet_name, chunksize=1000)
    for chunk in reader:
        # process `chunk` here
bpachev