0

I know how to read .xls files with pandas. However, it loads all the data at once. I want to load data on demand: I want a generator that returns the next row each time it is iterated. See this question for general files.

I know openpyxl can do this, following this webpage. However, it doesn't support old .xls files. It recommends using xlrd instead, but I don't know how to do what I want with that package.

The documentation explains how to do that sheet by sheet, but not row by row (my file has only one sheet).

  • A pandas DataFrame has a built-in generator called *iterrows()* which is probably what you need – DarkKnight Sep 17 '22 at 10:55
  • I checked with my data: the `xlrd.open_workbook` output occupies 48 bytes, while the `pandas.read_excel` output takes 5,361 bytes. The test Excel file is 32,256 bytes. I'm still wondering whether `xlrd` is already doing "lazy reading", given the steps I need to access the data. But judging by the sizes, I would use `xlrd`. – Abel Gutiérrez Sep 17 '22 at 15:34
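
For reference, the row-by-row generator asked about above can be sketched with xlrd roughly as follows. This is a minimal sketch, not a confirmed answer from the thread: `iter_sheet_rows` and `iter_xls_rows` are hypothetical helper names, and the first sheet is assumed to be the one of interest. One caveat: as far as I know, `on_demand=True` only postpones loading each sheet until it is requested, and xlrd then parses that sheet completely, so this gives the generator interface but probably not row-level memory savings.

```python
from typing import Any, Iterator, List


def iter_sheet_rows(sheet: Any) -> Iterator[List[Any]]:
    """Yield the cell values of each row of an xlrd sheet, one row per iteration."""
    for row_index in range(sheet.nrows):
        yield sheet.row_values(row_index)


def iter_xls_rows(path: str) -> Iterator[List[Any]]:
    """Open an .xls file and yield the rows of its first sheet."""
    import xlrd  # third-party; imported here so iter_sheet_rows works without it

    # on_demand=True defers loading a sheet until it is actually requested.
    book = xlrd.open_workbook(path, on_demand=True)
    try:
        yield from iter_sheet_rows(book.sheet_by_index(0))
    finally:
        # Free the workbook's resources once iteration ends or is abandoned.
        book.release_resources()
```

Usage would look like `rows = iter_xls_rows("sample.xls")` followed by `next(rows)` or a `for` loop, where `sample.xls` is a placeholder file name.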

2 Answers

2

Pandas doesn't support lazy loading; it reads the file and keeps everything in memory.

Polars, an alternative to pandas, supports lazy loading.
Unfortunately, this isn't yet implemented for .xls files.

One solution is to convert the Excel file to CSV and use the `scan_csv` function.

import polars as pl
pl.scan_csv("sample.csv")
<polars.internals.lazyframe.frame.LazyFrame object at 0x7f0ae95d1c00>
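
If the goal is only a generator that returns the next row on each iteration, the converted CSV can also be read lazily with the standard library alone. A minimal sketch, assuming the converted file is UTF-8 encoded; `iter_csv_rows` is a hypothetical helper name, not part of any library:

```python
import csv
from typing import Iterator, List


def iter_csv_rows(path: str) -> Iterator[List[str]]:
    """Yield one row (a list of strings) at a time from a CSV file."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="", encoding="utf-8") as handle:
        # csv.reader pulls lines from the file as needed, so only the
        # current row is held in memory.
        for row in csv.reader(handle):
            yield row
```

For example, `rows = iter_csv_rows("sample.csv")` followed by `next(rows)` returns the first row. Every value comes back as a string, so numeric columns would need converting by the caller.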
  • That's a solution, although I don't know if it's worth it. I mean, I don't want to store the `.csv` file, so the algorithm would be like write-read-delete and the file would use some space in the disk. Although this isn't a problem for my data. – Abel Gutiérrez Sep 17 '22 at 15:28
0

You can convert a DataFrame to a LazyFrame:

import polars as pl
# df is a polars DataFrame that has already been loaded
dflazy = df.lazy()
dflazy
Andreas Violaris