
I am pretty new to Python and I was going through some of the uses of the pandas library. However, I could not find a way to load only part of an Excel file into memory and work with it. For example, if I set the memory limit to 1MB, the program should be able to read the first 1MB of an Excel file that is larger than 1MB.

From the answer mentioned here, I see an option to load a certain number of rows. But I would not know the number of rows in the input file in advance. Also, I do not know how many bytes of data have been read by this code.

Is there a way to load rows iteratively, where the number of bytes read in each iteration can be calculated and cumulatively summed up?

Ravi

1 Answer


1.) conversion factor

"Taste" some example data near the head of the worksheet, compute an average of how many bytes per row, then use that to predict how many rows fit in your memory budget.
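A minimal sketch of that conversion-factor idea, using a couple of synthetic dict rows to stand in for rows "tasted" from the head of your worksheet (the function name and sample data are illustrative, not from any library):

```python
from sys import getsizeof

def estimate_row_budget(sample_rows, budget_bytes):
    """Predict how many rows fit in budget_bytes, from a small sample.

    sample_rows: a few dict rows "tasted" from the head of the sheet.
    """
    total = sum(
        sum(map(getsizeof, row.values())) + sum(map(getsizeof, row.keys()))
        for row in sample_rows
    )
    bytes_per_row = total / len(sample_rows)  # the conversion factor
    return int(budget_bytes / bytes_per_row)

# Synthetic stand-in for rows read from the worksheet head:
sample = [{"name": "alice", "qty": 3}, {"name": "bob", "qty": 14}]
row_budget = estimate_row_budget(sample, 1_000_000)
```

Once you have `row_budget`, it can feed straight into a `nrows=`-style partial read.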

2.) polars

The polars project has a heavy emphasis on "use less RAM!" and on rapid I/O. A convenient .to_pandas() method makes it trivially easy to convert a polars DataFrame to your favorite format. Consider doing the filtering in polars and handing off the result to pandas, formatted as the rest of your app expects it.

3.) generator

For CSV this is easy, and definitely won't do extra allocations. For other formats we might do an allocation for the entire sheet, but then we can definitely avoid Pandas allocations for unwanted rows.
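For the CSV case, a minimal sketch of the same budgeted generator, using an in-memory `StringIO` to stand in for an open file handle (the data and helper name are illustrative):

```python
import csv
import io
from sys import getsizeof

def read_initial_csv(budget, lines):
    """Yield csv.DictReader rows until the size estimate exceeds the budget."""
    size = 0
    for row in csv.DictReader(lines):
        size += (sum(map(getsizeof, row.values()))
                 + sum(map(getsizeof, row.keys())))
        if size > budget:
            break
        yield row

# In-memory stand-in for open("data.csv"):
text = io.StringIO("name,qty\nalice,3\nbob,14\ncarol,9\n")
rows = list(read_initial_csv(10_000, text))
```

A generous budget admits every row; a tiny budget stops the reader almost immediately, without pandas ever seeing the rejected rows.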

We will use a dict reader, plus a generator for early termination.

from pathlib import Path
from sys import getsizeof

import openpyxl_dictreader
import pandas as pd

def read_initial(budget: int, filespec: Path, sheet: str):
    """Yield rows until the estimated bytes consumed would exceed the budget."""
    size = 0
    reader = openpyxl_dictreader.DictReader(filespec, sheet,
                                            read_only=True, data_only=True)
    for row in reader:
        size += (sum(map(getsizeof, row.values()))
               + sum(map(getsizeof, row.keys())))
        if size > budget:
            break
        yield row

# filespec (a Path to your .xlsx) and sheet (a worksheet name) come from you:
df = pd.DataFrame(read_initial(1_000_000, filespec, sheet))

Feel free to use a fancier cost estimate if the accuracy of getsizeof isn't to your liking; note that getsizeof is shallow and does not recurse into containers.
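Since sys.getsizeof only measures the container itself, one possible fancier estimate is a recursive helper along these lines (a sketch, not a library function):

```python
from sys import getsizeof

def deep_getsizeof(obj, seen=None):
    """Rough total of getsizeof over an object and its contents.

    Tracks ids already visited so shared objects aren't double-counted.
    """
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size
```

Swapping this in for the bare getsizeof calls above would tighten the per-row estimate at a small CPU cost.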

Consider converting *.xlsx files to a more stream-friendly format like .csv.

We prefer the read_only=True keyword arg so we only consume constant memory despite large file size.

openpyxl cannot evaluate formulas, so if you essentially wished the Excel file was a CSV file, supply the data_only=True kwarg: it returns the cached cell values instead of the formula text.

J_H
  • do you know if this line of code `reader = openpyxl_dictreader.DictReader(filespec, sheet)` loads the entire file into the memory? – Ravi Mar 03 '23 at 17:08
  • got your point about early pruning of the rows, but my doubt was more on loading the file initially which is `reader = openpyxl_dictreader.DictReader(filespec, sheet)`. Would this not load the entire sheet into the memory? – Ravi Mar 06 '23 at 05:14
  • You can choose to load the entire sheet into memory if you wish. Or you can go with the recommended `read_only=True` setting, which uses a fixed buffer allocation, independent of file size. – J_H Mar 06 '23 at 05:47
  • Got it, thanks @J_H. May I know what is the default buffer allocation size and also if there is a way to change it? – Ravi Mar 06 '23 at 06:16
  • Show us your memory measurements. What RAM constraint do you run within, and under different conditions what allocated amounts are recorded by your benchmark code? – J_H Mar 06 '23 at 06:29
  • as I mentioned, I am new to python and I have not written any benchmark code for this yet, but I run with a mere 2GB RAM and that is where memory is of primary concern for us. I will research how to write benchmark code in python and try to do so – Ravi Mar 06 '23 at 06:41
  • tried this and I observed that it is printing the underlying cell formula rather than the cell value. May I know how to get the cell value for the formula based cells? – Ravi Mar 06 '23 at 10:01
  • You don't _have_ to learn how to read [documentation](https://openpyxl.readthedocs.io/en/stable/api/openpyxl.reader.excel.html#openpyxl.reader.excel.load_workbook). But you may find it a valuable skill to acquire. The truth is out there. Go find it. – J_H Mar 06 '23 at 17:38