1.) conversion factor
"Taste" some example data near the head of the worksheet,
compute an average of how many bytes per row,
then use that to predict how many rows fit in your memory budget.
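Here is a minimal sketch of that approach, assuming the data is an .xlsx sheet
read with openpyxl; the function name rows_that_fit, the sample size, and the
byte budget are all made up for illustration.

from sys import getsizeof

from openpyxl import load_workbook


def rows_that_fit(filespec, sheet, budget_bytes=1_000_000, sample_size=100):
    # Sample the head of the worksheet and estimate bytes per row.
    wb = load_workbook(filespec, read_only=True, data_only=True)
    ws = wb[sheet]
    sampled, total = 0, 0
    for row in ws.iter_rows(max_row=sample_size, values_only=True):
        total += sum(getsizeof(cell) for cell in row)
        sampled += 1
    wb.close()
    avg_bytes_per_row = max(total / max(sampled, 1), 1)  # guard against an empty sample
    # Conversion factor: predicted number of rows that fit in the budget.
    return int(budget_bytes / avg_bytes_per_row)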
2.) polars
The polars project
has a heavy emphasis on "use less RAM!" and on rapid I/O.
A convenient .to_pandas()
method makes it trivially easy
to convert a polars DataFrame to your favorite format.
Consider doing the filtering in polars and handing off
the result to pandas, formatted as the rest of your
app expects it, as sketched below.
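Here is a rough sketch of that hand-off, assuming the data is already available
as CSV (polars can also read Excel via pl.read_excel when one of its optional
engines is installed); the path and the "region" column are invented for
illustration, and .to_pandas() requires pandas (and typically pyarrow).

import polars as pl


def load_filtered(path: str):
    lazy = (
        pl.scan_csv(path)                      # lazy scan: nothing is read yet
        .filter(pl.col("region") == "EMEA")    # do the filtering in polars
        .head(50_000)                          # cap the row count up front
    )
    return lazy.collect().to_pandas()          # hand the result off to pandas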
3.) generator
For CSV this is easy, and won't do any extra mallocs.
For other formats we might allocate the entire sheet once,
but we can still avoid pandas allocations for unwanted rows.
We will use a dict reader,
plus a generator for early termination.
from pathlib import Path
from sys import getsizeof

import openpyxl_dictreader
import pandas as pd


def read_initial(budget: int, filespec: Path, sheet: str):
    """Yield rows until their estimated in-memory size exceeds the byte budget."""
    size = 0
    reader = openpyxl_dictreader.DictReader(
        filespec, sheet, read_only=True, data_only=True)
    for row in reader:
        size += (sum(map(getsizeof, row.values()))
                 + sum(map(getsizeof, row.keys())))
        if size > budget:
            break
        yield row


# filespec and sheet are assumed to be defined earlier in your program.
df = pd.DataFrame(read_initial(1_000_000, filespec, sheet))
Feel free to use a fancier cost estimate
if the accuracy of this getsizeof sum isn't to your liking;
getsizeof is shallow and won't account for objects
referenced inside containers.
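If you do want a deeper estimate, here is one rough possibility, a hand-rolled
recursive sizeof rather than anything from the standard library; treat its
numbers as approximations.

from sys import getsizeof


def deep_sizeof(obj, _seen=None):
    # Approximate size of obj plus the objects its containers reference.
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:          # avoid double-counting shared objects
        return 0
    _seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, _seen) + deep_sizeof(v, _seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, _seen) for item in obj)
    return size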
Consider converting *.xlsx files to a more stream-friendly format like .csv
(a sketch of one way to do that appears at the end of this answer).
We prefer the
read_only=True
keyword arg so we consume only constant memory despite a large file size.
Since openpyxl won't evaluate formulas for you, if you essentially wish
the Excel file were a CSV file, then supply a
data_only=True
kwarg so you get the cell values Excel cached rather than the formula text.
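If you do go the conversion route, a simple streaming pass with openpyxl's
read-only mode keeps memory roughly constant; the function name and output
path below are just placeholders.

import csv

from openpyxl import load_workbook


def xlsx_to_csv(filespec, sheet, out_path):
    # Stream rows out of the workbook and write them straight to CSV.
    wb = load_workbook(filespec, read_only=True, data_only=True)
    ws = wb[sheet]
    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(row)
    wb.close()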