
I'm using Polars' scan_csv() and fetch() to process multiple CSV files in a directory, but I keep getting a memory error. The CSV files are quite large, and I only need to process a small portion of the data. How can I avoid this memory error and process the data more efficiently? Here is my code so far:

import os
import polars as pl

df = pl.scan_csv(os.path.join("data", "*.csv"))
df = df.sort("account_balance", descending=True)
df = df.unique(subset=["account_id"], keep="first")
df.fetch(1000000, streaming=True) 

df.write_csv("df.csv")

The expectation is for the code to run quickly and without any memory errors.
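
Edit: for reference, this is the shape of what I think the streaming version should look like. It's untested, and collect() would still build the full result in memory, so I'm not sure it actually avoids the error. (I realize fetch() returns a new DataFrame rather than modifying df in place, so the write_csv() call above is still being made on the LazyFrame.)

import os
import polars as pl

# Same lazy pipeline as above: highest balance first, then one row per account.
df = pl.scan_csv(os.path.join("data", "*.csv"))
df = df.sort("account_balance", descending=True)
df = df.unique(subset=["account_id"], keep="first")

# Execute with the streaming engine and write the result to CSV.
result = df.collect(streaming=True)
result.write_csv("df.csv")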

Timus
    The code you wrote must load everything into memory. There's no other way to order *all* rows from all files and *then* get the first row per account. Polars isn't a database and can't store intermediate results the way a database would. Nor can it know whether an account appears in one file or all, so it *must* load all of them – Panagiotis Kanavos Apr 06 '23 at 10:12
  • Where is the error happening, the `.fetch` or the `.write_csv`? Can you try replacing them with `df.sink_parquet("df.parquet")` and see if it succeeds? – jqurious Apr 06 '23 at 13:55
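
A minimal sketch of the sink_parquet() variant suggested in the comment above, keeping the file names and columns from the question (whether the sort/unique combination can run fully within the streaming engine depends on the Polars version):

import os
import polars as pl

# Same lazy pipeline as in the question.
lf = pl.scan_csv(os.path.join("data", "*.csv"))
lf = lf.sort("account_balance", descending=True)
lf = lf.unique(subset=["account_id"], keep="first")

# sink_parquet() runs the query with the streaming engine and writes the
# result straight to disk instead of collecting it into a DataFrame first.
lf.sink_parquet("df.parquet")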

0 Answers