I'm using Polars' scan_csv() and fetch() to process multiple CSV files in a directory, but I keep getting a memory error. The CSV files are quite large, and I only need to process a small portion of the data. How can I avoid this memory error and process the data more efficiently? Here is my code so far:
import os
import polars as pl

# Lazily scan every CSV file in the data directory
df = pl.scan_csv(os.path.join("data", "*.csv"))
# Sort by balance so unique() keeps the highest-balance row per account
df = df.sort("account_balance", descending=True)
df = df.unique(subset=["account_id"], keep="first")
# fetch() returns a new DataFrame, so its result has to be assigned
result = df.fetch(1000000, streaming=True)
result.write_csv("df.csv")
The expectation is that the code runs quickly and without any memory errors.
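For reference, here is a streaming variant I've been considering based on my reading of the Polars docs. It's only a sketch: whether sink_csv and the streaming engine support sort/unique on my installed Polars version is an assumption on my part, not something I've confirmed.

import polars as pl

# Build the same lazy query, cap the output, and sink it to disk
lf = (
    pl.scan_csv("data/*.csv")
    .sort("account_balance", descending=True)
    .unique(subset=["account_id"], keep="first")
    .limit(1000000)  # rough stand-in for fetch(1000000); an assumption, not verified
)
# sink_csv executes the query with the streaming engine and writes batches
# to disk, so the full result never has to sit in memory at once
lf.sink_csv("df.csv")

If sink_csv rejects the query, I assume collect(streaming=True) followed by write_csv would be the fallback, though that still materializes the final result in memory.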