This is a common issue in pandas, and it's relevant not only to CSV but to all datasets.
When dealing with very large CSV files (or Parquet, etc.) and running out of memory, you can mitigate the problem in a few ways:
- As mentioned above, read the CSV files in chunks: you can use the `chunksize` parameter of pandas' `read_csv()` to read the file in smaller pieces, so you never load the entire file into memory at once (see the first sketch below).
- If you have it available, use a Dask DataFrame: Dask is a parallel computing library that can handle datasets that don't fit into memory. `dask.dataframe` mirrors the pandas API, but it can work on datasets that are too large to fit into memory (see the Dask sketch below).
- Use JupySQL along with DuckDB: this lets you read the data only when you need it, instead of loading everything into memory. Here is a recent tutorial about how to do it (a minimal DuckDB sketch is also included below).
- Leverage your database: if your dataset is too large to fit into memory, you can always store it in a database and use SQL to join and manipulate the data (see the SQL sketch below). This might be overkill if you don't already have a database up and running.
- Scale your machine via a cloud-based instance: you can use a cloud provider such as Amazon Web Services or Google Cloud Platform to store and manipulate your data. This might be too much, especially if you're only running ad-hoc analyses locally.
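
A minimal sketch of the chunked-reading approach; the file name and the `category`/`amount` columns are placeholders for your own data:

```python
import pandas as pd

# Read the CSV in chunks of 100,000 rows instead of all at once.
# Each chunk is a regular DataFrame, so you can aggregate incrementally.
totals = None
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    partial = chunk.groupby("category")["amount"].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals)
```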
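
The same aggregation as a Dask sketch (same placeholder names); note that Dask is lazy, so nothing is read until `.compute()` is called:

```python
import dask.dataframe as dd

# dask.dataframe mirrors the pandas API but splits the file into partitions,
# so each partition is processed without loading the whole CSV into memory.
df = dd.read_csv("large_file.csv")
result = df.groupby("category")["amount"].sum().compute()  # triggers the actual work

print(result)
```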
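
For the JupySQL/DuckDB option, a minimal sketch using the `duckdb` Python package directly (JupySQL adds `%sql` magics on top of the same engine); file and column names are again placeholders:

```python
import duckdb

# DuckDB scans the CSV lazily and only materializes the query result,
# so the full file never has to live in memory as a DataFrame.
result = duckdb.query("""
    SELECT category, SUM(amount) AS total
    FROM 'large_file.csv'
    GROUP BY category
""").to_df()

print(result)
```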
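
For the database option, a sketch using SQLite from the standard library (any database with a DB-API or SQLAlchemy driver works the same way); the `warehouse.db` file and `sales` table are hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Let the database do the heavy joining/aggregation and only pull the
# (much smaller) result into pandas. read_sql also accepts chunksize.
query = """
    SELECT category, SUM(amount) AS total
    FROM sales
    GROUP BY category
"""
result = pd.read_sql(query, conn)
conn.close()

print(result)
```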