1

If a file fed into pandas read_csv is too large, will it raise an exception? What I'm afraid of is that it will just read what it can, say the first 1,000,000 rows, and proceed as if there were no problem.

Do there exist situations in which pandas will fail to read all records in a file but also fail to raise an exception (or print errors)?

teha921
  • It would lead to an out-of-memory (OOM) error. A good read to understand how to handle such cases (e.g. chunked reading, sketched after these comments): https://medium.com/analytics-vidhya/optimized-ways-to-read-large-csvs-in-python-ab2b36a7914e – apoorva kamath Mar 09 '22 at 07:25
  • Thanks, so there aren't any cases where pandas would fail to read the full file yet print no errors? – teha921 Mar 09 '22 at 07:50
  • @teha921 it really depends on how the operating system manages memory. On Linux, you could fill up the entire RAM, then use up the swap space, then the OS would just get stuck waiting for the OOM killer service to terminate some low-priority tasks that use high amounts of memory. Then your pandas program would get SIGKILL and not throw any exceptions (also, finally blocks won't work). [Here's a diagram and more explanations](https://www.kernel.org/doc/gorman/html/understand/understand016.html) – nurettin Mar 09 '22 at 08:35
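
A minimal sketch of the chunked approach mentioned in the comments above, assuming a hypothetical file name and an arbitrary chunk size:

import pandas as pd

# Read the CSV in fixed-size chunks so only one chunk is held in memory
# at a time; the file name and chunk size here are illustrative.
total_rows = 0
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    total_rows += len(chunk)  # replace with real per-chunk processing

print(f"Rows read: {total_rows}")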

2 Answers

0

If you have a large dataset and you want to read it many times, I recommend using a .pkl (pickle) file.

Or you can use a try/except block to catch errors while reading.

However, if you still want to use a csv file, you can visit this link and find a solution: How do I read a large csv file with pandas?
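
A rough sketch combining both ideas, with hypothetical file names; note that catching MemoryError only helps when Python itself raises it, not when the OS kills the process outright as described in the comments above:

import os
import pandas as pd

csv_path = 'data.csv'   # hypothetical source file
pkl_path = 'data.pkl'   # cached binary copy for faster repeated reads

try:
    if os.path.exists(pkl_path):
        df = pd.read_pickle(pkl_path)   # fast on repeated runs
    else:
        df = pd.read_csv(csv_path)      # slow, memory-hungry first read
        df.to_pickle(pkl_path)          # cache for next time
except MemoryError:
    print("File too large to load at once; consider chunked reading or dask.")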

VolkanM
0

I'd recommend using dask, which is a high-level library that supports parallel computing.

You can easily import all your data, but it won't be loaded into memory:

import dask.dataframe as dd

# Lazily define the CSV read; no data is loaded into memory yet.
df = dd.read_csv('data.csv')
df

and from there, you can compute only the selected columns/rows you are interested in:

# Select only the needed columns and rows, then materialize them in memory.
df_selected = df[columns].loc[indices_to_select]
df_selected.compute()
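
For instance, an aggregation can be computed without ever materializing the full table (the file and column names here are hypothetical):

import dask.dataframe as dd

# Nothing is read yet; dask only builds a task graph.
df = dd.read_csv('data.csv')
mean_per_group = df.groupby('some_column')['value'].mean()

# .compute() streams through the CSV partition by partition.
print(mean_per_group.compute())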
Reda E.