
I have a huge pandas DataFrame: 42 columns, 19 million rows, and mixed dtypes. I load this DataFrame from a csv file into JupyterLab. Afterwards I do some operations on it (adding more columns) and write it back to a csv file. A lot of the columns are int64, and in some of these columns many rows are empty. Do you know a technique / specific dtype I can apply to the int64 columns to reduce the size of the DataFrame, save memory, and write it back to a smaller csv file more efficiently?

Could you provide me with an example of the code? (For columns containing only strings I already changed the dtype to 'category'.) Thank you

Bsleon

1 Answer


If I understand your question correctly, the issue is the size of the csv file when you write it back to disk.

A csv file is just a text file, and as such the columns aren't stored with dtypes. It doesn't matter what you change your dtype to in pandas; it will be written back as characters. This makes csv very inefficient for storing large amounts of numerical data.
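You can see this with a quick round trip (a minimal sketch; the file name and column are made up):

    import pandas as pd

    # The in-memory dtype has no effect on what ends up in the csv.
    df = pd.DataFrame({"a": [1, 2, 3]})

    small = df.astype("int8")           # 1 byte per value in memory...
    small.to_csv("demo.csv", index=False)

    back = pd.read_csv("demo.csv")      # ...but the csv is plain text,
    print(back.dtypes)                  # and read_csv infers int64 again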

If you don't need it as a csv for some other reason, try a different file type such as parquet. (I have found this to reduce my file size by 10x, but it depends on your exact data.)
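A minimal sketch of the parquet route (file names are placeholders; to_parquet needs pyarrow or fastparquet installed):

    import pandas as pd

    df = pd.read_csv("big_input.csv")

    # ... add your new columns here ...

    # Parquet stores dtypes and compresses the data (snappy by default),
    # so 'category' and downcast integer columns actually pay off on disk.
    df.to_parquet("big_output.parquet")

    # Reading it back restores the dtypes, including 'category'.
    df = pd.read_parquet("big_output.parquet")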

If you're specifically looking to convert dtypes, see Change column type in pandas, though as mentioned, this won't help your csv file size.
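That said, for int64 columns with empty rows specifically, pandas' nullable integer dtypes are worth a look: they keep missing values in an integer column instead of forcing it to float64, and the smaller variants cut in-memory size (parquet will also preserve them on disk). A rough sketch, with made-up column names, assuming a reasonably recent pandas:

    import pandas as pd

    df = pd.read_csv("big_input.csv")

    # 'Int64' (capital I) is the nullable integer dtype: it holds pd.NA
    # for the empty rows instead of converting the column to float64.
    df["some_int_col"] = df["some_int_col"].astype("Int64")

    # If the values fit, a smaller nullable type reduces memory further.
    df["some_int_col"] = df["some_int_col"].astype("Int32")

    # For columns without missing values, let pandas pick the smallest type.
    df["full_col"] = pd.to_numeric(df["full_col"], downcast="integer")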

s_pike