
I am working with the data from https://opendata.rdw.nl/Voertuigen/Open-Data-RDW-Gekentekende_voertuigen_brandstof/8ys7-d773 (download the CSV file using the 'Exporteer' button).

When I import the data into R using read.csv() it takes 3.75 GB of memory, but when I import it into pandas using pd.read_csv() it takes up 6.6 GB of memory.

Why is this difference so large?

I used the following code to determine the memory usage of the dataframes in R:

library(pryr) 
object_size(df)

and in Python:

df.info(memory_usage="deep")
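Much of the gap typically comes from text columns: pandas stores them as Python string objects (each with tens of bytes of per-object overhead), while R keeps character vectors in a global string pool and shares repeated values. A minimal sketch of how deep the "deep" measurement goes, using made-up column names rather than the real RDW schema:

```python
import pandas as pd

# Synthetic stand-in for the RDW export; column names here are invented.
df = pd.DataFrame({
    "kenteken": ["AB-123-C"] * 100_000,   # text -> object dtype in pandas
    "cilinderinhoud": [1600] * 100_000,   # numbers -> int64 by default
})

shallow = df.memory_usage(deep=False)  # counts only the 8-byte pointers
deep = df.memory_usage(deep=True)      # also counts each Python string object

# Each Python string carries substantial object overhead on top of its
# pointer, so the object column dominates the deep total.
print(shallow["kenteken"], deep["kenteken"])
```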
pieterbons
  • Pandas loads data with fixed dtypes that may be wider than the storage R uses. read_csv has arguments that can massively reduce memory usage: specifying smaller numeric dtypes such as int8 or int16 instead of the default int64 makes a big difference. – Paul Brennan Mar 17 '21 at 10:09
  • I agree with Paul. This [link](https://pythonspeed.com/articles/pandas-load-less-data/) could be a great starting point to explore how you can reduce the size of the data set in Python. For reference, see this awesome [resource](http://adv-r.had.co.nz/memory.html#object-size) for an in-depth exploration of memory management in R. – aimbotter21 Mar 17 '21 at 10:12
  • Thanks! I managed to reduce the size to 3.6 GB in pandas by specifying dtypes; it makes a huge difference. – pieterbons Mar 17 '21 at 10:28

1 Answer


I found that link super useful and figured it's worth breaking out from the comments and summarizing:

Reducing Pandas memory usage #1: lossless compression

  1. Load only columns of interest with usecols

    df = pd.read_csv('voters.csv', usecols=['First Name', 'Last Name'])
    
  2. Shrink numerical columns with smaller dtypes

    • int64: (default) -9223372036854775808 to 9223372036854775807
    • int16: -32768 to 32767
    • int8: -128 to 127
    df = pd.read_csv('voters.csv', dtype={'Ward Number': 'int8'})
    
  3. Shrink categorical data with dtype category

    df = pd.read_csv('voters.csv', dtype={'Party Affiliation': 'category'})
    
  4. Convert mostly-NaN data to a Sparse dtype

    sparse_str_series = series.astype('Sparse[str]')
    sparse_int16_series = series.astype('Sparse[int16]')
    
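The first three techniques can be combined in a single read_csv call. A small sketch on an in-memory CSV (the column names below are invented for illustration, not the real RDW schema):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real export; columns are invented.
raw = (
    "Kenteken,Brandstof,Cilinderinhoud,Extra\n"
    "AB-12-CD,Benzine,1600,x\n"
    "EF-34-GH,Diesel,1900,y\n"
    "IJ-56-KL,Benzine,1200,z\n"
)

# Default load: text becomes object, numbers become int64.
baseline = pd.read_csv(io.StringIO(raw))

# Tuned load: skip the unused column (1), shrink the numeric column (2),
# and store the low-cardinality fuel column as category (3).
tuned = pd.read_csv(
    io.StringIO(raw),
    usecols=["Kenteken", "Brandstof", "Cilinderinhoud"],
    dtype={"Cilinderinhoud": "int16", "Brandstof": "category"},
)

print(baseline.memory_usage(deep=True).sum())
print(tuned.memory_usage(deep=True).sum())
```

On a file with millions of rows these savings compound, which matches the comment above reporting a drop from 6.6 GB to 3.6 GB just from specifying dtypes.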
tdy