
I am working with the data from https://opendata.rdw.nl/Voertuigen/Open-Data-RDW-Gekentekende_voertuigen_brandstof/8ys7-d773 (download the CSV file using the 'Exporteer' button).

When I import the data into R using read.csv() it takes 3.75 GB of memory, but when I import it into pandas using pd.read_csv() it takes up 6.6 GB of memory.

Why is this difference so large?

I used the following code to determine the memory usage of the dataframes in R:

library(pryr) 
object_size(df)

and in Python:

df.info(memory_usage="deep")
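Much of the gap typically comes from text columns: pandas stores them as Python string objects (each with tens of bytes of per-object overhead), while R keeps character vectors in a global string pool and shares repeated values. A minimal sketch of how deep the "deep" measurement goes, using made-up column names rather than the real RDW schema:

```python
import pandas as pd

# Synthetic stand-in for the RDW export; column names here are invented.
df = pd.DataFrame({
    "kenteken": ["AB-123-C"] * 100_000,   # text -> object dtype in pandas
    "cilinderinhoud": [1600] * 100_000,   # numbers -> int64 by default
})

shallow = df.memory_usage(deep=False)  # counts only the 8-byte pointers
deep = df.memory_usage(deep=True)      # also counts each Python string object

# Each Python string carries substantial object overhead on top of its
# pointer, so the object column dominates the deep total.
print(shallow["kenteken"], deep["kenteken"])
```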
pieterbons
  • Pandas loads data with fixed dtypes that may be wider than the storage R uses. read_csv has arguments that can massively reduce memory usage: specifying smaller numeric dtypes such as int8 or int16 instead of the default int64 makes a big difference. – Paul Brennan Mar 17 '21 at 10:09
  • I agree with Paul. This [link](https://pythonspeed.com/articles/pandas-load-less-data/) could be a great starting point to explore how you can reduce the size of the data set in Python. For reference, see this awesome [resource](http://adv-r.had.co.nz/memory.html#object-size) for an in-depth exploration of memory management in R. – aimbotter21 Mar 17 '21 at 10:12
  • Thanks! I managed to reduce the size to 3.6 GB in pandas by specifying dtypes; it makes a huge difference. – pieterbons Mar 17 '21 at 10:28

1 Answer


I found that link super useful and figured it's worth breaking out from the comments and summarizing:

Reducing Pandas memory usage #1: lossless compression

  1. Load only columns of interest with usecols

    df = pd.read_csv('voters.csv', usecols=['First Name', 'Last Name'])
    
  2. Shrink numerical columns with smaller dtypes

    • int64: (default) -9223372036854775808 to 9223372036854775807
    • int16: -32768 to 32767
    • int8: -128 to 127
    df = pd.read_csv('voters.csv', dtype={'Ward Number': 'int8'})
    
  3. Shrink categorical data with dtype category

    df = pd.read_csv('voters.csv', dtype={'Party Affiliation': 'category'})
    
  4. Convert mostly-NaN data to a Sparse dtype

    sparse_str_series = series.astype('Sparse[str]')
    sparse_int16_series = series.astype('Sparse[int16]')
    
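The first three techniques can be combined in a single read_csv call. A small sketch on an in-memory CSV (the column names below are invented for illustration, not the real RDW schema):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real export; columns are invented.
raw = (
    "Kenteken,Brandstof,Cilinderinhoud,Extra\n"
    "AB-12-CD,Benzine,1600,x\n"
    "EF-34-GH,Diesel,1900,y\n"
    "IJ-56-KL,Benzine,1200,z\n"
)

# Default load: text becomes object, numbers become int64.
baseline = pd.read_csv(io.StringIO(raw))

# Tuned load: skip the unused column (1), shrink the numeric column (2),
# and store the low-cardinality fuel column as category (3).
tuned = pd.read_csv(
    io.StringIO(raw),
    usecols=["Kenteken", "Brandstof", "Cilinderinhoud"],
    dtype={"Cilinderinhoud": "int16", "Brandstof": "category"},
)

print(baseline.memory_usage(deep=True).sum())
print(tuned.memory_usage(deep=True).sum())
```

On a file with millions of rows these savings compound, which matches the comment above reporting a drop from 6.6 GB to 3.6 GB just from specifying dtypes.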
tdy