I am loading a huge CSV (18 GB) into memory and noticing very large differences between R and Python. This is on an AWS EC2 r4.8xlarge, which has 244 GB of memory. Obviously this is an extreme example, but the principle holds for smaller files on smaller machines too.
When using pd.read_csv, my file took ~30 minutes to load and took up 174 GB of memory, essentially so much that I then can't do anything with it. By contrast, R's fread() from the data.table package took ~7 minutes and only ~55 GB of memory.
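For reference, this is roughly the Python side of the comparison (a minimal sketch; "big.csv" is a placeholder path, and the deep memory report is just one way to check the in-memory footprint):

```python
import pandas as pd

# Placeholder path; the real file is an ~18 GB CSV.
df = pd.read_csv("big.csv")

# Report the DataFrame's in-memory footprint, including the per-string
# overhead of object-dtype columns ("deep" accounting).
df.info(memory_usage="deep")
```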
Why does the pandas object take up so much more memory than the data.table object? And more fundamentally, why is the pandas object almost 10x larger than the text file on disk? It's not as if .csv is a particularly efficient way to store data in the first place.
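In case it helps, here is a sketch of how the footprint can be broken down per column (assuming df is the DataFrame loaded above; column names and dtypes are whatever read_csv infers):

```python
# Per-column memory in bytes, with deep accounting for object (string) columns.
per_column = df.memory_usage(deep=True)
print(per_column.sort_values(ascending=False).head(10))

# Which dtypes pandas actually chose for the columns.
print(df.dtypes.value_counts())
```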