
I am loading a huge CSV (18 GB) into memory and noticing very large differences between R and Python. This is on an AWS EC2 r4.8xlarge, which has 244 GB of memory. Obviously this is an extreme example, but the principle holds for smaller files on real machines too.

When using pd.read_csv, my file took ~30 mins to load and took up 174 GB of memory, essentially so much that I then can't do anything with it. By contrast, R's fread() from the data.table package took ~7 mins and only ~55 GB of memory.

Why does the pandas object take up so much more memory than the data.table object? Furthermore, why fundamentally is the pandas object almost 10x larger than the text file on disk? It's not like .csv is a particularly efficient way to store data in the first place.

Martijn Pieters
seth127
  • Similar to https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file – RobertMyles Oct 31 '17 at 18:39
  • @RobertMc not in terms of pandas – roganjosh Oct 31 '17 at 18:40
  • Are you using 1.10.5 of `fread` which is new, in dev, and not yet on CRAN? Try also [paratext](https://github.com/wiseio/paratext). – Matt Dowle Oct 31 '17 at 18:43
  • @MattDowle in which case are you suggesting that `fread` is reading incorrectly? The question is about why Python/Pandas doesn't mirror memory usage of `fread`. – roganjosh Oct 31 '17 at 18:47
  • @roganjosh `fread` is much faster in dev (i.e. 1.10.5) which is why I'm asking. No I'm not suggesting it is reading incorrectly. The OP is also asking about speed as well as memory usage. – Matt Dowle Oct 31 '17 at 18:49
  • with that much data, why not convert it to an `hdf` file? – DJK Oct 31 '17 at 19:17
  • @djk47463 maybe that's what they are trying to do – Matt Dowle Oct 31 '17 at 19:46
  • @MattDowle, you could do that in chunks from any computer; they're trying to just load in all the data for analysis, I believe (see the sketch after these comments). – DJK Nov 01 '17 at 00:33
  • That’s right, I’m trying to load it into memory to run an sklearn model on it. The database I’m getting it from outputs this huge .tsv, which is why I have this issue. You’re right that converting it to another format (hdf, parquet, feather, etc.) is probably best; I just wanted to keep the pipeline as simple as possible. Sidenote: reading it into R with data.table, writing it to feather, then reading the feather into Python took ~12 mins total, compared to ~30 mins just using pd.read_csv straight to Python. That’s crazy to me. – seth127 Nov 02 '17 at 02:03
  • was there ever an answer why pandas in 2019 still lags R's `fread` for speed? I see the answer below for memory, but no answer for speed. R's `fread` is so fast, it can often render converting to a binary format (feather, parquet) unnecessary. – ivo Welch Aug 30 '19 at 06:24
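
If converting the file to another format in chunks is an option (as djk47463 suggests above), a rough sketch of that in pandas might look like the following; the file names, separator, and chunk size are placeholders, and it needs PyTables installed for the HDF5 backend:

import pandas as pd

# stream the CSV in fixed-size chunks and append each one to a single HDF5 table,
# so the whole file never has to sit in memory at once
reader = pd.read_csv('huge_file.tsv', sep='\t', chunksize=1000000)
with pd.HDFStore('huge_file.h5', mode='w') as store:
    for chunk in reader:
        store.append('data', chunk, format='table')

# later: read it back (or pull subsets via the `where` argument of read_hdf)
df = pd.read_hdf('huge_file.h5', 'data')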

1 Answer

You won't be able to beat the speed of `fread`, but as far as memory usage goes, my guess is that you have integers that are being read in as 64-bit integers in Python.

Assuming your file looks like this:

a,b
1234567890123456789,12345

In R, you'll get:

sapply(fread('test.txt'), class)
#          a          b
#"integer64"  "integer"

Whereas in Python (on a 64-bit machine):

pandas.read_csv('test.txt').dtypes
#a   int64
#b   int64

Thus you'll use more memory in Python. You can force the type in `read_csv` as a workaround:

pandas.read_csv('test.txt', dtype={'b': numpy.int32}).dtypes
#a   int64
#b   int32

Small integers are also the reason both the R and Python objects take up more space than the .csv file, since e.g. a "1" in a .csv file takes up 2 bytes (the character plus either a comma or a newline), but 4 or 8 bytes in memory.
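
If you want to see exactly where the memory goes, `DataFrame.memory_usage` reports the bytes used per column; a minimal check against the toy test.txt above might look like this:

import numpy
import pandas

df = pandas.read_csv('test.txt')
print(df.memory_usage(deep=True))    # both columns stored as int64: 8 bytes per value

df32 = pandas.read_csv('test.txt', dtype={'b': numpy.int32})
print(df32.memory_usage(deep=True))  # column b now takes 4 bytes per value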

eddi
  • I think this is probably spot on. I never thought to use the `dtype` option in read_csv. Also, a lot of my columns are just binary, but I bet it's treating all those 1's and 0's as big integers too. Is there an efficient numpy type for binary (or very small) integers or should I just use `int32`? Probably won't affect the speed, but it should definitely make it a lot smaller. Thanks. – seth127 Nov 01 '17 at 00:59
  • You can try using `int8`. – eddi Nov 01 '17 at 02:20
  • I’ll try that. Any idea if there’s a way to say “this one column is int64, all the others are int8”? The documentation says you have to do it as a dict but I have several thousand columns (hence the large file size) so that seems a bit tedious. – seth127 Nov 02 '17 at 02:09
  • I'd probably just read in the first couple of lines, create the dict based on the names, and then read the whole thing in (see the sketch after these comments). – eddi Nov 02 '17 at 15:59
  • Good call. That wouldn’t be hard at all. Thanks. – seth127 Nov 03 '17 at 21:59
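
A sketch of the approach in the last couple of comments: peek at the header with `nrows`, build the dtype dict programmatically, then read the full file with it. The column name 'big_col', the file name, and the separator are placeholders for your own data:

import numpy
import pandas

# read just the header plus a few rows to get the column names
preview = pandas.read_csv('huge_file.tsv', sep='\t', nrows=5)

# default every column to int8, then override the few that really need 64 bits
dtypes = {col: numpy.int8 for col in preview.columns}
dtypes['big_col'] = numpy.int64

df = pandas.read_csv('huge_file.tsv', sep='\t', dtype=dtypes)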