How to estimate how much memory a Pandas' DataFrame will need?

Question

I have been wondering... If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need? Just trying to get a better feel of data frames and memory...

You could always look at the process and its memory usage for a single file. If you're running Linux, try `top` and then `Shift + M` to sort by memory usage. – Jay, Aug 06 '13 at 20:25
I feel I should advertise this [open pandas issue](https://github.com/pydata/pandas/issues/3871). — Andy Hayden, Aug 06 '13 at 20:48
I have a large dataframe with 4 million rows. I discovered that its empty subset `x=df.loc[[]]` takes `0.1` seconds to get computed (to extract zero rows) and, furthermore, takes hundreds of megabytes of memory, just as the original dataframe, probably because of some copying underneath. — Sergey Orshanskiy, Oct 04 '14 at 06:13
new link for the [old post](http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/) by the pandas lead developer — saladi, Feb 22 '18 at 19:36

score 176 · Answer 1 · edited Oct 14 '21 at 07:16

df.memory_usage() will return how many bytes each column occupies:

>>> df.memory_usage()

Row_ID            20906600
Household_ID      20906600
Vehicle           20906600
Calendar_Year     20906600
Model_Year        20906600
...

To include indexes, pass index=True.

So to get overall memory consumption:

>>> df.memory_usage(index=True).sum()
731731000

Also, passing deep=True will enable a more accurate memory usage report, that accounts for the full usage of the contained objects.

This is because memory usage does not include memory consumed by elements that are not components of the array if deep=False (default case).
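
As a rough illustration of the difference, the sketch below builds a small made-up frame (the column names and values are placeholders, not the data shown above) with one integer column and one column of Python strings forced to object dtype, then compares the shallow and deep totals.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one numeric column and one object (string) column.
n_rows = 1_000
df = pd.DataFrame({
    "row_id": np.arange(n_rows, dtype="int64"),
    "label": pd.Series(["item-%d" % i for i in range(n_rows)], dtype=object),
})

# Shallow report: an object column is counted as one pointer per row.
shallow_total = int(df.memory_usage(index=True).sum())

# Deep report: also adds the size of each Python string the pointers refer to.
deep_total = int(df.memory_usage(index=True, deep=True).sum())

print("shallow:", shallow_total, "bytes")
print("deep:   ", deep_total, "bytes")
```

For a frame with only numeric columns the two totals should be the same, since there are no separate Python objects to account for.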

edited Oct 14 '21 at 07:16

Epsi95

answered Oct 06 '15 at 12:34

Aleksey Sivokon

Is the sum of all the columns' memory usages really the impact on memory usage? I can imagine there to be more overhead. – firelynx Nov 02 '15 at 15:31
You really also want `deep=True` – smci Nov 23 '16 at 19:20
The sum of df.memory_usage() does not equal sys.getsizeof(df)! There are many overheads. As smci mentioned, you need `deep=True` – vagabond Jul 13 '17 at 22:50
FYI, `memory_usage()` returns the memory usage in bytes (as you would expect). – engelen Sep 18 '17 at 07:55
Why such a huge difference between with/without deep=True? – Nguai al Jan 18 '19 at 15:02

Brian Burns · Answer 2 · 2018-10-30T10:32:20.967

142

Here's a comparison of the different methods - sys.getsizeof(df) is simplest.

For this example, df is a dataframe with 814 rows and 11 columns (2 ints, 9 objects), read from a 427 KB shapefile.

sys.getsizeof(df)

>>> import sys
>>> sys.getsizeof(df)
(gives results in bytes)
462456

df.memory_usage()

>>> df.memory_usage()
...
(lists each column at 8 bytes/row)

>>> df.memory_usage().sum()
71712
(roughly rows * cols * 8 bytes)

>>> df.memory_usage(deep=True)
(lists each column's full memory usage)

>>> df.memory_usage(deep=True).sum()
(gives results in bytes)
462432

df.info()

Prints dataframe info to stdout. Technically these are kibibytes (KiB), not kilobytes: as the docstring says, "Memory usage is shown in human-readable units (base-2 representation)." So to get bytes you would multiply by 1024, e.g. 451.6 KiB = 462,438 bytes.

>>> df.info()
...
memory usage: 70.0+ KB

>>> df.info(memory_usage='deep')
...
memory usage: 451.6 KB
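
The relationship between these figures can be checked with a small sketch. The frame below is made-up stand-in data, not the shapefile used in this answer, so its numbers will differ; the last lines only redo the KiB conversion with the byte count reported above.

```python
import sys
import pandas as pd

# Hypothetical stand-in frame with one int column and one object column.
df = pd.DataFrame({
    "id": range(814),
    "name": pd.Series(["feature %d" % i for i in range(814)], dtype=object),
})

deep_bytes = int(df.memory_usage(deep=True).sum())
sizeof_bytes = sys.getsizeof(df)

# sys.getsizeof(df) is expected to be the deep total plus a few bytes of
# interpreter overhead; the exact gap depends on the Python/pandas version.
print(deep_bytes, sizeof_bytes, sizeof_bytes - deep_bytes)

# df.info() reports base-2 units, so divide bytes by 1024 to get KiB.
reported_kib = round(462432 / 1024, 1)
print(reported_kib)  # 451.6
```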

edited Oct 30 '18 at 10:32

answered Dec 11 '17 at 11:06

Brian Burns

What object or module does the `g` code above refer to? – zozo Aug 14 '18 at 00:28
@zozo woops - was a typo - fixed – Brian Burns Aug 14 '18 at 12:39
Hi @BrianBurns sys.getsizeof(df) is giving memory value in kb? or mb? – Rahul Kumar Singh Oct 03 '18 at 13:32
@RahulKumarSingh it's in bytes - will note in answer – Brian Burns Oct 04 '18 at 14:16
I use `df.info(memory_usage="deep")`, it returns "392.6 MB", whereas `sys.getsizeof(df)` and `df.memory_usage(index=True, deep=True).sum()` both return approximately "411718016" (~ 411MB). Can you please explain why the 3 results are not consistent? Thanks – Chau Pham Oct 29 '18 at 16:35
@Catbuilts hmm, maybe df.info doesn't include the index and the others do? What does `df.memory_usage(deep=True).sum()` return? – Brian Burns Oct 29 '18 at 21:16
@BrianBurns: `df.memory_usage(deep=True).sum()` returns nearly the same as `df.memory_usage(index=True, deep=True).sum()`. In my case, the `index` doesn't take much memory. Interestingly enough, I found that `411718016/1024/1024 = 392.6`, so `df.info(memory_usage="deep")` may use `2^10` to convert *byte* to *MB*, which makes me confused. Thanks for your help anyway :D. – Chau Pham Oct 30 '18 at 04:01
@Catbuilts Ah, that explains it! `df.info` is returning mebibytes (2^20 bytes), not megabytes (10^6 bytes) - will amend the answer. – Brian Burns Oct 30 '18 at 09:50
Just a reminder: it would be nice to accept an answer as a courtesy to the people who helped you. I believe this is the most comprehensive one. – Kostas Markakis Jul 20 '22 at 08:41

score 57 · Answer 3 · answered Jul 21 '15 at 15:29

I thought I would bring some more data to the discussion.

I ran a series of tests on this issue.

By using the Python `resource` module I got the memory usage of my process.

And by writing the csv into a StringIO buffer, I could easily measure the size of it in bytes.
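
A minimal sketch of the CSV-size half of that measurement is below. It is not the original test script: the frame shape and the random data are assumptions, and it compares against `memory_usage(deep=True)` rather than the process-level figure from `resource`, so it will not reproduce the numbers discussed in this answer.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical test frame: 10,000 rows of 10 float columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10_000, 10)))

# Write the CSV into an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)

# Size of the CSV text once encoded as UTF-8 bytes.
csv_bytes = len(buffer.getvalue().encode("utf-8"))

# Size of the column data held by pandas (index excluded).
frame_bytes = int(df.memory_usage(index=False, deep=True).sum())

print("csv bytes:  ", csv_bytes)
print("frame bytes:", frame_bytes)
```

A process-level measurement such as max RSS also counts the interpreter, imported libraries, and any temporary copies, so it would be expected to come out higher than `frame_bytes`.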

I ran two experiments, each one creating 20 dataframes of increasing size, from 10,000 lines to 1,000,000 lines, all with 10 columns.

In the first experiment I used only floats in my dataset.

This is how the memory increased in comparison to the CSV file as a function of the number of lines (sizes in megabytes).

[Figure: memory and CSV size in megabytes as a function of the number of rows, float entries]

In the second experiment I used the same approach, but the dataset consisted of only short strings.

[Figure: memory and CSV size in megabytes as a function of the number of rows, string entries]

It seems that the relation between the size of the CSV and the size of the dataframe can vary quite a lot, but in these experiments the size in memory was always bigger, by a factor of 2-3.

I would love to complete this answer with more experiments, please comment if you want me to try something special.

What is your y axis? – Ilya V. Schurov Feb 05 '18 at 14:15
max_rss and csv size on disk in megabytes – firelynx Feb 06 '18 at 10:37

Jeff · Answer 4 · 2013-08-06T21:05:17.513

32

You have to do this in reverse.

In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')

In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug  6 16:55 test.csv

Technically memory is about this (which includes the indexes)

In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160

So 168MB in memory with a 400MB file, 1M rows of 20 float columns
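
The figure above can be reproduced by hand from the dtype sizes. The sketch below is only the arithmetic, under the assumption that the row and column labels are each stored as 64-bit integers; I believe newer pandas versions use a lightweight range index by default, in which case the index term would be much smaller.

```python
import numpy as np

rows, cols = 1_000_000, 20

# 20 float64 columns of 1,000,000 values each.
data_bytes = rows * cols * np.dtype("float64").itemsize

# Assumption: row and column labels are each stored as 64-bit integers,
# which matches the figure printed above.
index_bytes = rows * np.dtype("int64").itemsize
columns_bytes = cols * np.dtype("int64").itemsize

total_bytes = data_bytes + index_bytes + columns_bytes
print(total_bytes)  # 168000160
```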

DataFrame(randn(1000000,20)).to_hdf('test.h5','df')

!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug  6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file

In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')

In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug  6 16:58 test.h5

The data was random, so compression doesn't help too much

edited Aug 06 '13 at 21:05

answered Aug 06 '13 at 21:00

Jeff

That is very clever! Any idea how to measure the memory you need to read the file using `read_csv`? – Andy Hayden Aug 06 '13 at 21:15
No idea how to measure AS you read; IIRC it can be up to 2x the final memory needed to hold the data (from wes's article), but I think he brought it down to a constant + final memory – Jeff Aug 06 '13 at 21:23
Ah, I need to re-read, I remembered 2x being some convenient theoretical min for a certain algorithm, if it's even less that's coool. – Andy Hayden Aug 06 '13 at 21:26
You can use [`iotop`](http://guichaz.free.fr/iotop/) like `top`/`htop` for watching (in real time) IO performance. – Phillip Cloud Aug 06 '13 at 22:02
`nbytes` will be a gross underestimate if you have e.g. strings in a dataframe. – Sergey Orshanskiy Jan 12 '15 at 01:47
You probably want to note that `!ls` is a `!magic` that only works under Jupyter. – smci Nov 23 '16 at 19:16
What/who is Wes's Article? – Greg Hilston Aug 08 '19 at 20:27

score 12 · Answer 5 · answered Aug 06 '13 at 20:30

Yes, there is. Pandas will store your data in two-dimensional numpy ndarray structures, grouping them by dtype. An ndarray is basically a raw C array of data with a small header, so you can estimate its size just by multiplying the size of the dtype it contains by the dimensions of the array.

For example: if you have 1000 rows with 2 np.int32 and 5 np.float64 columns, your DataFrame will have one 2x1000 np.int32 array and one 5x1000 np.float64 array which is:

4bytes*2*1000 + 8bytes*5*1000 = 48000 bytes
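
The same estimate can be checked against pandas directly. This is a minimal sketch; the column names are made up for illustration, and the index is left out of the comparison.

```python
import numpy as np
import pandas as pd

n_rows = 1_000

# Hypothetical frame: 2 int32 columns and 5 float64 columns.
columns = {}
for i in range(2):
    columns["int_%d" % i] = np.zeros(n_rows, dtype=np.int32)
for i in range(5):
    columns["float_%d" % i] = np.zeros(n_rows, dtype=np.float64)
df = pd.DataFrame(columns)

# Estimate: bytes per element for each column's dtype, times the row count.
estimate = sum(dtype.itemsize * n_rows for dtype in df.dtypes)

# What pandas itself reports for the column data.
measured = int(df.memory_usage(index=False).sum())

print(estimate, measured)  # 48000 48000
```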

answered Aug 06 '13 at 20:30

Viktor Kerkez

@AndyHayden What do you mean the construction cost? The size of an instance of `DataFrame`? – Phillip Cloud Aug 06 '13 at 20:40
Thanks Victor! @Andy - Any idea how big the construction cost is? – Anne Aug 06 '13 at 20:40
It's not included, but `pandas` has a very efficient implementation of `read_table` in Cython (it's much better than numpy's loadtxt), so I assume that it parses and stores the data directly into the `ndarray`. – Viktor Kerkez Aug 06 '13 at 20:41
@PhillipCloud you have to build it, that takes memory.. I seem to remember twice the size being mentioned?... – Andy Hayden Aug 06 '13 at 20:43

Phillip Cloud · Answer 6 · 2015-03-06T17:14:13.557

10

If you know the dtypes of your array then you can directly compute the number of bytes that it will take to store your data + some for the Python objects themselves. A useful attribute of numpy arrays is nbytes. You can get the number of bytes from the arrays in a pandas DataFrame by doing

nbytes = sum(block.values.nbytes for block in df.blocks.values())

object dtype arrays store 8 bytes per object (object dtype arrays store a pointer to an opaque PyObject), so if you have strings in your csv you need to take into account that read_csv will turn those into object dtype arrays and adjust your calculations accordingly.
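
Since `df.blocks` may not be available in recent pandas releases, here is a hedged per-column sketch of the same idea. The frame is made up for illustration; it separates the bytes held in the arrays (pointers only, for the object column) from the bytes of the Python objects those pointers refer to.

```python
import sys
import numpy as np
import pandas as pd

# Hypothetical frame: one float column and one object column of strings.
df = pd.DataFrame({
    "value": np.arange(5, dtype="float64"),
    "text": pd.Series(["a", "bb", "ccc", "dddd", "eeeee"], dtype=object),
})

# Bytes held directly in the arrays; for object dtype this is pointers only.
array_bytes = sum(df[col].to_numpy().nbytes for col in df.columns)

# Extra bytes for the Python objects that the object column points to.
object_bytes = sum(sys.getsizeof(x) for x in df["text"])

print("array bytes: ", array_bytes)
print("object bytes:", object_bytes)
```

For short strings the per-object overhead tends to dominate, which is why `array_bytes` alone can badly underestimate a frame built from text columns.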

EDIT:

See the numpy scalar types page for more details on the object dtype. Since only a reference is stored you need to take into account the size of the object in the array as well. As that page says, object arrays are somewhat similar to Python list objects.

edited Mar 06 '15 at 17:14

answered Aug 06 '13 at 20:38

Phillip Cloud

Thanks Phillip! Just to clarify - for a string we would need 8 bytes for a pointer to a string object, plus the actual string object? – Anne Aug 07 '13 at 13:31
Yes, for any object type you'll need an 8 byte pointer + size(object) – Viktor Kerkez Aug 07 '13 at 14:13
Suggest df.blocks.values() It looks like df.blocks is now a dict – MRocklin Mar 06 '15 at 17:11

Zaher Abdul Azeez · Answer 7 · 2016-12-15T09:24:10.620

8

I believe this gives the in-memory size of any object in Python. The internals need to be checked with regard to pandas and numpy.

>>> import sys
#assuming the dataframe to be df 
>>> sys.getsizeof(df) 
59542497

edited Dec 15 '16 at 09:24

answered Nov 14 '16 at 09:18

Zaher Abdul Azeez

score 3 · Answer 8 · answered Nov 08 '22 at 05:26

To print human readable results you can try this:

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def humansize(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes)-1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])

df.memory_usage(index=True, deep=True).apply(humansize)
# Index  128 B
# a      571.72 MB
# b      687.78 MB
# c      521.6 MB
# dtype: object

humansize(df.memory_usage(index=True, deep=True).sum())
# 1.74 GB

Code adapted from this and this answer.
