
To understand my question, I should first point out that R data.tables aren't just R data.frames with syntactic sugar; there are important behavioral differences: column assignment/modification by reference in data.tables avoids copying the whole object in memory (see the example in this Quora answer), which is what happens with data.frames.

I've found on multiple occasions that the speed and memory differences arising from data.table's behavior are crucial: they make it possible to work with big datasets that data.frame's behavior couldn't handle.

Therefore, what I'm wondering is: in Python, how do Pandas dataframes behave in this regard?

Bonus question: if Pandas dataframes are closer to R data.frames than to R data.tables, and have the same downside (a full copy of the object when assigning/modifying a column), is there a Python equivalent to R's data.table package?


EDIT per comment request: code examples:

R dataframes:

# renaming a column
colnames(mydataframe)[1] <- "new_column_name"

R datatables:

# renaming a column
library(data.table)
setnames(mydatatable, 'old_column_name', 'new_column_name')

In Pandas:

mydataframe.rename(columns = {'old_column_name': 'new_column_name'}, inplace=True)
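
For reference, here's a quick way to check whether an operation like this copies the underlying data (a minimal sketch I'm adding, not part of the original examples; it assumes a single-dtype frame, where .values returns a view of the underlying block):

import numpy as np
import pandas as pd

# hypothetical toy frame standing in for mydataframe
df = pd.DataFrame(np.random.rand(5, 2),
                  columns=['old_column_name', 'b'])

before = df.values  # view of the underlying float block
df.rename(columns={'old_column_name': 'new_column_name'}, inplace=True)
after = df.values

# False here would mean rename allocated a fresh copy of the data
# (which is what the answer below observes on 2017-era pandas)
print(np.shares_memory(before, after))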
François M.
  • As one who is not an `R` user, you may want to explain further the differences between dataframes and datatables besides a single downside. I fear this question may be too broad. However, have you looked into `numpy`? `pandas` works very closely with it – MattR Dec 14 '17 at 17:31
  • Basically, if you want to change the name of a column in an R dataframe, the memory handling of the operation means the whole dataframe will at one point exist twice in memory. R datatables behave differently and avoid this unnecessary process: the object only ever exists once in memory. R datatables have other differences from R dataframes (such as aggregation features and a different syntax), but they are irrelevant to the question, as I'm mostly curious about how memory is handled in Pandas dataframes. – François M. Dec 14 '17 at 17:36
  • Your question is very important. It will become clearer if you can provide example code of manipulation of a column of a data.frame so that one can see how best to accomplish the same thing with python pandas dataframe. – rnso Dec 14 '17 at 17:38
  • There you go, please see the edit of the question. – François M. Dec 14 '17 at 17:49
  • Seems very likely that any operation offering the `inplace` option can do it ... in place. – Frank Dec 14 '17 at 17:49
  • True. It seems maybe not, though, considering this answer: https://stackoverflow.com/a/47097979/4348534 (and the question comments too) – François M. Dec 14 '17 at 17:51
  • @Frank From what I've understood, `inplace=True` often just means pointers to underlying data are moved around and the *Python* object keeps its identity, but it doesn't prevent the copying of data – juanpa.arrivillaga Dec 14 '17 at 17:59
  • @juanpa.arrivillaga Ah ok, interesting. – Frank Dec 14 '17 at 18:01
  • @Frank of course, that may not be the case for something like re-naming columns (I hope it isn't...) but I know that it isn't a guarantee... – juanpa.arrivillaga Dec 14 '17 at 18:02
  • I think some operation on each of the elements of a column or operation involving 2 columns should be the test to compare memory usage and efficiency, rather than just resetting column names. – rnso Dec 14 '17 at 18:04
  • @fmalaussena that's an inadequate description of what happens in R. While you will get a new data frame, the unchanged columns will share memory with the previous data frame. – hadley Dec 14 '17 at 20:30

1 Answer


Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in a Jupyter notebook:

First define a program that will test this:

%%file df_memprofile.py
import numpy as np
import pandas as pd

def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns = {'e': 'f'}, inplace=True)
    return y

Then load the memory profiler and run + profile the function:

%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()

I get the following output:

Filename: /Users/jakevdp/df_memprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.1 MiB     66.1 MiB   def foo():
     5    104.2 MiB     38.2 MiB       x = np.random.rand(1000000, 5)
     6    104.4 MiB      0.2 MiB       y = pd.DataFrame(x, columns=list('abcde'))
     7    142.6 MiB     38.2 MiB       y.rename(columns = {'e': 'f'}, inplace=True)
     8    142.6 MiB      0.0 MiB       return y

You can see a couple of things:

  1. When y is created, it is just a light wrapper around the original array: i.e., no data is copied.

  2. When the column in y is renamed, the entire data array is duplicated in memory (note the same 38MB increment as when x was created in the first place).

So, unless I'm missing something, it appears that Pandas operates more like R's data.frames than R's data.tables.
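
As a quick cross-check (my addition, assuming the same pandas behavior profiled above; newer pandas versions with copy-on-write enabled may behave differently), np.shares_memory confirms both observations directly:

import numpy as np
import pandas as pd

x = np.random.rand(1000000, 5)
y = pd.DataFrame(x, columns=list('abcde'))

# 1. Construction just wraps the array: the buffers are shared
print(np.shares_memory(x, y.values))   # True

y.rename(columns={'e': 'f'}, inplace=True)

# 2. After the rename, y holds a fresh copy of the data
print(np.shares_memory(x, y.values))   # False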


Edit: Note that rename() has an argument copy that controls this behavior, and defaults to True. For example, using this:

y.rename(columns = {'e': 'f'}, inplace=True, copy=False)

... results in an inplace operation without copying data.

Alternatively, you can modify the columns attribute directly:

y.columns = ['a', 'b', 'c', 'd', 'f']
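
To verify that neither variant copies the data, the same shares-memory check from above can be used (again my addition, under the same assumptions about the pandas version):

import numpy as np
import pandas as pd

y = pd.DataFrame(np.random.rand(1000000, 5), columns=list('abcde'))
buf = y.values  # view of the underlying float block

y.rename(columns={'e': 'f'}, inplace=True, copy=False)
print(np.shares_memory(buf, y.values))  # True: no copy with copy=False

y.columns = ['a', 'b', 'c', 'd', 'g']
print(np.shares_memory(buf, y.values))  # True: assigning .columns leaves the data untouched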
jakevdp
  • I think the profiler is a line off. I commented out the rename() line and I still get a 38MB increase at line 7: "7 158.4 MiB 38.1 MiB #y.rename(columns = {'e': 'f'}, inplace=True)" – Marmaduke Dec 14 '17 at 20:26
  • @Marmaduke I can't reproduce that. When I comment-out the line, the profiler skips it. – jakevdp Dec 14 '17 at 20:31
  • As you pointed out elsewhere, it was caching the old version. Restarting the kernel restored sanity. – Marmaduke Dec 14 '17 at 20:40