Say we have the following dataframe:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

shown below:

> df
     A      B         C         D
0  foo    one  0.846192  0.478651
1  bar    one  2.352421  0.141416
2  foo    two -1.413699 -0.577435
3  bar  three  0.569572 -0.508984
4  foo    two -1.384092  0.659098
5  bar    two  0.845167 -0.381740
6  foo    one  3.355336 -0.791471
7  foo  three  0.303303  0.452966

And then I do the following:

df2 = df
df  = df[df['C']>0]

If you now look at df and df2, you will see that df2 holds the original data, whereas df was updated to keep only the rows where C is greater than 0.

I thought Pandas wasn't supposed to make a copy in an assignment like df2 = df and that it would only make copies with either:

  1. df2 = df.copy(deep=True)
  2. df2 = copy.deepcopy(df)

What happened above then? Did df2 = df make a copy? I presume that the answer is no, so it must have been df = df[df['C']>0] that made a copy, and I presume that, if I didn't have df2=df above, there would have been a copy without any reference to it floating in memory. Is that correct?

Note: I read through "Returning a view versus a copy" and I wonder if the following:

"Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy."

explains this behavior.

Josh
  • http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy – acushner Mar 20 '14 at 15:11
  • Thanks @acushner, I went through that already, but could not find the answer to my question. – Josh Mar 20 '14 at 15:12
  • FYI, you will only get a view if it's a single dtype (and even then it's not guaranteed; it depends on how you are slicing). – Jeff Mar 20 '14 at 16:22

1 Answer

df2 = df does not make a copy; it only binds a second name to the same object. It is df = df[df['C'] > 0] that returns a copy, and df is then rebound to that copy.

Just print out the ids and you'll see:

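print(id(df))          # id of the original DataFrame
df2 = df
print(id(df2))         # same id: df2 is just another name for it
df = df[df['C'] > 0]
print(id(df))          # different id: df now names a new object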
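
The same check can be written with `is` rather than by comparing printed ids. This is only a minimal sketch: the small hand-written frame and the names `alias_is_same` and `rebound_is_same` are illustrative assumptions, not the data from the question.

```python
import pandas as pd

# Small illustrative frame (values chosen by hand, not the question's data).
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [0.8, 2.3, -1.4, 0.5]})

df2 = df                       # second name for the same object, no copy
alias_is_same = df2 is df      # identity check, equivalent to comparing id()

df = df[df["C"] > 0]           # boolean indexing builds a new DataFrame; df is rebound to it
rebound_is_same = df2 is df    # df2 still names the original object

print(alias_is_same, rebound_is_same)
print(len(df2), len(df))       # the original keeps all rows, the filtered frame does not
```

Here df2 keeps all four rows because it still refers to the original object, while df refers to the filtered result.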
ragesz
acushner
  • Thanks, I updated the OP. I presume that garbage collection takes care of copies without references. Is that correct? – Josh Mar 20 '14 at 15:27
  • It sure will. I mean, it will get rid of the original, too, assuming there are no references to it. – acushner Mar 20 '14 at 15:41
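
One way to watch the original being reclaimed once nothing refers to it is a weak reference. This is a rough sketch assuming CPython and a DataFrame that supports weak references; `original_ref` and `original_freed` are names introduced here for illustration, and the explicit `gc.collect()` is a precaution rather than a requirement.

```python
import gc
import weakref

import pandas as pd

df = pd.DataFrame({"C": [0.8, -1.4, 0.5]})
original_ref = weakref.ref(df)   # weak reference: does not keep the original alive

df = df[df["C"] > 0]             # df now names the filtered copy; no name is left for the original
gc.collect()                     # precaution in case any reference cycle delays cleanup

# A dead weak reference returns None when called.
original_freed = original_ref() is None
print(original_freed)
```

If a line such as df2 = df had been run before the filtering step, the weak reference would still resolve, because df2 would keep the original alive.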