In[216]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In[217]: bar = foo.ix[:1]
In[218]: bar
Out[218]: 
   a  b
0  1  3
1  2  4

A view is created as expected.

In[219]: bar['a'] = 100
In[220]: bar
Out[220]: 
     a  b
0  100  3
1  100  4
In[221]: foo
Out[221]: 
     a  b
0  100  3
1  100  4
2    3  5

If the view is modified, so is the original DataFrame foo. However, if the assignment is done with None, a copy seems to be made instead. Can anyone shed some light on what's happening, and maybe the logic behind it?

In[222]: bar['a'] = None
In[223]: bar
Out[223]: 
      a  b
0  None  3
1  None  4
In[224]: foo
Out[224]: 
     a  b
0  100  3
1  100  4
2    3  5
Anthony
  • I don't know as much about the details of Pandas as numpy, but I'm willing to bet that what's happening is that, by forcing the column to change its dtype from `I4` to `object`, you're causing it to allocate a new array for the column, and then you're writing to that new array instead of to the array shared with the original DataFrame. (I've posted this as a comment rather than an answer because, even if I'm right, a good answer should explain exactly how this works, not just hand-wave at it…) – abarnert Sep 04 '14 at 17:57
  • @abarnert That's exactly what is happening behind the scenes. Go ahead and post as an answer. – Jeff Sep 04 '14 at 18:14
  • @Jeff: OK, but I still think it might be better to have pointers to where this is explained in the docs, rather than something a numpy user can guess about how Pandas is probably implemented… – abarnert Sep 04 '14 at 18:46
  • I put up an answer. It is REALLY well warned / documented in many, many places. If users don't read the docs then not much can be done. – Jeff Sep 04 '14 at 18:53
  • Thanks Jeff and others! I did come across the "Returning a view versus a copy" section of the doc. Sorry to not have gone through it in details. Will do that now :) – Anthony Sep 04 '14 at 20:06

2 Answers

When you assign bar['a'] = None, you're forcing the column to change its dtype from an integer dtype (int64 in this example) to object, because an integer array cannot hold None.

Doing so forces it to allocate a new array of object for the column, and then of course it writes to that new array instead of to the old array that's shared with the original DataFrame.
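The same mechanism can be sketched in plain numpy, which avoids depending on pandas internals that vary between versions. This is only an illustration of the idea, not how pandas implements column assignment; the variable names are made up for the example.

```python
import numpy as np

original = np.array([1, 2, 3], dtype=np.int64)

# Basic slicing returns a view that shares the same buffer.
view = original[:2]
assert np.shares_memory(original, view)

# An in-place write of the same dtype goes through to `original`.
view[:] = 100

# An int64 buffer cannot hold None, so changing the dtype
# has to allocate a new array.
converted = view.astype(object)
converted[:] = None

# The new array no longer shares memory with `original`,
# so `original` keeps its values.
assert not np.shares_memory(original, converted)
print(original)   # [100 100   3]
print(converted)  # [None None]
```

The write of 100 fits in the existing int64 buffer and so is visible through both names, while the write of None needs a different buffer and therefore only affects the new array.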

abarnert

You are doing a form of chained assignment; see here for why this is a really bad idea.

See this question as well, here.

Pandas will generally warn you that you are modifying a view (even more so in 0.15.0).

In [49]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})

In [51]: foo
Out[51]: 
   a  b
0  1  3
1  2  4
2  3  5

In [52]: bar = foo.ix[:1]

In [53]: bar
Out[53]: 
   a  b
0  1  3
1  2  4

In [54]: bar.dtypes
Out[54]: 
a    int64
b    int64
dtype: object

# this is an internal method (but is for illustration)
In [56]: bar._is_view
Out[56]: True

# this will warn in 0.15.0
In [57]: bar['a'] = 100
/usr/local/bin/ipython:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/usr/local/bin/python

In [58]: bar._is_view
Out[58]: True

# bar is now a copied object (and will replace the existing dtypes with new ones).
In [59]: bar['a'] = None

In [60]: bar.dtypes
Out[60]: 
a    object
b     int64
dtype: object

You should never rely on whether something is a view (even in numpy), except in certain performance-critical situations. It is not guaranteed: whether you get a view depends on the memory layout of the underlying data.
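As a small numpy illustration of why view-ness is not something to count on: two indexing expressions that select the same elements can differ in whether they share memory with the source. The array and names here are only for the example.

```python
import numpy as np

arr = np.arange(6)

# Basic slicing: the result is a view onto arr's buffer.
sliced = arr[1:4]

# Integer-array ("fancy") indexing: the result is a new array.
picked = arr[[1, 2, 3]]

# Same values, different memory ownership.
print(np.shares_memory(arr, sliced))  # True
print(np.shares_memory(arr, picked))  # False
```

If code needs to know whether two arrays overlap, `np.shares_memory` is a more reliable check than reasoning about how the object was produced.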

You should very rarely try to set data so that it propagates through a view, and doing this in pandas is almost always going to cause trouble when you have mixed dtypes. (In numpy you can only have a view on a single dtype; I am not even sure what a view on a multi-dtyped array that changes the dtype does, or if it's even allowed.)
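A minimal sketch of the two unambiguous alternatives, following the hint in the warning text above: index the original frame directly with a single `.loc` call when you want to change it, or take an explicit `.copy()` when you want an independent subset. It uses `.loc` rather than `.ix`, on the assumption that you are on a pandas version where `.ix` is no longer available.

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})

# Intent: modify the original. Use one .loc call on foo itself,
# selecting rows and column together (label slices are inclusive).
foo.loc[:1, 'a'] = 100

# Intent: work on an independent subset. Ask for a copy explicitly,
# so later assignments can never reach foo.
bar = foo.loc[:1].copy()
bar['a'] = None

print(foo)
print(bar.dtypes)
```

Either way the code states which object is meant to change, so the result does not depend on whether an intermediate selection happened to be a view.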

Jeff