
Summary: This doesn't work:

df[df.key==1]['D'] = 1

but this does:

df.D[df.key==1] = 1

Why?

Reproduction:

In [1]: import pandas as pd

In [2]: from numpy.random import randn

In [4]: df = pd.DataFrame(randn(6,3),columns=list('ABC'))

In [5]: df
Out[5]: 
          A         B         C
0  1.438161 -0.210454 -1.983704
1 -0.283780 -0.371773  0.017580
2  0.552564 -0.610548  0.257276
3  1.931332  0.649179 -1.349062
4  1.656010 -1.373263  1.333079
5  0.944862 -0.657849  1.526811

In [6]: df['D']=0.0

In [7]: df['key']=3*[1]+3*[2]

In [8]: df
Out[8]: 
          A         B         C  D  key
0  1.438161 -0.210454 -1.983704  0    1
1 -0.283780 -0.371773  0.017580  0    1
2  0.552564 -0.610548  0.257276  0    1
3  1.931332  0.649179 -1.349062  0    2
4  1.656010 -1.373263  1.333079  0    2
5  0.944862 -0.657849  1.526811  0    2

This doesn't work:

In [9]: df[df.key==1]['D'] = 1

In [10]: df
Out[10]: 
          A         B         C  D  key
0  1.438161 -0.210454 -1.983704  0    1
1 -0.283780 -0.371773  0.017580  0    1
2  0.552564 -0.610548  0.257276  0    1
3  1.931332  0.649179 -1.349062  0    2
4  1.656010 -1.373263  1.333079  0    2
5  0.944862 -0.657849  1.526811  0    2

but this does:

In [11]: df.D[df.key==1] = 3.4

In [12]: df
Out[12]: 
          A         B         C    D  key
0  1.438161 -0.210454 -1.983704  3.4    1
1 -0.283780 -0.371773  0.017580  3.4    1
2  0.552564 -0.610548  0.257276  3.4    1
3  1.931332  0.649179 -1.349062  0.0    2
4  1.656010 -1.373263  1.333079  0.0    2
5  0.944862 -0.657849  1.526811  0.0    2


My question is:

Why does only the 2nd way work? I can't seem to see a difference in selection/indexing logic.

Version is 0.10.0

Edit: This should no longer be done this way. Since version 0.11 there is .loc. See here: http://pandas.pydata.org/pandas-docs/stable/indexing.html
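For reference, a minimal sketch of the .loc form mentioned in the edit, assuming a pandas version that has .loc (0.11 or later). The frame below is a made-up stand-in with the same D and key columns as the reproduction, not the original random data.

```python
import pandas as pd

# Hypothetical stand-in for the question's frame (values are illustrative).
df = pd.DataFrame({"D": [0.0] * 6, "key": 3 * [1] + 3 * [2]})

# Row mask and column label are passed to a single .loc setter call,
# so there is no intermediate object for the assignment to get lost in.
df.loc[df["key"] == 1, "D"] = 3.4

print(df)
```

Because the selection and the assignment happen in one indexing call, the result does not depend on whether an intermediate selection would have been a view or a copy.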

N8888
K.-Michael Aye
  • As said in the answers it seems to be a numpy problem: have a look at [this question](http://stackoverflow.com/q/9470604/1301710) for a similar problem. I'm not sure if it is a problem of view vs. copy. – bmu Jan 07 '13 at 19:53
  • I understand now that it is clearly (and actually simply) the difference between a view and a copy. The first method only provides a copy, which is then garbage-collected. The second method provides a view, so the assignment is applied to the original DataFrame. (See Dougal's comments below.) – K.-Michael Aye Jan 07 '13 at 21:49

2 Answers


The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.

In df[df.key==1]['D'] you first do boolean slicing (which produces a copy of the DataFrame), then you select the column ['D'] from that copy.

In df.D[df.key==1] = 3.4, you first choose a column, then do boolean slicing on the resulting Series.

This seems to make the difference, although I must admit that it is a little counterintuitive.

Edit: The difference was identified by Dougal, see his comment: With version 1, the copy is made as the __getitem__ method is called for the boolean slicing. For version 2, only the __setitem__ method is accessed - thus not returning a copy but just assigning.
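To make the dispatch visible without relying on any particular pandas version, here is a small pure-Python sketch. Recorder is a hypothetical toy class, not part of pandas; it only logs which special method each statement triggers, and its __getitem__ returns a fresh object to stand in for the copy that boolean indexing produces.

```python
class Recorder:
    """Toy container (hypothetical) that logs the special methods it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Return a new object, standing in for the copy pandas would make.
        return Recorder(self.name + "[" + repr(key) + "]", self.log)

    def __setitem__(self, key, value):
        # Only record the call; a real container would store the value.
        self.log.append((self.name, "__setitem__", key))


log = []
obj = Recorder("obj", log)

obj["mask"]["D"] = 1  # chained: __getitem__ on obj, then __setitem__ on the temporary
obj["mask"] = 1       # direct: __setitem__ on obj itself, no temporary involved

for entry in log:
    print(entry)
```

In the chained statement, the object that receives `__setitem__` is the temporary returned by `__getitem__`, not `obj`. If that temporary is a copy, the assignment is discarded along with it.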

Danica
Thorsten Kranz
  • That's what I thought at first too, but there must be something else going on. `df[df.key==1] = 1000` will actually assign 1000 to all of the values in the slice, so it can't be a copy. I guess there is some magic happening in the __setattr__ or __setitem__ methods. – cxrodgers Jan 07 '13 at 09:51
  • But as I do a boolean slicing on the resulting Series, that should be a copy as well, shouldn't it? So why does the assignment work that way? – K.-Michael Aye Jan 07 '13 at 09:53
  • Look at Dougal's comment above. With version 1, the copy is made when the `__getitem__` method is called for the boolean slicing. For version 2, only the `__setitem__` method is accessed, so no copy is returned and the values are just assigned. – Thorsten Kranz Jan 07 '13 at 09:58
  • @K.-MichaelAye In the first way, you first construct a copy with `__getitem__` and then call `__setitem__` on that copy, which is then immediately garbage-collected. In the second way, you construct a view with `__getitem__` and then call `__setitem__` on the view. – Danica Jan 07 '13 at 09:58

I am pretty sure that your 1st way is returning a copy, instead of a view, and so assigning to it does not change the original data. I am not sure why this is happening though.

It seems to be related to the order in which you select rows and columns, NOT the syntax for getting columns. These both work:

df.D[df.key == 1] = 1
df['D'][df.key == 1] = 1

And neither of these works:

df[df.key == 1]['D'] = 1
df[df.key == 1].D = 1

From this evidence, I would assume that the slice df[df.key == 1] is returning a copy. But this is not the case! df[df.key == 1] = 0 will actually change the original data, as if it were a view.
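The two behaviours can be put side by side in a short sketch. The frame is again a made-up stand-in with D and key columns, and the warnings filter is only an assumption to keep the chained-assignment warning that recent pandas versions emit out of the output.

```python
import warnings

import pandas as pd

# Hypothetical stand-in for the question's frame (values are illustrative).
df = pd.DataFrame({"D": [0.0] * 6, "key": 3 * [1] + 3 * [2]})
mask = df["key"] == 1

# Chained form: df[mask] is evaluated first and yields a new object,
# so the assignment lands on that temporary.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask]["D"] = 1
d_after_chained = df["D"].tolist()

# Direct form: a single setter call on df itself, which writes the
# value into every column of the selected rows.
df[mask] = -1

print(d_after_chained)
print(df)
```

Both observations are consistent with df[mask] returning a copy when read, while df[mask] = value is a separate setter call that never creates that copy.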

So, I'm not sure. My sense is that this behavior has changed with the version of pandas. I seem to remember that df.D used to return a copy and df['D'] used to return a view, but this doesn't appear to be true anymore (pandas 0.10.0).

If you want a more complete answer, you should post in the pystatsmodels forum: https://groups.google.com/forum/?fromgroups#!forum/pystatsmodels

cxrodgers
  • `df[df.key == 1]` _does_ actually return a copy (as Thorsten's answer points out). The reason `df[df.key == 1] = 0` modifies the original is that, although the syntax is a bit misleading, that's not actually doing the same thing at all; the non-assignment version calls `__getitem__` and the assignment version `__setitem__`. It's like how if we have `l = [0, 1, 2]`, then `l[1]` returns the int 1 but `l[1] = 5` modifies the original. – Danica Jan 07 '13 at 09:54