14

There are lots of questions on StackOverflow about chained indexing and whether a particular operation makes a view or a copy (for instance, here or here). I still don't fully get it, but the amazing part is that the official docs say "nobody knows". (!?!??) Here's an example from the docs; can you tell me if they really meant that, or if they're just being flippant?

From https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing

def do_something(df):
   foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
   # ... many lines here ...
   foo['quux'] = value       # We don't know whether this will modify df or not!
   return foo

Seriously? For that specific example, is it really true that "nobody knows" and this is non-deterministic? Will that really behave differently on two different dataframes? The rules are really that complex? Or did the guy mean there is a definite answer but just that most people aren't aware of it?

user2543623
  • 1
    Yes, this is frustrating. To add to the pain, that same page later says "This can work at times, but it is not guaranteed to, and therefore should be avoided:" followed by `dfc = dfc.copy()`. So how are we supposed to ensure that a DataFrame passed to a function is not just a copy or slice of another DataFrame? – Mike Williamson May 06 '20 at 10:03

3 Answers

7

I think I can demonstrate something to clarify your situation. In your example, the subset is initially a view, but once you modify it by adding a column it turns into a copy. You can test this by looking at the (private) attribute ._is_view:

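In [29]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
def doSomething(df):
    a = df[['b','c']]               # take a two-column subset of df
    print('before ', a._is_view)    # check the flag before modifying the subset
    a['d'] = 0                      # add a new column to the subset
    print('after ', a._is_view)     # check the flag again

doSomething(df)
df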

before  True
after  False
Out[29]:
          a         b         c
0  0.108790  0.580745  1.820328
1  1.066503 -0.238707 -0.655881
2 -1.320731  2.038194 -0.894984
3 -0.962753 -3.961181  0.109476
4 -1.887774  0.909539  1.318677

So here we can see that a is initially a view on a subsection of the original df, but once you add a column to it this is no longer true, and the original df is not modified.
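Since ._is_view is a private attribute, a rough cross-check is to ask NumPy whether two frames use overlapping memory for a given column. This is only a sketch: the helper name shares_column_memory is made up for illustration, and the result for the plain selection depends on the pandas version and memory layout, so no output is promised for it.

```python
import numpy as np
import pandas as pd


def shares_column_memory(parent, child, column):
    """Hypothetical helper: True if `column` in both frames overlaps in memory."""
    return bool(np.shares_memory(parent[column].to_numpy(),
                                 child[column].to_numpy()))


df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))
plain = df[["b", "c"]]            # view or copy: depends on version and layout
explicit = df[["b", "c"]].copy()  # explicit, independent copy

print(shares_column_memory(df, plain, "b"))     # version-dependent
print(shares_column_memory(df, explicit, "b"))  # False
```

The only result that can be relied on here is the second one: an explicit .copy() does not share memory with the original.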

EdChum
  • But in your example, it's working a particular way, with rules that you can understand and explain. But if that were the case for all dataframes, why would the official docs say "no one knows"? Are they implying the behavior may be different for other data frames? Cause if it always works the way you said, then the docs could just offer a rule of "if you want X, then always do Y". – user2543623 Aug 23 '16 at 16:12
  • I don't think the example in the docs is a good one. With chained indexing a warning would be raised; here it's ambiguous whether, having taken a reference to a view of the original df, adding a new column should also add it to the original df. In this case it doesn't. – EdChum Aug 23 '16 at 16:48
4

Here's the core bit of documentation that I think you may have missed:

Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

So there's an underlying numpy array that has some sort of memory layout, and pandas makes no guarantees about it. I didn't read the docs too thoroughly beyond that, but I assume they describe some approach you should be taking instead if you actually want to set values.
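One way to take the guesswork out, sketched here rather than quoted from the docs, is to state the intent in code: call .copy() when the subset should be independent, and assign on df itself when the original should change. The function add_quux_in_place and the sample data are invented for illustration; do_something mirrors the docs example with an added value parameter.

```python
import pandas as pd


def do_something(df, value):
    # Explicit copy: foo owns its data, so adding a column cannot touch df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo


def add_quux_in_place(df, value):
    # Hypothetical counterpart: the intent is to modify the caller's frame,
    # so the assignment is made on df itself.
    df["quux"] = value
    return df


df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
foo = do_something(df, 0)
print(list(df.columns))   # ['bar', 'baz', 'other']
print(list(foo.columns))  # ['bar', 'baz', 'quux']
```

With the copy made explicitly, the question of whether foo is a view no longer matters for the caller.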

Wayne Werner
  • 1
    Yes, I saw that line; but that's not helpful. What is this alternative that I *should* be taking instead? The above example looks like a very reasonable thing to do, so if that's not allowed, then what should we do instead? Call .copy() after every single method just in case?!? – user2543623 Aug 23 '16 at 13:21
  • But in your code example you take a subset of your original df and then try to add a new column, so what's the intention here? A new column on the original df or on a copy of the df? I don't think this should be regarded as unambiguous. – EdChum Aug 23 '16 at 13:46
  • @EdChum The code is also from the docs. I am surprised actually, because I thought `df[['bar', 'baz']]` always returns a copy (based on [this](http://stackoverflow.com/questions/11285613/selecting-columns#comment15006657_11287278)) – ayhan Aug 23 '16 at 13:49
  • 1
    @ayhan I guess it depends on the underlying np array and memory layout, but codewise the semantics of the code snippet are unclear to me and I'd always explicitly call `copy()` on the subset to ensure I'm working on a copy without relying on any assumptions – EdChum Aug 23 '16 at 14:18
  • @ayhan additionally you could use the attribute `._is_view`, which is either `True` or `False`. On my system it is `True` when you take a subselection of the df: `In [25]: a = df[['b','c']]; a._is_view` gives `Out[25]: True`, versus `In [26]: a = df[['b','c']].copy(); a._is_view`, which gives `Out[26]: False`. See related: http://stackoverflow.com/questions/26879073/checking-whether-data-frame-is-copy-or-view-in-pandas – EdChum Aug 23 '16 at 14:24
  • @EdChum When I tried, `a._is_view` returned `False` (before calling `copy()`) so it seems that is also uncertain. I guess the best thing to do is to be explicit like you said. – ayhan Aug 23 '16 at 14:32
  • 1
    @ayhan I don't see that but it depends on whether you try to modify the view, see my answer – EdChum Aug 23 '16 at 14:33
  • 1
    @ayhan yes, the code *is* from the docs, but it's an illustration of something that you probably shouldn't do - or at least an ambiguous case. – Wayne Werner Aug 23 '16 at 17:17
  • @WayneWerner -- why do you think that example is something you shouldn't do? If you're not supposed to do that thing in the example, then what *should* you do instead? Call .copy() after every single method just in case? The docs don't offer anything better, other than "nobody knows". – user2543623 Aug 23 '16 at 19:13
  • Because it says (at the top) `Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided.`. But you're right that it's not explicit about telling you what you *should* do instead. There may be some other questions on SO about that, at least glancing at the related questions there's some reading there. If you can't find anything, though, I'd highly recommend actually asking that question: "here's what the docs say, but they don't tell me what to do. What *should* I do?" (obviously with a [mcve]) – Wayne Werner Aug 23 '16 at 19:44
3

Here's an example I thought did a good job of illustrating the inconsistency.

I subset the dataframe, which returns a view. I can then overwrite the values in an entire column, but depending on how I do that syntactically, I get different results.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))
x = df[(df > 2).any(axis=1)]
print(x._is_view)
# output when I ran it: True

# Prove that below we are referring to the exact same slice of the dataframe
assert (x.iloc[:len(x), 1] == x.iloc[:, 1]).all()

# Assign using equivalent notation to below
x.iloc[:len(x), 1] = 1
print(x._is_view)
# output when I ran it: True

# Assign using slightly different syntax
x.iloc[:, 1] = 1
print(x._is_view)
# output when I ran it: False
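
For comparison, here is a sketch of the same kind of assignment written so that the result does not depend on whether the subset is a view. It assumes two possible intents: changing the matching rows of df itself (one .loc call on df), or working on an independent subset (an explicit .copy()). The seeded generator and the name mask are additions for the sake of a reproducible example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame(rng.standard_normal((100, 100)))
mask = (df > 2).any(axis=1)     # rows containing at least one value above 2

# Intent 1: change the matching rows of df itself, in a single .loc call.
df.loc[mask, 1] = 1.0

# Intent 2: work on an independent subset and leave df alone.
before = df[2].copy()           # snapshot of column 2 for comparison
x = df[mask].copy()
x.iloc[:, 2] = 1.0

print((df.loc[mask, 1] == 1.0).all())  # True: df was updated directly
print(df[2].equals(before))            # True: df untouched by the edit to x
```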
Cyrus