189

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.

If I have, for example,

df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))

I understand that a query returns a copy so that something like

foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40

will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as

df.iloc[3] = 70

or

df.ix[1,'B':'E'] = 222

will change df. But I'm lost when it comes to more complicated cases. For example,

df[df.C <= df.B] = 7654321

changes df, but

df[df.C <= df.B].ix[:,'B':'E']

does not.

Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?


Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.

cottontail
orome

3 Answers

197

Here are the rules; later rules override earlier ones:

  • All operations generate a copy

  • If inplace=True is provided, it will modify in-place; only some operations support this

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The .query example above is a separate case: it will always return a copy, as it's evaluated by numexpr.)

  • An indexer that gets on a multiple-dtyped object is always a copy.
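
The first three rules can be checked with a minimal sketch like the one below. The frame and the variable names are made up for illustration, and the sketch deliberately avoids the view-versus-copy cases from the last two rules, since those are the unreliable ones.

```python
import pandas as pd

# Hypothetical toy frame, used only for this illustration.
df = pd.DataFrame({"A": [3, 1, 2], "B": [30, 10, 20]})

# Rule 1: an ordinary operation returns a new object and leaves df alone.
sorted_df = df.sort_values("A")
a_after_plain_sort = df["A"].tolist()      # [3, 1, 2]

# Rule 2: inplace=True modifies df itself (only some methods accept it).
df.sort_values("A", inplace=True)
a_after_inplace_sort = df["A"].tolist()    # [1, 2, 3]

# Rule 3: a setting indexer such as .loc writes into df directly.
df.loc[df["A"] > 1, "B"] = 0
b_after_loc_set = df["B"].tolist()         # [10, 0, 0]
```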

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you should never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work.
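
As a sketch of how the single `.loc` call handles the assignment from the question, the following rebuilds a frame of the same shape and sets columns B through E in every row where C <= B. The seeded generator and the names `mask`, `changed` and `col_a_in_masked_rows` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the question; the seed is arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 8)),
                  columns=list("ABCDEFGH"), index=range(1, 9))

# Compute the row selection once and keep it.
mask = df["C"] <= df["B"]

# One indexing call on df itself: rows by mask, columns by label slice.
df.loc[mask, "B":"E"] = 7654321.0

changed = df.loc[mask, "B":"E"]            # every value here is now 7654321.0
col_a_in_masked_rows = df.loc[mask, "A"]   # column A is outside the slice
```

Saving the mask first matters here because the assignment overwrites B and C, the very columns the condition was built from.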

The chained indexing is 2 separate Python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
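
To make the "2 separate operations" point concrete, here is a rough sketch that writes the chained form out as two explicit steps and compares it with the single `.loc` call. The small frame and the names `tmp`, `two_step_changed_df` and `one_step_changed_df` are illustrative. Warnings are silenced only because some pandas versions emit SettingWithCopyWarning at the second step.

```python
import warnings

import pandas as pd

# Hypothetical toy frame; C <= B holds only for the middle row.
df = pd.DataFrame({"B": [1, 5, 3], "C": [2, 4, 6], "E": [7, 8, 9]})
mask = df["C"] <= df["B"]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Step 1: a get. Selecting rows with a boolean mask builds a new DataFrame.
    tmp = df[mask]
    # Step 2: a set, but on the temporary object rather than on df.
    tmp.loc[:, "B":"E"] = 0

two_step_changed_df = bool((df.loc[mask, "B":"E"] == 0).all().all())

# One step: a single set on df.loc, so pandas knows the target is df.
df.loc[mask, "B":"E"] = 0
one_step_changed_df = bool((df.loc[mask, "B":"E"] == 0).all().all())
```

With a boolean mask the first step produces a new object, so the two-step write never reaches `df`; with other kinds of first step it might, which is what makes the pattern unreliable.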

Scarabee
Jeff
  • This opened my eyes: "The chained indexing is 2 separate python operations and thus cannot be *reliably* intercepted by pandas..." That explains why the things seem erratic to me and why experimenting wasn't working to pin things down. – orome Apr 25 '14 at 15:00
  • 1
    Regarding the 2nd and 5th rules (to see if I understand): If I set something I've indexed (with `.loc` for example) on the LHS of an assignment, it will set in place (2nd rule); but the same-seeming operation performed in "two steps", one with the indexing on the RHS of an assignment to a variable, and a second with an assignment to that variable, will not change the original, because the first step makes a copy (5th rule). Is that right? – orome Apr 25 '14 at 15:07
  • Is that what's happening in my final ("never do") example: the first indexing, `[df.C <= df.B]`, is getting a copy (5th rule), while the second, `.ix[:,'B':'E']`, is (irrelevantly, since the first got a copy) setting that (2nd rule)? – orome Apr 25 '14 at 15:09
  • 7
    ``.query`` will ALWAYS return a copy because of what it's doing (and not a view), because it's evaluated by numexpr. So I'll add that to the 'rules' – Jeff Apr 25 '14 at 15:19
  • yes, the 2-step operation is the issue; never do this! (as it's insidious as it *can* work, just not reliably) – Jeff Apr 25 '14 at 15:20
  • And have I got the details right about the rules being applied right? – orome Apr 25 '14 at 15:24
  • Can you say a bit more about the 4th rule ("An indexer that gets on a single-dtyped object is almost always a view...")? I'm not sure I understand this one, and it sounds unpredictable. – orome Apr 25 '14 at 15:37
  • 7
    pandas relies on numpy to determine whether a view is generated. In a single dtype case (which could be a 1-d for a series, a 2-d for a frame, etc), numpy *may* generate a view; it depends on what you are slicing; sometimes you can get a view and sometimes you can't. pandas doesn't rely on this fact at all as it's not always obvious whether a view is generated. but this doesn't matter as loc doesn't rely on this when setting. However, when chain indexing this is very important (and thus why chain indexing is bad) – Jeff Apr 25 '14 at 15:49
  • So is it *not* the case that simple indexing (with just `[]` or attribute access) reliably returns a setable view on the original dataframe. E.g. can I not rely on `df.B[2] = ...` or `df[B][2] = ...` changing `df`? – orome Apr 25 '14 at 18:07
  • 1
    your example is still chained; however, on a single-dtyped frame with a single-indexer it would always return a view I believe. But MUCH better to simply, do ``df.loc[2,'B'] = value`` anyhow – Jeff Apr 25 '14 at 18:15
  • Ah, but the simple case would always work (e.g. `df[['C','D']] = df[['D','C']]`), right? It's the chaining that's risky (and here just happens to work because of the single type). Right? – orome Apr 25 '14 at 18:20
  • that's not chaining, that's a direct assignment, so ok. its when you get a series, THEN index that is the problem – Jeff Apr 25 '14 at 19:07
  • 5
    Many thanks Jeff, your reply is most useful. What is your source/reference on this topic? – Kamixave Aug 19 '14 at 12:46
  • 6
    Then first, thanks for your great work! And second, if you have enough time I think it would be great to add a paragraph similar to your main reply in the doc. – Kamixave Aug 19 '14 at 13:00
  • 3
    certainly would a take a pull-request to add/revise the docs. go for it. – Jeff Aug 19 '14 at 13:06
  • Just a couple of clarifications. I'll put one per comment for readability. "An indexer that sets" - does it include simple square brackets `df[...]` or only square brackets after `iloc`, `loc`, `ix`, `at`, `iat`, like `df.iloc[...]` etc? – max Feb 03 '16 at 15:53
  • In "an indexer that gets on a single-dtyped object", is the word "on" a typo? As in, does it mean an indexer that gets (=returns) a single-dtyped object? – max Feb 03 '16 at 15:53
  • @Jeff, to your statement "All operations generate a copy" it seems like the assignment operator actually creates a view. Could you comment on these operations: `p1 = pandas.DataFrame(); p2 = p1 ; p1['C'] = 3 ; >>>p1 Empty DataFrame Columns: [C] Index: []; >>>p2 Empty DataFrame Columns: [C] Index: []; p1.loc['C'] = 3 ; >>>p1 C C 3; >>>p2 C C 3` – jxramos Apr 13 '17 at 20:08
  • I think something that muddles discovery into these distinctions is that when you call type() on a concrete DataFrame you get `` but you get the same results running it on a view. Not sure if there's a lightweight way to output `` but that would greatly help I'd imagine. Other frameworks make a distinction between the view and the data, eg [WPF]( https://msdn.microsoft.com/en-us/library/system.windows.data.collectionviewsource(v=vs.110).aspx), [Qt](http://doc.qt.io/qt-4.8/model-view-programming.html#view-classes) – jxramos Apr 13 '17 at 21:37
  • Actually my comment about the assignment operator comes from my prior ignorance to [base Python behavior](http://stackoverflow.com/questions/8463907/assignment-of-objects-and-fundamental-types) on this matter making it not unique to Pandas. – jxramos Apr 13 '17 at 22:27
  • Does that mean that there is currently no way to extract a subset of columns in pandas without generating a copy, if the columns have different dtypes? – Konstantin Nov 27 '17 at 09:35
  • 1
    It's not clear to me what you mean by "subsequent override". Do you mean that later rules override previous rules? – user2357112 Apr 25 '18 at 23:42
  • 1
    @user2357112 - yes, that is what is meant. You start at the top and the last applicable rule sets the behavior. – BeeOnRope Mar 12 '19 at 20:50
  • 1
    @Jeff, when you say "all operations create a copy" could you specify what you mean by an operation (e.g. is df['my_col'] an operation? What about df.my_col etc?) – Jinglesting Aug 12 '19 at 20:15
  • 1
    I don't believe `.query` rule holds any longer, the code above results in `SettingWithCopyWarning` for me, even if I explicitly write `foo = df.query('2 < index <= 5', engine='numexpr')`. – random Nov 02 '19 at 22:02
  • @user2357112supportsMonica I believe you're correct - the latter ones tend to be more specific, and have higher priority/precedence, just like rules in CSS. – flow2k Feb 12 '20 at 21:12
  • Initially I was confused by "Your example of chained indexing df[df.C <= df.B].loc[:,'B':'E'] is not guaranteed to work (and thus you should never do this)." But then I realized @Jeff meant this should not be used for an assignment. I believe this is fine for getting/accessing. – flow2k Feb 12 '20 at 21:15
  • @Jeff @cs95 How do we proceed if we want a settable view, and we don't know when we are writing the code whether the DF has only one dtype or many? IIUC you are saying that even `df.loc[Row-or-rows, Col-or-cols] = x` is not guaranteed to be setting df instead of a copy thereof? (This may be the same as @Konstantin 's question, but I am not sure, and anyway, no one answered them.) – Robert P. Goldman Jun 26 '20 at 22:52
  • English is not my native language. I don't really understand the phrase "gets on" here. Does it mean "produce" or "accept"? – JoyfulPanda Aug 29 '21 at 17:00
  • @RobertP.Goldman @Konstantin were you able to solve this? i had a df with different dtypes which i converted to a single dtype, but the behaviour seems exactly the same. that is, with `df2 = df1.iloc[:len(df1)-1]` df2 takes up twice the memory of df1. didn't make a difference if it was single dtype or multiple. – Jayen Jan 03 '22 at 00:38
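
Jeff's comment above that pandas leaves the view-or-copy decision to NumPy can be explored at the NumPy level with `np.shares_memory`. This is only a sketch on a bare array (the names are illustrative), not a statement about what any particular pandas indexer does internally.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

basic = arr[0:2]                 # basic slicing
fancy = arr[[0, 1]]              # integer-array ("fancy") indexing
masked = arr[arr[:, 0] < 8]      # boolean indexing

basic_is_view = np.shares_memory(arr, basic)     # True
fancy_is_view = np.shares_memory(arr, fancy)     # False
masked_is_view = np.shares_memory(arr, masked)   # False

basic[0, 0] = -1    # writes through to arr
fancy[0, 1] = -2    # lands on the copy only
```

All three selections pick the same two rows, yet only the slice is a view, which is why a chained assignment can appear to work in one case and silently do nothing in another.
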
2

Since pandas 1.5.0, pandas has had a Copy-on-Write (CoW) mode in which any dataframe/Series derived from another behaves like a copy. When it is enabled, a copy is actually made only when data that is shared with another dataframe/Series is modified. With CoW disabled, operations like slicing create a view (so changing the new dataframe unexpectedly changes the original), but with CoW, modifying the new dataframe triggers a copy and the original is left alone.

import pandas as pd

pd.options.mode.copy_on_write = False   # disable CoW (this is the default as of pandas 2.0)
df = pd.DataFrame({'A': range(4), 'B': list('abcd')})

df1 = df.iloc[:4]                       # view
df1.iloc[0] = 100
df.equals(df1)                          # True <--- df changes together with df1



pd.options.mode.copy_on_write = True    # enable CoW (this is planned to be the default by pandas 3.0)
df = pd.DataFrame({'A': range(4), 'B': list('abcd')})

df1 = df.iloc[:4]                       # copy because data is shared
df1.iloc[0] = 100
df.equals(df1)                          # False <--- df doesn't change when df1 changes
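
The "copy only when shared data is modified" behaviour can be observed with `np.shares_memory`. The sketch below assumes a pandas 2.x install where the `mode.copy_on_write` option is available; the frame and the names `subset`, `shared_before_write` and `shared_after_write` are illustrative.

```python
import numpy as np
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"A": range(4), "B": range(4, 8)})

    # Slicing does not copy yet: subset still points at df's data.
    subset = df.iloc[:3]
    shared_before_write = np.shares_memory(df["A"].to_numpy(),
                                           subset["A"].to_numpy())

    # The first write to subset triggers the copy, so df is left alone.
    subset.iloc[0, 0] = 100
    shared_after_write = np.shares_memory(df["A"].to_numpy(),
                                          subset["A"].to_numpy())

    first_value_in_df = int(df.loc[0, "A"])
    first_value_in_subset = int(subset.iloc[0, 0])
```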

One consequence is that pandas operations can be faster with CoW. In the following example, in the first case (when CoW is disabled), all intermediate steps create copies, while in the second case (when CoW is enabled), a copy is created only at assignment (all intermediate steps are on views). The runtime difference below comes from that: in the second case, data was not unnecessarily copied.

df = pd.DataFrame({'A': range(1_000_000), 'B': range(1_000_000)})

%%timeit
with pd.option_context('mode.copy_on_write', False):  # disable CoW in a context manager
    df1 = df.add_prefix('col ').set_index('col A').rename_axis('index col').reset_index()
# 30.5 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
with pd.option_context('mode.copy_on_write', True):   # enable CoW in a context manager
    df2 = df.add_prefix('col ').set_index('col A').rename_axis('index col').reset_index()
# 18 ms ± 513 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cottontail
-2

Here is something funny:

u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]

The first three all seem to be references to df, but the last one is not!

ouflak