
The documentation says:

Numpy representation of NDFrame -- Source

What does "Numpy representation of NDFrame" mean? Will modifying this numpy representation affect my original dataframe? In other words, will .values return a copy or a view?

Some Stack Overflow answers implicitly rely on a view being returned. For example, the accepted answer to "Set values on the diagonal of pandas.DataFrame" uses np.fill_diagonal(df.values, 0) to set all values on the diagonal of df to 0. That is, a view is returned in this case. However, as shown in @coldspeed's answer, sometimes a copy is returned.

This feels very basic. It is just a bit weird to me because I cannot find a more detailed source on .values.


Another experiment that returns a view, in addition to the experiments in @coldspeed's answer:

import pandas as pd

df = pd.DataFrame([["A", "B"], ["C", "D"]])

df.values[0][0] = 0

We get

df
    0   1
0   0   B
1   C   D

Even though it holds mixed types now, we can still modify the original df by assigning into df.values:

df.values[0][1] = 5
df
    0   1
0   0   5
1   C   D
Tai

2 Answers


TL;DR:

Whether .values returns a copy (so changing the values does not change the DataFrame) or a view (so changing the values does change the DataFrame) is an implementation detail. Don't rely on either behavior. It could change if the pandas developers think it would be beneficial (for example, if they changed the internal structure of DataFrame).
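As an illustration of not relying on either behavior, here is a minimal sketch of the diagonal example from the question, written so that it works whether .values would have been a view or a copy. It assumes a pandas version that provides DataFrame.to_numpy with a copy parameter; the frame contents and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values and column names are assumptions.
df = pd.DataFrame(np.ones((3, 3), dtype=int), columns=list("abc"))

# Work on an explicit, independent copy instead of whatever .values hands back.
arr = df.to_numpy(copy=True)
np.fill_diagonal(arr, 0)

# Rebuild a frame from the modified array, keeping the original labels.
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
```

The original df is left untouched and result carries the zeroed diagonal, at the cost of one explicit copy.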


I guess the documentation has changed since the question was asked. It currently reads:

pandas.DataFrame.values

Return a Numpy representation of the DataFrame.

Only the values in the DataFrame will be returned, the axes labels will be removed.

It no longer mentions NDFrame; it simply says "NumPy representation of the DataFrame". A NumPy representation could be either a view or a copy!

The documentation also contains a Note about mixed dtypes:

Notes

The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.

e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.

From these Notes it's clear that accessing the values of a DataFrame that contains different dtypes can (almost) never return a view, simply because the values have to be put into a single array of the "lowest-common-denominator" dtype, and that involves a copy.
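A small sketch of that upcasting, using an int32 and a float64 column (the column names are arbitrary). It only inspects the result with np.shares_memory rather than writing to it, so it makes no assumption about whether the returned array is writable.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({
    "i": np.array([1, 2], dtype="int32"),
    "f": np.array([1.5, 2.5], dtype="float64"),
})

values = mixed.values
common_dtype = values.dtype  # one dtype that accommodates both columns

# Check whether the combined array overlaps the memory of either column.
shares = (
    np.shares_memory(values, mixed["i"].to_numpy())
    or np.shares_memory(values, mixed["f"].to_numpy())
)
print(common_dtype, shares)
```

Here the combined array is float64 and overlaps neither column, which is what you would expect if the values had to be copied into a new array.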

However, the documentation says nothing about the view/copy behavior, and that's by design. jreback mentioned on the pandas issue tracker 1 that this really is just an implementation detail:

this is an implementation detail. since you are getting a single dtyped numpy array, it is upcast to a compatible dtype. if you have mixed dtypes, then you almost always will have a copy (the exception is mixed float dtypes will not copy I think), but this is a numpy detail.

I agree this is not great, but it has been there from the beginning and will not change in current pandas. If exporting to numpy you need to take care.

Even the documentation of Series mentions nothing about a view:

pandas.Series.values

Return Series as ndarray or ndarray-like depending on the dtype

It even says that, depending on the dtype, the result might not be a plain array at all. That certainly includes the possibility (even if only hypothetical) that it returns a copy. It does not guarantee that you get a view.


When does .values return a view and when does it return a copy?

The answer is simply: it's an implementation detail, and as long as it's an implementation detail there won't be any guarantees. It's an implementation detail because the pandas developers want to be able to change the internal storage if they want to. However, in some cases it's impossible to create a view, for example with a DataFrame containing columns of different dtypes.

There might be advantages to analyzing the current behavior, but as long as it's an implementation detail you shouldn't rely on it anyway.

However, if you're interested: pandas currently stores columns with the same dtype internally as one multi-dimensional array. That has the advantage that you can operate on rows and columns very efficiently (at least as long as they have the same dtype). But if the DataFrame contains mixed types, it will have several internal multi-dimensional arrays, one for each dtype. It's not possible to create a view that points into two distinct arrays (at least with NumPy), so when you have mixed dtypes you'll get a copy when you ask for the values.
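One way to probe this without mutating anything is to ask for .values twice and check whether the two results overlap in memory. This is only a sketch: values_shares_memory is a hypothetical helper, not a pandas function, and what it reports reflects the current internal layout rather than any guarantee.

```python
import numpy as np
import pandas as pd


def values_shares_memory(df):
    """Hypothetical helper: do two separate .values calls overlap in memory?

    If both results are views of the same internal array they overlap;
    if each call builds a fresh combined array they do not.
    """
    return np.shares_memory(df.values, df.values)


# A single dtype, built from one 2-D array.
homogeneous = pd.DataFrame(np.zeros((2, 3)))

# Two dtypes (int64 and float64), so the values must be combined on access.
mixed = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})

print(values_shares_memory(homogeneous))
print(values_shares_memory(mixed))
```

Under the storage scheme described above, the first check should report overlap and the second should not, but treat the result as an observation about your installed version only.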


A side-note, your example:

df = pd.DataFrame([["A", "B"],["C", "D"]])

df.values[0][0] = 0

isn't mixed-dtype. It has one specific dtype: object. However, object arrays can contain any Python object, so I can see why you would say/assume that it's of mixed types.
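You can check this by looking at the dtypes directly. In this sketch dtype=object is passed explicitly, which is an assumption added here so that the result does not depend on how a particular pandas version infers a dtype for strings.

```python
import pandas as pd

# dtype=object is an explicit assumption; the original example relies on inference.
df = pd.DataFrame([["A", "B"], ["C", "D"]], dtype=object)

column_dtypes = set(df.dtypes)  # the distinct column dtypes
array_dtype = df.values.dtype   # dtype of the NumPy representation
print(column_dtypes, array_dtype)
```

Every column has the same dtype, so from pandas' point of view the frame is homogeneous even though an object column can hold strings and integers side by side.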


Personal note:

Personally, I would have preferred that the values property only ever return a view, or raise an error when it cannot, with an additional method (e.g. as_array) that only ever returns a copy, even when a view would be possible. That would make the behavior more predictable and avoid surprises: a property that performs an expensive copy is certainly unexpected.
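For the "always a copy" half of that wish, later pandas versions offer DataFrame.to_numpy with a copy parameter. The sketch below assumes such a version is installed. Note that the documentation describes copy=False as not ensuring a no-copy result, so it does not provide the "view or error" half.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3))

# copy=True requests an array that is independent of the DataFrame.
independent = df.to_numpy(copy=True)
independent[0, 0] = 99

# The DataFrame is unaffected by writes to the copy.
untouched = df.iloc[0, 0]
print(untouched, independent[0, 0])
```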


1 This question has been mentioned in the issue post, so maybe the docs changed because of this question.

MSeifert

Let's test it out.

First, with pd.Series objects.

In [750]: s = pd.Series([1, 2, 3])

In [751]: v = s.values

In [752]: v[0] = 10000

In [753]: s
Out[753]: 
0    10000
1        2
2        3
dtype: int64

Now, for DataFrame objects. First, consider non-mixed dtypes -

In [780]: df = pd.DataFrame(1 - np.eye(3, dtype=int))

In [781]: df
Out[781]: 
   0  1  2
0  0  1  1
1  1  0  1
2  1  1  0

In [782]: v = df.values

In [783]: v[0] = 12345

In [784]: df
Out[784]: 
       0      1      2
0  12345  12345  12345
1      1      0      1
2      1      1      0

Modifications are made, so that means .values returned a view.

Now, consider a scenario with mixed dtypes -

In [755]: df = pd.DataFrame({'A' :[1, 2], 'B' : ['ccc', 'ddd']})

In [756]: df
Out[756]: 
   A    B
0  1  ccc
1  2  ddd

In [757]: v = df.values

In [758]: v[0] = 123

In [759]: v[0, 1] = 'zzxxx'

In [760]: df
Out[760]: 
   A    B
0  1  ccc
1  2  ddd

Here, .values returns a copy.


Observation

.values on a Series returns a view regardless of its dtype, whereas for DataFrames it depends: for homogeneous dtypes a view is returned, otherwise a copy.

cs95
  • Why setting values to `df.values` changes df? Cool experiment by the way. – Tai Jan 11 '18 at 07:25
  • @Tai For efficiency purposes, I believe. It is cheaper to return a view so that you don't duplicate data unless you need to operate on it. If it's an object column, it can't be helped, the thing needs to be re-constructed. – cs95 Jan 11 '18 at 07:32
  • This is a very educated guess. But I think we should not make a conclusion so quick on these experiments. – Tai Jan 11 '18 at 07:40
  • Found a counter example to your great experiments! See above. – Tai Jan 11 '18 at 07:53
  • @Tai I don't think that's a counter example. What does `df.dtypes` say? All objects I assume? – ayhan Jan 11 '18 at 07:57
  • @ayhan I thought object columns were exempt from returning views... or am I mistaken here? – cs95 Jan 11 '18 at 07:59
  • @ayhan I think the setup is the same with COLDSPEED. His/her df is also all types object but results is different. Here it returns a copy and mine returns a view. – Tai Jan 11 '18 at 08:00
  • @cᴏʟᴅsᴘᴇᴇᴅ As far as I know since that also can be represented with a single numpy array it is not exempt but I cannot be 100% sure. – ayhan Jan 11 '18 at 08:02
  • @Tai Well, in that case, I might be mistaken as to how object columns are treated. I'm pretty sure your example is in alignment with this line of thinking, but my conclusion may need a little adjustment. – cs95 Jan 11 '18 at 08:03
  • @cᴏʟᴅsᴘᴇᴇᴅ I have no idea. Please adjust as you feel right. Maybe change "Conclusion" to a lighter word? Let's not draw conclusion for so soon. – Tai Jan 11 '18 at 08:06
  • Why you deleted the example...? – Tai Jan 11 '18 at 08:11
  • @Tai You can bring it back, but I don't see what it achieves. Now that I've changed the conclusion to an observation, the "counter example" is no longer a "counter"... – cs95 Jan 11 '18 at 08:14
  • Yes. But it is just good to contrast with your great examples. I am sorry to use an offensive word. I will watch out my word choice. – Tai Jan 11 '18 at 08:15
  • @Tai Alright, added it back, it's not a problem ;-) – cs95 Jan 11 '18 at 08:17
  • Nice explanation. *In general*, I would be wary of changing underlying NumPy arrays directly. Not strictly related to the question but, for instance, changing `df.columns.values` has side effects beyond changing values. – jpp Jul 28 '18 at 12:15