
I ran into this problem while trying to verify some properties of a data frame's view.

Suppose I have a dataframe defined as: df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3)) and a view of this dataframe defined as: df1 = df.iloc[:3, :]. We now have two dataframes as follows:

print(df)
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17

print(df1)

   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8

Now I want to output the id of a particular cell of these two dataframes:

print(id(df.loc[0, 'a']))
print(id(df1.loc[0, 'a']))

and I get this output:

140114943491408
140114943491408

The weird thing is, if I execute those two 'print id' lines again, the ids change:

140114943491480
140114943491480

I have to emphasize that I did not re-run the code that defines df when I re-ran those two 'print id' lines, so df and df1 are not redefined. In my opinion, then, the memory address of each element in the data frame should be fixed, so how can the output change?

Something even weirder happens when I keep executing those two 'print id' lines. In some rare cases, the two ids are not even equal to each other:

140114943181088
140114943181112

But if I execute id(df.loc[0, 'a']) == id(df1.loc[0, 'a']) at the same time, Python still outputs True. I know that since df1 is a view of df, their cells should share the same memory, but how can their ids occasionally differ?

These strange behaviors leave me totally at a loss. Could anyone explain them? Are they due to the characteristics of data frames or of the id function in Python? Thanks!

FYI, I am using Python 3.5.2.

Chen Li
Y. Gao
  • You are not getting the id of a "cell", you are getting the `id` of the object returned by the `.loc` accessor, which is a boxed version of the underlying data. – juanpa.arrivillaga May 21 '18 at 03:27
  • I tried to run the same program, and the ids are equal every time I print them: >>> print(id(df.loc[0, 'a'])) 4402589368 >>> print(id(df1.loc[0, 'a'])) 4402589368 >>> print(id(df.loc[0, 'a'])) 4402589368 >>> print(id(df1.loc[0, 'a'])) 4402589368 >>> print(id(df.loc[0, 'a'])) 4402589368 >>> print(id(df1.loc[0, 'a'])) 4402589368 >>> print(id(df1.loc[0, 'a'])) 4402589368 I think you must be redefining df and df1, or re-running the program. – Shubham Agrawal May 21 '18 at 03:50
  • @ShubhamAgrawal read my answer. It is definitely possible without redefining the DataFrames. If you actually understand what is going on, what should be surprising is that the IDs are the same, not that they are different – juanpa.arrivillaga May 21 '18 at 03:56
  • Because you're getting a copy, not a view. Duplicate of [In Pandas, does .iloc method give a copy or view?](https://stackoverflow.com/questions/47972633/in-pandas-does-iloc-method-give-a-copy-or-view) – smci May 21 '18 at 04:41
  • @smci no, **that is not the issue**. See my answer, mutating `df` affects `df1`. – juanpa.arrivillaga May 21 '18 at 05:21
  • @juanpa.arrivillaga: ah ok. The part `id(df.loc[0, 'a']) == id(df1.loc[0, 'a'])` is well-hidden. Your answer is good – smci May 21 '18 at 05:35
  • Say, didn't there used to be a tag \[[tag:id]\]? Would come in useful here. – smci May 21 '18 at 05:37

1 Answer


You are not getting the id of a "cell", you are getting the id of the object returned by the .loc accessor, which is a boxed version of the underlying data.

So,

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3))
>>> df1 = df.iloc[:3, :]
>>> df.dtypes
a    int64
b    int64
c    int64
dtype: object
>>> df1.dtypes
a    int64
b    int64
c    int64
dtype: object

But since everything in Python is an object, your loc method must return an object:

>>> x = df.loc[0, 'a']
>>> x
0
>>> type(x)
<class 'numpy.int64'>
>>> isinstance(x, object)
True

However, the actual underlying buffer is a primitive array of C fixed-size 64-bit signed integers. Those values are not Python objects. They get "boxed" into one each time you access them, to borrow a term from other languages which mix primitive types with objects.
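To make the boxing concrete, here is a minimal sketch using a plain NumPy array rather than a DataFrame (the names arr, first, and second are illustrative, not from the question). It assumes only that indexing a single element produces a fresh numpy.int64 scalar on each access:

```python
import numpy as np

# A raw buffer of C int64 values; the elements are not Python objects.
arr = np.arange(3, dtype=np.int64)

first = arr[0]   # boxes the stored value into a new numpy.int64 object
second = arr[0]  # boxes the same stored value again, into another object

print(type(first))       # the NumPy scalar type wrapping the value
print(first == second)   # same value
print(first is second)   # different objects, since both are still alive
```

Because both names keep their objects alive, the two boxed scalars cannot be the same object, even though they were read from the same slot in the buffer.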

Now, the phenomenon you are seeing with all objects having the same id:

>>> id(df.loc[0, 'a']), id(df.loc[0, 'a'])
(4539673432, 4539673432)
>>> id(df.loc[0, 'a']), id(df.loc[0, 'a']), id(df1.loc[0,'a'])
(4539673432, 4539673432, 4539673432)

This occurs because, in CPython, a new object is free to reuse the memory address of a recently reclaimed one. Indeed, when you build your tuple of ids, the object returned by loc only lives long enough to be passed to and processed by the first call to id. By the second call to loc, that object has already been deallocated, so the new one simply reuses the same memory. You can see the same behavior with any Python object, like a list:

>>> id([]), id([])
(4545276872, 4545276872)

Fundamentally, ids are only guaranteed to be unique for the lifetime of the object. Read more about this phenomenon here. But note that in the following case, they will always be different:

>>> x = df.loc[0, 'a']
>>> x2 = df.loc[0, 'a']
>>> id(x), id(x2)
(4539673432, 4539673408)

Since you keep references to both objects, neither is reclaimed, so the second one requires new memory.
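The same rule can be sketched without pandas at all. Box below is a hypothetical placeholder class introduced only for this illustration. The first comparison is guaranteed by the language; the last line only may show equal numbers, depending on the interpreter and allocator:

```python
class Box:
    """Hypothetical placeholder class, used only for this illustration."""


# Both objects are alive at the same time, so their ids must differ.
kept_a = Box()
kept_b = Box()
print(id(kept_a) != id(kept_b))

# Each temporary Box() can be reclaimed as soon as id() returns, so the
# second one may (or may not) be given the same address as the first.
print(id(Box()), id(Box()))
```

This is why comparing ids of temporaries tells you nothing about whether two expressions refer to the same underlying data.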

Note that for many immutable objects, the interpreter is free to optimize and return the exact same object. In CPython, this is the case with "small ints", the so-called small-int cache:

>>> x = 2
>>> y = 2
>>> id(x), id(y)
(4304820368, 4304820368)

But this is an implementation detail that should not be relied upon.
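In practice this means comparing values with == rather than comparing identities. A small sketch (the names x and y are illustrative): the value comparison is always reliable, while the identity comparison may print either result depending on the interpreter:

```python
# Build two equal integers at runtime.
x = int("1000")
y = int("1000")

print(x == y)  # value equality: the reliable comparison
print(x is y)  # identity: implementation-dependent, do not rely on it
```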

If you want to prove to yourself that your data-frames are sharing the same underlying buffer, just mutate them and you'll see the same change reflected across views:

>>> df
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df.loc[0, 'a'] = 99
>>> df
    a   b   c
0  99   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
    a  b  c
0  99  1  2
1   3  4  5
2   6  7  8
juanpa.arrivillaga
  • Excellent explanation! I believe it answers all the aspects about this question. One small detail that is still unclear to me is the 'boxed' procedure which 'borrow a term from other languages' as you mentioned in your answer. Could you further explain what is that? It will be fine to me if you use C/C++ terminology to explain. Thanks! – Y. Gao May 22 '18 at 01:53
  • @Y.Gao so, Python doesn't have primitive data types. Instead, everything is an object. You can think of `numpy.ndarray` objects as object-oriented wrappers around primitive arrays. Since the actual underlying buffers hold primitive data-types, to bring it into the Python interpreter level, it needs to be "boxed" into a Python object. This occurs every time you select an element from the numpy array, even if it is the same element over again. – juanpa.arrivillaga May 22 '18 at 02:36