
We all know that assigning a pandas DataFrame to a variable only creates a reference, not a new instance. However, what if I assign a variable 't' to a tuple that consists of a string and a pandas DataFrame, as follows:

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
t = ('example', df)

When I do:

new=t[1]

Is the variable new a reference to the original object 'df' (i.e. mutable and exposed), or does it now refer to a new instance (i.e. df is immutable in this case)?

Thank you.

cs95
user7786493
  • `new` is another pointer to the same dataframe that `df` points to. And that dataframe is mutable. For example, when I do `new is df` I get `True`. And `id(new) == id(df)` is also `True`. – piRSquared Aug 24 '17 at 14:45
  • It depends: on my machine `new` is a copy, but it may be a view. Also, why do you think it's immutable? – EdChum Aug 24 '17 at 14:45
  • Unless you use `df.copy()`, it will refer back to the same dataframe the variable is assigned to, no matter where you store the dataframe. As @pir said, it's another pointer, that's it. – Bharath M Shetty Aug 24 '17 at 14:46
  • @Bharathshetty that's not always the case: I did `new = new * 2` and `df` isn't modified, so it really depends – EdChum Aug 24 '17 at 14:48
  • @EdChum the example you cited `new = new * 2` I wouldn't expect to alter `df`. `new * 2` creates a copy and you assign it to `new` which overwrites what was there. `new.loc[:] = new * 2` would be a different story. – piRSquared Aug 24 '17 at 14:50
  • @pir, I am asking this question because I thought a tuple (similar to a string) always gives you a new object, unlike a list. So I thought that retrieving a DataFrame from a tuple would create a new object without needing to call .copy() – user7786493 Aug 24 '17 at 14:50
  • @EdChum I didn't know it depends on machine too. Thank you. It would be nice to know how it depends. Like how and why it differs from one system to another? – Bharath M Shetty Aug 24 '17 at 14:51
  • @piRSquared that's true but it's difficult to capture all the use cases when making an assumption that `new` is always a reference or copy so context matters – EdChum Aug 24 '17 at 14:51
  • @Bharathshetty I think @piRSquared pointed out that this will create a new object and overwrite, but if you did `new.loc[1,1] = 10` then this would affect the orig df. The point here is that context matters and it becomes ambiguous unless you explicitly take a `copy()` – EdChum Aug 24 '17 at 14:53
  • In regards to the tuple of mutable objects. [**See This Question**](https://stackoverflow.com/q/9755990/2336654) – piRSquared Aug 24 '17 at 14:54
  • I appreciate the prompt and active responses. While treasuring the great benefit of pandas.DataFrame object, I also find the limitations of it in terms of the ambiguity of its mutability. Can I safely say that it is always a good practice to do a copy() to avoid mutation? In general does doing copy() every time significantly reduce performance of a program? – user7786493 Aug 24 '17 at 14:55
  • @user7786493 I think you are mixing the concepts of mutability, references, and views. You are using the term mutability but it sounds like you are concerned with altering the contents of a first dataframe when changing the contents of a second dataframe that was created from the first. If I'm correct, then using `df2 = df1.copy()` is your guaranteed solution. – piRSquared Aug 24 '17 at 15:04
  • Aggregated all queries and responses in a [community wiki](https://stackoverflow.com/a/45865052/4909087). Please feel free to edit. – cs95 Aug 24 '17 at 15:06
  • @user7786493 Do you have any more questions? – cs95 Aug 24 '17 at 15:19
  • @cᴏʟᴅsᴘᴇᴇᴅ personally do not recommend `inplace=True` flag – BENY Aug 24 '17 at 15:29
  • @Wen Me neither! I only mentioned it for the sake of completeness in my answer. – cs95 Aug 24 '17 at 15:30
  • @cᴏʟᴅsᴘᴇᴇᴅ I suffered a lot from it when combined with `chained indexing` :( – BENY Aug 24 '17 at 15:34
  • @COLDSPEED why don't you recommend inplace=True? I thought it saves the step of reassigning a variable after an operation, e.g. df.drop(..., inplace=True) instead of df = df.drop(...) – user7786493 Aug 24 '17 at 15:38

1 Answer


> Is the variable new a reference to the original object 'df' (i.e. mutable and exposed) or it is now referred to a new instance?

Why don't you just...

In [516]: id(df)
Out[516]: 4481803432

In [517]: id(t[1])
Out[517]: 4481803432

> I thought tuple (similar to the behavior string) that it always gives you a new object unlike list...

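Wrong. The key difference between tuples and lists is that tuples are immutable: you cannot add, remove, or rebind their elements. Both are containers that hold references to the same underlying objects, so neither one copies what you put into it.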
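To make the container point concrete, here is a minimal sketch. It uses a plain dict as a stand-in for any mutable object (an assumption made only to keep the example dependency-free; a DataFrame stored in a tuple or list is referenced the same way). The names `data`, `as_tuple`, and `as_list` are illustrative.

```python
# Stand-in mutable object; a DataFrame would be referenced the same way.
data = {"a": [1, 2, 3]}

as_tuple = ("example", data)
as_list = ["example", data]

# Both containers hold a reference to the very same object.
same_from_tuple = as_tuple[1] is data
same_from_list = as_list[1] is data

# Mutating the object through one name is visible through all of them.
as_tuple[1]["a"].append(4)

# What the tuple forbids is rebinding its slots, not mutating their contents.
try:
    as_tuple[1] = {}
    rebind_allowed = True
except TypeError:
    rebind_allowed = False
```

The tuple's immutability protects the tuple itself, not the objects it refers to.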

> So I thought if we call a DataFrame in a tuple will create a new object without having the need to do the .copy()

It does not. You will need to explicitly call .copy() if you want a copy. Otherwise you're working with the same reference.
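A short sketch of the difference, reusing the `df` and `t` from the question. The names `alias` and `independent` are illustrative, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
t = ("example", df)

alias = t[1]               # same object as df, no copy is made
independent = t[1].copy()  # a new DataFrame with its own data

alias.loc[0, 0] = 100        # writes into the object df also names
independent.loc[1, 1] = -1   # writes into the copy only

alias_is_df = alias is df
copy_is_df = independent is df
df_first_cell = int(df.loc[0, 0])
df_middle_cell = int(df.loc[1, 1])
```

Here the write through `alias` shows up in `df`, while the write to `independent` leaves `df` untouched.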

> Can I safely say that it is always a good practice to do a copy() to avoid mutation?

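Not really, because most DataFrame methods (drop, fillna, sort_values, and so on) return a new DataFrame and leave the original untouched unless you explicitly request otherwise (such as with the inplace=True flag). The exception is assignment through an indexer, such as new.loc[...] = value or new[col] = value: that modifies the object in place, so every name that refers to it sees the change. If you plan to assign into a DataFrame and need the original preserved, call .copy() first.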
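The three cases that came up in the comments (a method call, rebinding a name, and assigning through `.loc`) can be sketched as follows. This is an illustration only; the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# 1. A method call returns a new DataFrame; df keeps all its columns.
dropped = df.drop(columns=[0])
cols_after_drop = list(df.columns)

# 2. Rebinding a name: new * 2 builds a fresh DataFrame and the name
#    new is pointed at it, so df is not affected.
new = df
new = new * 2
value_after_rebind = int(df.loc[0, 0])

# 3. Assigning through an indexer mutates the shared object in place.
alias = df
alias.loc[0, 0] = 10
value_after_loc = int(df.loc[0, 0])
```

Only the third case changes `df`, which is why the "is it a reference or a copy" question depends on what you do with the name afterwards.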

You should know that setting inplace=True generally does not improve performance: for many methods a copy is still created internally and then assigned back to the original.

cs95