
Let's say I have a dataframe like this:

df = pd.DataFrame({'A':[1,2,3,4],'B':[1,3,4,7]})
   A  B
0  1  1
1  2  3
2  3  4
3  4  7

When I assign some data to the transpose of a dataframe, there is no error, i.e.

df.T['C'] = 3

There is no change in the dataframe after running this.

But the question is: where is the data being stored? Why didn't it give any error? I was expecting either an error for this kind of assignment or an output like

   A  B
0  1  1
1  2  3
2  3  4
3  4  7
C  3  3

Neither happened when I ran df.T['C'] = 3.

Edit: as @Zero mentioned, we might have to do

df = df.T.assign(C=3).T # Which is like df.loc['C',:] = 3
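To check that the two spellings in the edit really give the same result, here is a minimal sketch. The names `via_assign` and `via_loc` are made up for illustration, and the comparison is only on labels and values, since dtypes may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 3, 4, 7]})

# Option 1: transpose, add a column to the transposed copy, transpose back.
via_assign = df.T.assign(C=3).T

# Option 2: add a row labelled 'C' directly (done on a copy to keep df intact).
via_loc = df.copy()
via_loc.loc['C', :] = 3

print(via_assign)
print(via_loc)
```

Both approaches rebind or modify a named object, which is why the new row survives, unlike with df.T['C'] = 3.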
Bharath M Shetty

3 Answers


df.T is a different object, so the changes you make to it are not reflected in the original df. Where is it? Since no variable points to it, it has either already been collected by the garbage collector or is waiting to be collected. Either way, you cannot access it.

What you can do is to create a new variable

transposed = df.T

transposed['C'] = 3

transposed
Out: 
   0  1  2  3  C
A  1  2  3  4  3
B  1  3  4  7  3   

The same thing happens when you call any method that returns a new DataFrame, for example df.drop(0)['C'] = 2, df.reset_index()['C'] = 3 or df.drop_duplicates()['C'] = 3. The original DataFrame always stays the same. Another DataFrame is created with that column assigned to it, but it becomes inaccessible as soon as the statement has executed because no variable points to it. For CPython's garbage collection, there is some useful information here.
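A small sketch of the point about methods that return new DataFrames. The names `before`, `unchanged` and `dropped` are only for illustration, and depending on the pandas version some of these lines may emit a warning.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 3, 4, 7]})
before = df.copy()

# Each call returns a new DataFrame; the assignment lands on that temporary.
df.drop(0)['C'] = 2
df.reset_index()['C'] = 3
df.drop_duplicates()['C'] = 3

# The original never saw any of those assignments.
unchanged = df.equals(before)
print(unchanged)

# Binding the result to a name keeps the modified object reachable.
dropped = df.drop(0)
dropped['C'] = 2
print(dropped)
```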


Edit from @Bharath:

(an explanation given by one of my teachers)

T returns a copy. That means new memory is allocated to store the new object. If you look up Python garbage collection, you'll find that each object in memory keeps a counter of how many references are pointing to it.

When garbage collection runs, it will find this object in memory and see that it has zero references. Because it has zero references, the garbage collector will reclaim the memory and the object is gone forever.

So it is recommended to keep a single pointer pointing to the object by assigning to a name (or variable).
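The reference-counting idea can be observed with a weak reference. This is a sketch that assumes CPython, where an object is reclaimed as soon as its last strong reference goes away; other interpreters may reclaim it later. `Payload` is a hypothetical stand-in class, not anything from pandas.

```python
import weakref


class Payload:
    """Stand-in for any object returned by a method call (hypothetical)."""


obj = Payload()
probe = weakref.ref(obj)   # a weak reference does not keep the object alive
alive_before = probe() is not None

del obj                    # drop the only strong reference
alive_after = probe() is not None

print(alive_before, alive_after)
```

On CPython the object is alive before the `del` and gone right after it, which is what happens to the result of df.T once the assignment statement has finished.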

Bharath M Shetty
ayhan
  • I knew about copy but garbage collector is new to me. I was curious why it was lost. And why wasn't any error being shown. Should we report that to pandas ? – Bharath M Shetty Nov 22 '17 at 13:01
  • could you point me to some pythonic example of data been sent to garbage? I mean any SO question regarding that – Bharath M Shetty Nov 22 '17 at 13:04
  • @Bharath In some sense it is like chained assignment but this might not be that common. It also does not raise any warning if you do `df.drop(0)['C'] = 2` (it also does not modify the df). Since you are doing these things on the fly, without assigning it to a variable, it might not even have a chance to raise a warning. As for the garbage collection, [this](https://stackoverflow.com/a/9449723/2285236) might be helpful. For the reference count you do not only consider the references you create though; the ones pandas creates are also important and somewhat unknown (you need to dig a little deep). – ayhan Nov 22 '17 at 13:35
  • Thank you ayhan thats a really nice intuitive explanation. And one more thing is my question too broad or too dumb to get a downvote? I didn't know any of this to be honest. It would be really nice if you add the sample explanation in your main answer . – Bharath M Shetty Nov 22 '17 at 13:39
  • @Bharath No, of course not. Python's way of handling variables/names is different from many languages and most people struggle with these references. I wouldn't read too much into the downvotes. There could be millions of reasons. They might be logical or illogical. Unless the voter explains himself/herself, I wouldn't care about it. – ayhan Nov 22 '17 at 13:46
  • @ayhan I do not think garbage collector has to do anything with this misuse of OOP. – Elis Byberi Nov 22 '17 at 13:49
  • @Bharath `df.T` is returning a new DataFrame and you are not storing it anywhere ;-). Use new_df = df.T instead. – Elis Byberi Nov 22 '17 at 13:57
  • @ElisByberi I knew and I too know pandas from quite some time :) Maybe I was unclear I wanted to know the reasons for no errors and where data went. – Bharath M Shetty Nov 22 '17 at 13:58
  • @ayhan Python's way of handling variables/names in this particular case is the same as in all programming languages. – Elis Byberi Nov 22 '17 at 14:36
  • @ElisByberi I believe the reason Bharath is confused and expecting the original df to change or a warning to be raised is because Python's way of handling names leads to unexpected modifications for people who are new to the language and they start to expect the unexpected in different situations. This is not something I base my answer on, just an idea why he is confused. – ayhan Nov 22 '17 at 15:47
  • @ayhan I understood what you did but it is related with OOP only. It is because OOP is not expected to behave in this way. This way of using methods is called function programming. `OOP says that bringing together data and its associated behavior in a single location (called an “object”) makes it easier to understand how a program works. FP says that data and behavior are distinctively different things and should be kept separate for clarity.` Mixing these paradigms makes code confusing! – Elis Byberi Nov 22 '17 at 16:06
  • @ayhan I understood even more better from the explanation given by my teacher I added that here. – Bharath M Shetty Nov 22 '17 at 17:50
  • @Bharath Talking about garbage collector is off-topic in this particular case. It is about misuse of OOP. Your teacher is right about garbage collector but it has nothing to do with how `DataFrame.T()` method does work. – Elis Byberi Nov 22 '17 at 17:52
  • @ElisByberi I wasn't keeping track of the comments but I wanted the garbage collector answer: why I was losing data. I agree it's a misuse of OOP. It's not just T, there might be more like this, but that answered most of my current and future questions : ) – Bharath M Shetty Nov 22 '17 at 17:54
  • @Bharath Again, this is not why you were losing data. You were losing data because you weren't storing it in a variable. It is like writing: `lambda x: x`, it must be `f = lambda x: x`. – Elis Byberi Nov 22 '17 at 17:56
  • Sir all of these comments would fit right inside your answer. That would help future users. Its just getting messed and lengthy in comment section : ). You are more into functional programming – Bharath M Shetty Nov 22 '17 at 17:59
  • @Bharath These comments does not fit in my answer because `return` statement does explain it all! Have a nice day all of you! – Elis Byberi Nov 22 '17 at 18:05
  • @ElisByberi The fact that almost all methods on DataFrames return new objects is a design choice and in my opinion it's a good design choice. It allows chaining methods together which is very useful in data cleaning process in which pandas excels. You'll see similar implementations in different data cleaning tools (dplyr of R, for example). – ayhan Nov 22 '17 at 20:16
  • @ayhan OOP method chaining is done by returning self. A simple method definition would be: `def T(self, *args): self._dataframe = transposed_dataframe; return self`. and use it like `class.T().method().anothermethod()`. I do not see any reason to mess up it with "FP using classes". – Elis Byberi Nov 22 '17 at 20:30
  • @ElisByberi So if I want to calculate an aggregated sum I should just go ahead and modify the original DataFrame? That would definitely mess up my workflow. Half of my code would contain `copy()`s. – ayhan Nov 22 '17 at 20:41
  • @ayhan Good point! This is a misuse of OOP too. **If a method is not object oriented**, why do you call it a method? Declare it static or put it out of class definition. – Elis Byberi Nov 22 '17 at 20:55
  • @ElisByberi I think we are running in circles. :) Why would you want to write `filter(groupby(reset_index(set_index(drop(rename(df, arg), arg), arg), arg), arg), arg)` while you can write `df.rename(arg).drop(arg).set_index(arg).reset_index(arg).groupby(arg).filter(arg)`. That's unintuitive (you are calling the functions in reverse order) and very hard to read/track. – ayhan Nov 22 '17 at 21:15

T is a property that calls transpose, which returns super(DataFrame, self).transpose(1, 0, **kwargs).
It creates another DataFrame.
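A quick way to see this for yourself, as a sketch: the names `first`, `second`, `t_is_property` and `distinct` are made up for illustration. Note that T is accessed without parentheses, so every `df.T` runs the transpose again and hands back a fresh object.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1, 3, 4, 7]})

# On the class, T is a property object, so df.T runs code on every access.
t_is_property = isinstance(pd.DataFrame.T, property)

first = df.T
second = df.T
distinct = first is not second   # two accesses, two separate DataFrames

# Adding a column to one result affects neither the other result nor df.
first['C'] = 3

print(t_is_property, distinct)
print(first)
print(second)
```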

Elis Byberi
  • @Bharath It is not an explanation but a hint: Method `transpose` in parent class of `DataFrame` does this instead `return self._constructor(new_values, **new_axes).__finalize__(self)` This is an inplace transposing. – Elis Byberi Nov 22 '17 at 13:43
  • @Bharath This is pandas's documentation for method `T()`: `DataFrame.T, Transpose index and columns`. This is ridiculous! Hahahaha! – Elis Byberi Nov 22 '17 at 13:51

Adding to the existing answers, I'd like to draw your attention to the uncanny similarity between -

df

   A  B
0  1  1
1  2  3
2  3  4
3  4  7

df.T['C'] = 3

df

   A  B
0  1  1
1  2  3
2  3  4
3  4  7

And a similar case with Python lists -

l = [1, 2, 3, 4, 5]
l[:].append(6)

l
[1, 2, 3, 4, 5]

What happens in both cases is that a new object is created! The operation is then applied to that newly created object, following which, that object is garbage collected since there are no active references pointing to it. You see that with this -

import sys

sys.getrefcount(df.T)
1

There's only one reference to that object (the reference at that point of time, which is subsequently lost). This becomes easy to understand once you accept the fact that df.T returns a completely new object (I've said this already, but I'm trying to drive home the point) -

id(df.T)
4612098928

id(df.T)
4612098872

id(df.T)
4612098592

In summary, you are attempting to modify a fresh object to which you have no reference, and you do not see any changes to the original because you did not make any.

cs95
  • Code to demonstrate and a snippet to explain what's happening always makes for an awesome answer. Thank you. – Bharath M Shetty Nov 25 '17 at 10:20
  • @Bharath The truth is that there is no reference at all to the newly created object. Garbage collector is not to blame that you did "lose" the object. I will create a question and give an answer by myself (answers from other peers are welcome too). I am really tired explaining this over and over to anyone asking about this. That question of mine would be a good reference for everybody. I will ping you when it will be ready. – Elis Byberi Nov 25 '17 at 12:21
  • @ElisByberi `The truth is that there is no reference at all to the newly created object`. This is want Ahyan said in his answer and what my teacher said. And I`m curious what you might put out there. – Bharath M Shetty Nov 25 '17 at 12:36
  • @ElisByberi Why do you think Bharath did not understand and needed you to reiterate what has already been mentioned? – cs95 Nov 25 '17 at 23:32
  • @cᴏʟᴅsᴘᴇᴇᴅ I think he understood it (I understood it from his last comment here). I did repeat to Bharath that talking about garbage collector is off-topic rather than redundant in this particular case. I will explain it in one of my questions in near future. – Elis Byberi Nov 26 '17 at 09:58