I'm dealing with a large dataframe (~100,000x1000) that I eventually output using df.to_csv(). All the inputs I turn into this large dataframe come transposed relative to the output, so the dataframe I build ends up transposed relative to the output as well. At the very end I transpose: df.T.to_csv(). I know that df.T returns the transposed dataframe, which leads to my question: does not saving df.T "help" my memory usage? Phrased differently, is df.T.to_csv() better than running dfT=df.T and dfT.to_csv() separately? Aside from memory, are there any advantages to one method over the other?

In summary, which method is better, and why?

method 1:

df.T.to_csv()

method 2:

dfT=df.T
dfT.to_csv()
noah

1 Answer

Overall, the two approaches are practically identical for this use case. Either way, the transpose has to be calculated and stored in memory before to_csv() can act on it. The only real difference is in what happens after this line of code runs.

In the first case, df.T.to_csv() calculates and stores the transposed dataframe and writes it to file. Nothing else refers to that temporary object, so its memory is free to be reclaimed as soon as the call returns.

In the second case, because you've assigned it to a name, the transposed dataframe and its memory are kept for as long as that name refers to it, which in a plain script usually means until the script finishes running. The only real "advantage" I can think of to the second method is that you can reuse the transposed dataframe for other things if you need to.

This certainly holds true in my test case (using the memit magic from the memory_profiler package in a Jupyter notebook):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10000, 10000))

%%memit
df.T.to_csv('test_transpose.csv')

peak memory: 929.00 MiB, increment: 34.18 MiB

%%memit
dfT=df.T
dfT.to_csv('test_transpose.csv')

peak memory: 929.84 MiB, increment: 33.66 MiB

And, using timing instead of memory profiling:

%%timeit
df.T.to_csv('test_transpose.csv')

2min 49s ± 6.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
dfT=df.T
dfT.to_csv('test_transpose.csv')

2min 51s ± 4.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
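
Outside a notebook, where the %%memit magic is not available, a similar comparison could be sketched with the standard-library tracemalloc module. This is only an illustration under stated assumptions: the helper peak_bytes, the functions method_1 and method_2, the small frame size, and writing to an in-memory buffer instead of a file are all choices made for the example, not part of the test above. It also measures Python-level allocations rather than process memory, so the numbers are not directly comparable to the memit figures.

```python
import io
import tracemalloc

import numpy as np
import pandas as pd


def peak_bytes(func):
    """Run func() and return the peak traced allocation in bytes (hypothetical helper)."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Small illustrative frame; the shape is an arbitrary assumption.
df = pd.DataFrame(np.random.rand(200, 50))


def method_1():
    # Transpose and write in one expression; the transpose is never named.
    buf = io.StringIO()
    df.T.to_csv(buf)
    return buf.getvalue()


def method_2():
    # Bind the transpose to a name first, then write it.
    buf = io.StringIO()
    dfT = df.T
    dfT.to_csv(buf)
    return buf.getvalue()


# Both methods should serialize exactly the same CSV text.
same_output = method_1() == method_2()

peak_1 = peak_bytes(method_1)
peak_2 = peak_bytes(method_2)
print(f"same output: {same_output}")
print(f"method 1 peak: {peak_1} B, method 2 peak: {peak_2} B")
```

Wrapping each method in a function keeps dfT local, so its reference disappears when the function returns and one measurement does not carry memory over into the next.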
G. Anderson
  • "or until you explicitly tell it to remove the object from memory." Not really a thing in Python. – juanpa.arrivillaga Jun 05 '19 at 18:32
  • @juanpa.arrivillaga You can `del`ete it, and [force garbage collection](https://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python), though I suppose in that case you could also just force garbage collection after running the first method with the same result. Edited into my answer, thanks for pointing that out. – G. Anderson Jun 05 '19 at 18:42
  • No, my point is that `del` does not delete *objects*, it deletes *names*. Python does not expose any way to directly manipulate the memory allocation/deallocation of objects. – juanpa.arrivillaga Jun 05 '19 at 20:23
  • Maybe I'm mistaken, but if you remove all references to an object (set the ref count to 0) then force GC, which (according to my understanding) should clear all allocated memory with 0 refs, does that not effectively accomplish object deletion? Again, not trying to disagree, trying to increase my understanding. – G. Anderson Jun 05 '19 at 20:33
  • No, that accepted answer is terrible. The `gc` module only controls the cyclic garbage collector, which isn't relevant here (it only handles reference cycles). When a ref count goes to zero, the memory is reclaimed *immediately*. This *can* happen with `del`, but that isn't what `del` does. – juanpa.arrivillaga Jun 05 '19 at 20:37
  • So, to rephrase, `del` _allows_ the memory to be reclaimed, but _whether_ and _when_ it is actually reclaimed is up to system processes completely out of the user's control. Is that accurate? – G. Anderson Jun 05 '19 at 20:44
  • Yes, basically `del` deletes *references*. In CPython, objects get reclaimed *immediately* when reference counts reach zero, but that is an implementation detail. Jython, for example, does not work that way, and uses the JVM garbage collector – juanpa.arrivillaga Jun 05 '19 at 21:08
  • Thanks a lot for your help! I've removed the mistaken information from the answer, and I'm one of today's [lucky 10,000](https://xkcd.com/1053/) – G. Anderson Jun 05 '19 at 21:19
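
The distinction drawn in the comments, that `del` removes a name rather than an object and that CPython reclaims an object once its last reference is gone, can be seen in a minimal sketch. The class Payload, its make_copy method, and the helper is_alive are hypothetical stand-ins invented for this illustration (Payload plays the role of a dataframe and make_copy the role of df.T). The immediate-reclamation results are a CPython implementation detail and may differ on other interpreters such as Jython.

```python
import weakref


class Payload:
    """Hypothetical stand-in for a large object such as a dataframe."""

    def make_copy(self):
        # Plays the role of df.T: returns a brand-new object.
        return Payload()


def is_alive(ref):
    """Return True if the weakly referenced object still exists."""
    return ref() is not None


original = Payload()

# Method 1 analogue: the result is never bound to a name, so its last
# reference disappears as soon as the expression has been evaluated.
probe_temp = weakref.ref(original.make_copy())
temp_alive = is_alive(probe_temp)

# Method 2 analogue: the result is bound to a name, which keeps it alive.
copy_named = original.make_copy()
probe_named = weakref.ref(copy_named)
alive_while_named = is_alive(probe_named)

# del removes a name, not the object: a second name still refers to it.
second_name = copy_named
del copy_named
alive_after_first_del = is_alive(probe_named)

# Once the last name is gone, CPython reclaims the object immediately,
# without any call into the gc module.
del second_name
alive_after_last_del = is_alive(probe_named)

print(temp_alive, alive_while_named, alive_after_first_del, alive_after_last_del)
```

A weak reference is used as the probe because it observes the object without keeping it alive, so it does not change the reference count being demonstrated.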