Python Pandas global vs passed variable

Question

I am creating a "real-time" process that takes data from a proprietary formatted OHLCVTBA file being updated by SierraChart. Code that reads the data and creates a dataframe using a generator is posted on pastebin. [removed dead link].

I have realized that my structure (new data driven) is wrong and I'm about to reorganize it. PhE's question and Wes's response have taken me in the direction of filling a pre-populated dataframe which works well. My questions here are:

Is it faster to hold my dataframe and pointers as global variables or to pass them to and from the various functions that use them? Also, are there other considerations that should drive this choice?

Thanks.

score 5 · Accepted Answer · edited May 23 '17 at 12:18

5

Local variables are faster to access than global variables in python.

In the context of pandas, this means you should be passing variables into functions where this makes sense (it means they can be found quicker inside the function). Conversely, function calls in python are expensive (if you are calling them lots), which is why numpy/pandas use vectorised functions where possible. Obviously, you have to be careful to ensure all your calculations are done inplace if your doing things inside a function.

I would usually get things working first, in "pythonic"/"pandastic" way, before worrying about speed. Then use %timeit and see if it's fast enough already (usually it is). Add a unittest(s). Tweak for speed, %timeit, %prun and %timeit some more. If it's a big project vbench.

edited May 23 '17 at 12:18

Community

1
1

answered Jun 05 '13 at 09:48

Andy Hayden

359,921
101
625
535

It's great, but does not take into account the passing aroung of many variables (but profiling will, of course). – Elazar Jun 05 '13 at 10:10
@Elazar that's true... otherwise you can be shooting in the dark :) (rather than aiming for the slowest things.) – Andy Hayden Jun 05 '13 at 10:31
Thanks Andy. I have converted my code to Class Form and will add an optional log of the core process timing to confirm that processes operate within the data's timing. [Here's the updated version](http://pastebin.com/Gyy6MNu3) The classes do make it clearer and easier to maintain so it'll be interesting to check the timing when I reach that stage. – John 9631 Jun 05 '13 at 23:54

Elazar · Answer 2 · 2013-06-05T10:12:05.390

2

You will need to profile it, but my guess is that if there is any significant difference at all, it is in favor of globals. The reference is still in memory, and no reference counting happens.

(EDIT: anyway, see @Andy Hayden's link about their relative access time, and the link here, which says that local variables are much faster).

The main consideration is that of "software engineering" - using global data is a bad idea, since it's hard to follow when and where it is being changed. Of course, if you can't fulfill the requirements (runtime) otherwise, then it has to be done; but in order to know it - measure first.

Anyway, I would recommend a different solution - keep this data inside a class. It will cost one more dictionary lookup (the first lookup is the variable name, and it happens anyway; the second is the lookup in the class dict), but it may be more efficient than passing around many objects, and will help the organization of your program.

edited Jun 05 '13 at 10:12

answered Jun 04 '13 at 23:32

Elazar

20,415
4
46
67

Thanks for the response. Ok. I'll set up a simple 1000x version and profile it with iPython. --- Regarding the class - if I set it up in a class(es) including the DF, pointers etc, won't I still have the same overheads but a prettier oo orientated structure (which may be enough reason to do it)? – John 9631 Jun 04 '13 at 23:45
It will be like a single object passed around, instead of many (if that's your problem). and yes, it will cost two dictionary lookups instead of one. – Elazar Jun 04 '13 at 23:48
Tested: it added 61ns to the process or 0.2% to these short functions so it is negligible. http://i.imgur.com/Hzx4zwX.png – John 9631 Jun 05 '13 at 00:27
@John9631 OK, make sense. Keep in mind that this test is for small number of objects, which will fit in the cache. – Elazar Jun 05 '13 at 00:30
Thanks. I will build as Classes (one for the dataframe and one for the file) and pass those. If I later run into issues I can always retrospectively break them up. I'd replace the negative 1 but my reps not yet high enough. – John 9631 Jun 05 '13 at 00:46

Python Pandas global vs passed variable

2 Answers2

Linked