
I have a live feed of logging data coming in through the network. I need to calculate live statistics, like the one in my previous question. How would I design this module? It seems unrealistic (read: bad design) to re-apply a groupby function to the entire df every single time a message arrives. Can I just update one row and have its calculated column auto-update?

JFYI, I'd be running another thread that reads values from the df and prints them to a webpage every 5 seconds or so.

Of course, I could run groupby-apply every 5 seconds instead of doing it in real time, but I thought it'd be better to keep the df and the calculation independent of the printing module.

Thoughts?
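The "update one row and its stat auto-updates" idea amounts to keeping running aggregates per key and updating only the key that changed, O(1) per message. A minimal sketch (class and field names are invented for illustration, not from the question):

```python
from collections import defaultdict

class RunningStats:
    """Incremental per-item aggregates: update one key per message
    instead of re-running groupby-apply over the whole df."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, item, value):
        # Called once per incoming log message.
        self.count[item] += 1
        self.total[item] += value

    def mean(self, item):
        return self.total[item] / self.count[item]

stats = RunningStats()
for item, value in [("A", 1.0), ("A", 3.0), ("B", 5.0)]:
    stats.update(item, value)
print(stats.mean("A"))  # 2.0
```

The printing thread can then read from this structure every 5 seconds, keeping the calculation independent of the display, as the question wants.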

Lelouch Lamperouge
  • You could just filter on the group that has actually been updated and then call the function on that group alone. Alternatively, it might be better to create a dataframe for each group and call your function on it; it depends on the characteristics of the data – EdChum Jun 08 '14 at 20:13
    I don't think pandas is a good choice for data that is constantly growing. Changing the size of a pandas structure is inefficient. If you're always grouping on the same key, you may be better off using a dictionary that maps keys to groups (stored as lists, say, or possibly individual Series/DataFrames). How large is your data? – BrenBarn Jun 08 '14 at 20:14
  • @EdChum, the data is just 'ITEM': Timestamp, nothing more, nothing less. – Lelouch Lamperouge Jun 08 '14 at 20:52
  • @BrenBarn, I see... The data starts at about 1000 rows and then grows at 200 rows/second – Lelouch Lamperouge Jun 08 '14 at 20:54
  • @LelouchLamperouge still I think BrenBarn is correct. If you are just updating existing data then the performance hit may not be an issue; it sounds like all you'd be doing is some stats on the updated values. Note that groupby itself does nothing, only when you apply a function does it do something. If you know which group is to be updated then you can call `get_group('updated_item')` and call apply on just that group, see: http://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key – EdChum Jun 08 '14 at 20:59
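EdChum's `get_group` suggestion, sketched against the 'ITEM': Timestamp layout described above (the items and timestamps here are made up; the stat computed is just one example):

```python
import pandas as pd

# Toy feed: one row per message, keyed by ITEM.
df = pd.DataFrame({
    "ITEM": ["a", "b", "a", "c"],
    "Timestamp": pd.to_datetime([
        "2014-06-08 20:00", "2014-06-08 20:01",
        "2014-06-08 20:02", "2014-06-08 20:03",
    ]),
})

# Suppose a message just arrived for item 'a': recompute stats
# for that one group instead of apply-ing over every group.
grouped = df.groupby("ITEM")
a_rows = grouped.get_group("a")          # only the rows for the updated key
span = a_rows["Timestamp"].max() - a_rows["Timestamp"].min()
print(span)  # 0 days 00:02:00
```

Note that the `groupby` object must be rebuilt after rows are appended; only the per-group apply work is saved.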

1 Answer


groupby is pretty damn fast, and if you preallocate slots for new items you can make it even faster. In other words, try it and measure it for a reasonable amount of fake data. If it's fast enough, use pandas and move on. You can always rewrite it later.
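To make "try it and measure it" concrete, a throwaway benchmark along these lines is enough to decide (the schema, key set, and row count are invented stand-ins for the real feed):

```python
import time
import numpy as np
import pandas as pd

# Fake a few minutes of feed at ~200 rows/second.
n = 200_000
df = pd.DataFrame({
    "ITEM": np.random.choice(list("abcdefgh"), size=n),
    "value": np.random.rand(n),
})

start = time.perf_counter()
stats = df.groupby("ITEM")["value"].mean()  # the stat you actually need
elapsed = time.perf_counter() - start
print(f"groupby-mean over {n} rows: {elapsed * 1000:.1f} ms")
```

If that number is far under your 5-second display interval, running the full groupby on a timer is simpler than any incremental scheme.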

U2EF1