
Let's say I take a stream of incoming data (very fast) and I want to compute various stats (standard deviation, etc.) over a window of, say, the last N samples, with N being quite large. What's the most efficient way to do this with Python?

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random_sample(200000000))
df2 = df.append([5])

This crashes my REPL environment in Visual Studio.

Is there a way to append to an array without this happening? And is there a way to tell which operations on a DataFrame are computed incrementally, other than by timing them with timeit?

Blaze
  • Generally np and pandas perform well when the array is not growing; by repeatedly appending to it you will periodically force it to allocate a new memory block and copy the values, which may explain why it borks when you append just a single element. Does this post help you: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – EdChum Oct 04 '15 at 18:44
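
To illustrate the reallocation cost described in that comment, here is a minimal sketch (the sample count is arbitrary) comparing repeated np.append calls against writing into a preallocated array:

import time
import numpy as np

n = 50000  # arbitrary sample count, just for the comparison

# growing an array: every np.append allocates a new array and copies all existing values
start = time.time()
grown = np.array([])
for x in range(n):
    grown = np.append(grown, x)
print("growing:      %.2fs" % (time.time() - start))

# preallocating: values are written in place, no reallocation or copying
start = time.time()
prealloc = np.empty(n)
for i in range(n):
    prealloc[i] = i
print("preallocated: %.2fs" % (time.time() - start))

The preallocated loop should be dramatically faster, which is what motivates the circular-buffer approach in the answer below.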

1 Answer


I recommend building a circular buffer out of a numpy array.

This will involve keeping track of an index to your last updated point, and incrementing that when you add a new value.

import numpy as np

circular_buffer = np.zeros(200000000, dtype=np.float64)
head = 0

# stream input: for each new value that arrives, write it at `head` and advance
for new_value in stream:  # `stream` is whatever source your samples come from
    circular_buffer[head] = new_value

    # wrap the write index back to the start once the end of the buffer is reached
    if head == len(circular_buffer) - 1:
        head = 0
    else:
        head += 1

Then you can compute the statistics normally on circular_buffer.
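
For instance, a quick sketch of pulling statistics out of the buffer, including a window of the most recent samples (window_size and the wrap-around handling here are my own additions, not part of the recipe above):

# statistics over the whole buffer (assumes the buffer has been filled at least once)
print(circular_buffer.mean(), circular_buffer.std())

# statistics over the most recent `window_size` samples, accounting for wrap-around;
# `head` points at the next write position, so the newest value is at head - 1
window_size = 1000000  # placeholder window length
start = head - window_size
if start >= 0:
    window = circular_buffer[start:head]
else:
    # window wraps: oldest part sits at the end of the buffer, newest at the front
    window = np.concatenate((circular_buffer[start:], circular_buffer[:head]))

print(window.mean(), window.std())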

If You Don't Have Enough RAM

Try implementing something similar with bquery and bcolz. These store your data more efficiently than numpy (using compression) and offer similar performance. bquery now has mean and standard deviation. Note: I'm a contributor to bquery.
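
A rough sketch of what that might look like (assuming bcolz's carray API — carray, append, flush, and slicing back to numpy — with placeholder sizes and a placeholder on-disk path):

import numpy as np
import bcolz

# compressed, disk-backed container for the incoming samples
samples = bcolz.carray(np.empty(0, dtype=np.float64), rootdir='samples.bcolz', mode='w')

# stream input: append values in batches rather than one at a time
batch = np.random.random_sample(100000)  # stand-in for a chunk of incoming data
samples.append(batch)
samples.flush()

# slicing a carray returns a plain numpy array, so the usual stats apply
window = samples[-50000:]  # most recent N samples; N is a placeholder
print(window.mean(), window.std())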

Waylon Flinn