deque in python pandas

Question

I am using Python's deque() to implement a simple circular buffer:

from collections import deque
import numpy as np

test_sequence = np.array(list(range(100))*2).reshape(100,2)  # list() is needed on Python 3
mybuffer = deque(np.zeros(20).reshape((10, 2)))

for i in test_sequence:
    mybuffer.popleft()
    mybuffer.append(i)

    do_something_on(mybuffer)

I was wondering if there's a simple way of obtaining the same thing in Pandas using a Series (or DataFrame). In other words, how can I efficiently add a single row at the end and remove a single row at the beginning of a Series or DataFrame?

Edit: I tried this:

import pandas as pd

myPandasBuffer = pd.DataFrame(columns=('A','B'), data=np.zeros(20).reshape((10, 2)))
newpoint = pd.DataFrame(columns=('A','B'), data=np.array([[1,1]]))

for i in test_sequence:
    newpoint[['A','B']] = i
    # drop the first row and append the new one (.iloc replaces the removed .ix indexer)
    myPandasBuffer = pd.concat([myPandasBuffer.iloc[1:], newpoint], ignore_index=True)

    do_something_on(myPandasBuffer)

But it's painfully slow compared to the deque() method.

I doubt whether it's more efficient to do this in pandas. There is no built-in queue behaviour as far as I know (but you could write your own wrapper around a pandas DataFrame using the concat method and/or index slices) — dorvak, Nov 20 '13 at 09:28
Hey Andy, thank you for your reply. What do you mean exactly? Could you post an example? Thanks — Fra, Nov 23 '13 at 01:45

score 8 · Accepted Answer · answered Sep 04 '16 at 20:15

As noted by dorvak, pandas is not designed for queue-like behaviour.

Below I've replicated the simple insert function from deque in pandas dataframes, numpy arrays, and also in hdf5 using the h5py module.

%timeit shows (unsurprisingly) that collections.deque is by far the fastest, followed by numpy, then h5py, with pandas the slowest.

from collections import deque
import pandas as pd
import numpy as np
import h5py

def insert_deque(test_sequence, buffer_deque):
    # drop the oldest item and append the new one, both O(1)
    for item in test_sequence:
        buffer_deque.popleft()
        buffer_deque.append(item)
    return buffer_deque

def insert_df(test_sequence, buffer_df):
    # shift every row up by one, then overwrite the last row
    for item in test_sequence:
        buffer_df.iloc[0:-1,:] = buffer_df.iloc[1:,:].values
        buffer_df.iloc[-1] = item
    return buffer_df

def insert_arraylike(test_sequence, buffer_arr):
    # same shift-and-overwrite approach; works for numpy arrays and h5py datasets
    for item in test_sequence:
        buffer_arr[:-1] = buffer_arr[1:]
        buffer_arr[-1] = item
    return buffer_arr

test_sequence = np.array(list(range(100))*2).reshape(100,2)

# create buffer arrays
nested_list = [[0]*2]*5
buffer_deque = deque(nested_list)
buffer_df = pd.DataFrame(nested_list, columns=('A','B'))
buffer_arr = np.array(nested_list)

# calculate speed of each process in ipython
print("deque : ")
%timeit insert_deque(test_sequence, buffer_deque)
print("pandas : ")
%timeit insert_df(test_sequence, buffer_df)
print("numpy array : ")
%timeit insert_arraylike(test_sequence, buffer_arr)
print("hdf5 with h5py : ")
with h5py.File("h5py_test.h5", "w") as f:
    f["buffer_hdf5"] = np.array(nested_list)
    %timeit insert_arraylike(test_sequence, f["buffer_hdf5"])

The %timeit results:

deque          : 34.1 µs per loop
pandas         : 48 ms per loop
numpy array    : 187 µs per loop
hdf5 with h5py : 31.7 ms per loop

Notes:

My pandas slicing method was only slightly faster than the concat method listed in the question.

The hdf5 format (via h5py) did not show any advantages. I also don't see any advantages of HDFStore, as suggested by Andy.
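
Since deque wins so clearly, one option is to stay with it and let it do the dropping for you. The following is only a minimal sketch, not part of the benchmark above: it assumes a fixed buffer length of 10 and uses made-up two-column rows. The column() helper is a hypothetical name of my own, added to show one way to pull a single column out of a deque of rows.

```python
from collections import deque

# With maxlen set, append() discards the oldest item automatically,
# so no explicit popleft() is needed.
mybuffer = deque([(0, 0)] * 10, maxlen=10)

# Hypothetical input: 100 two-column rows.
for i in range(100):
    mybuffer.append((i, i))

def column(buf, j):
    """Return column j of a deque of rows as a plain list (oldest first)."""
    return [row[j] for row in buf]

first_column = column(mybuffer, 0)
```

If a DataFrame or array is needed for the downstream step, it can be built from list(mybuffer) at that point, at the cost of a copy each time.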

This further supports my finding that utilizing stack memory for 2D data structures is not well supported in Python. I too love using a maxlen deque for circular buffers, but there are often times when I have a 2D deque and just want to grab a single column. I then end up writing a bunch of functions to do this with for loops, or pushing and popping a 2D list. Either way I am stuck on one side of a trade-off. I would love to see a data structure with the efficiency of a deque and the ease of slicing of a pandas DataFrame or numpy array. Perhaps a PEP is in order? — jacob, Nov 04 '19 at 20:41
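
One way to get closer to that combination is a preallocated numpy array with a moving start index, so that appending overwrites one row instead of shifting the whole buffer. This is a rough sketch only: RingBuffer2D and its methods are hypothetical names, not an existing library API, and it has not been timed against the methods in the answer.

```python
import numpy as np

class RingBuffer2D:
    """Fixed-size 2D circular buffer backed by a preallocated numpy array."""

    def __init__(self, n_rows, n_cols, dtype=float):
        self._data = np.zeros((n_rows, n_cols), dtype=dtype)
        self._start = 0  # position of the oldest row

    def append(self, row):
        # Overwrite the oldest row in place, then advance the start index.
        self._data[self._start] = row
        self._start = (self._start + 1) % len(self._data)

    def view(self):
        # Rows ordered oldest to newest; np.roll returns a copy.
        return np.roll(self._data, -self._start, axis=0)

    def column(self, j):
        # Single column, oldest to newest.
        return self.view()[:, j]

buf = RingBuffer2D(3, 2)
for i in range(5):
    buf.append([i, 10 * i])
ordered = buf.view()
```

The append is cheap, but view() copies the whole buffer to restore the order, so this only pays off when appends are much more frequent than ordered reads.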
