
I have a huge data frame with about 1,041,507 rows.
I wanted to calculate a rolling median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:

df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)

However, this gives me a MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across another similar post which specifies that this is an issue in pandas that occurs for very large sizes (>~ 35000), as can be seen here: https://github.com/pydata/pandas/issues/11697

After this I tried a bit of manipulation to simply get the rolling median by iterating over each group separately:

for index,group in df.groupby(['Category','Subcategory']):
    print pd.rolling_median(group['value'],7,min_periods=7)

Each group is only about 20-25 rows long. Yet this function fails and shows the same MemoryError after a few iterations. I ran the code several times, and every time it showed the MemoryError for different items.

I created some dummy values for anyone to test:

import numpy as np
import pandas as pd

index=[]
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample=pd.DataFrame(np.arange(34000*25),index=index)

for index,group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0],7,7)
    except MemoryError:
        print index
        print pd.rolling_median(group[0],7,7)

If I run the rolling_median again after encountering the MemoryError (as you can see in the above code), it runs fine without raising another exception.

I am not sure how I can calculate my rolling median if it keeps throwing the MemoryError. Can anyone tell me a better way to calculate the rolling median, or help me understand the issue here?

CoderBC
  • I don't have the problem (python 3.4, pandas 0.17). What pandas version do you have? – IanS Mar 10 '16 at 14:14
  • Yes, that could be an issue, I have 2.7.11 python and pandas 0.17 – CoderBC Mar 10 '16 at 14:24
  • @IanS I tried in python 3.5.1: same error. – CoderBC Mar 10 '16 at 14:46
  • Have you tried looking at this question: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – IanS Mar 10 '16 at 15:14
  • I've tried with 500,000 (*25) rows instead of 34,000 in your example, and it still works (and my python session only uses about as much memory as my browser...). – IanS Mar 10 '16 at 15:21
  • You might also want to look at this: http://stackoverflow.com/questions/35782929/pandas-groupby-memory-deallocation. It has no answer (for now) but the OP already makes a few helpful suggestions. – IanS Mar 10 '16 at 15:41

2 Answers


The groupby doesn't look right; you should change

df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)

to

df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)

Otherwise the groupby won't work: df['value'] is a Series with only the "value" column, so 'Category' and 'Subcategory' are not present to group by.

Also, the groupby result is going to be shorter than the dataframe, so creating df['rolling_median'] from it will cause a length mismatch.
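As a sketch you can check on a small dummy frame (column names taken from the question; the data here is made up), the same computation on pandas 0.18+ can be written with groupby/transform and the rolling API, which keeps the original index and avoids any length mismatch on assignment:

```python
import numpy as np
import pandas as pd

# Small dummy frame with the column names from the question
df = pd.DataFrame({
    'Category':    ['A'] * 10 + ['B'] * 10,
    'Subcategory': ['x'] * 10 + ['y'] * 10,
    'value':       np.arange(20, dtype=float),
})

# transform keeps the original index, so the result assigns
# straight back onto the frame
df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .transform(lambda s: s.rolling(7, min_periods=7).median())
)
```

With min_periods=7, the first six rows of each group come out as NaN and the seventh row of each group holds the median of that group's first seven values.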

Hope that helps.

bamdan
  • Yes you are right about that issue, in my code actually the category and subcategory are an index, so what I am doing is something like this: `df['rolling_mean']=df['value'].groupby(level=[0,1]).apply(pd.rolling_median,7,7) ` And it does work for me if instead of rolling_median I use a rolling_mean. But my question is not about a length mismatch – CoderBC Mar 10 '16 at 14:28
  • Ok sure thanks for explanation. Will have a look and see if I can reproduce. – bamdan Mar 10 '16 at 14:54

The bug has been fixed in pandas 0.18.0, and the rolling_mean() and rolling_median() functions have now been deprecated.

This was the bug: https://github.com/pydata/pandas/issues/11697

The new rolling API can be viewed here: http://pandas.pydata.org/pandas-docs/stable/computation.html
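For reference, a minimal sketch of the replacement spelling (a plain Series stands in for the question's data):

```python
import pandas as pd

s = pd.Series(range(10), dtype=float)

# Old, deprecated spelling:
#   pd.rolling_median(s, 7, min_periods=7)
# New rolling API (pandas 0.18+):
result = s.rolling(window=7, min_periods=7).median()
```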

CoderBC