I have a huge data frame with about 1041507 rows.
I wanted to calculate a rolling_median for certain values, under certain categories in my data frame. For this I used a groupBy follwed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this given me a MemoryError: skiplist_insert failed. I will attach the full Traceback if needed, but I came across another similar post which specifies that this is an issue in pandas, as can be seen here https://github.com/pydata/pandas/issues/11697. For a very large size >~ 35000
After this i tried to do a bit of manipulation to simply get the rolling median by iterating over each group separately
for index,group in df.groupby(['Category','Subcategory']):
print pd.rolling_median(group['value'],7,min_period=7)
Each group is about 20-25 rows long only. Yet this function fails and shows the same MemoryError after a few iterations. I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test, here:
index=[]
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample=pd.DataFrame(np.arange(34000*25),index=index)
for index,group in sample.groupby(level=0):
try:
pd.rolling_median(group[0],7,7)
except MemoryError:
print a
print pd.rolling_median(group[0],7,7)
If i run the rolling_median again after encountering the memoryError (as you can see in the above code), it runs fine without another exception-
I am not sure how can I calculate my rolling_median if it keeps throwing the memory Error. Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?