
I have written a function that performs a rolling sum on multiple columns and then appends each new rolling-summed series as a column on the original data frame. Simple enough.

    def rolling_sum(df, columns, w=3):
        '''Insert a rolling sum for each column in `columns`, with window (w) in months.'''
        for column in columns:
            print(column + 'sum' + str(w))

            df.index = df.date
            series = df.groupby(['pixel']).rolling(w)[column].sum()  # no year, so it rolls over Dec-Jan
            series = series.reset_index(level=[0, 1])
            series.rename(index=str, columns={column: column + 'sum' + str(w)}, inplace=True)
            # df = df.merge(series, on=['pixel', 'date'])  # unnecessary
            df = df.merge(series)

            print('successful merge')

        return df

And lo and behold, it works with one dataframe (1 GB) and it kind of works on another. They are merging on the exact same columns: an object 'date' (e.g. 2012-06) and a float64 'pixel' (e.g. 1000.1).

The semi-failing case eats up RAM profusely after the second or third column, as if it is accumulating a bunch of unnecessary memory somehow. That df is only 100 MB, and I have tried downsizing it to 10 MB; it still does the same thing, just making it to the fourth column or so.

Any ideas on how to troubleshoot this thing?

Bstampe
