
I have written a function that performs a rolling sum on multiple columns and then appends each new rolling-summed series as a column on the original data frame. Simple enough.

    def rolling_sum(df, columns, w=3):
        '''Insert a rolling sum for each column in `columns`, with window (w) in months.'''
        for column in columns:
            print(column + 'sum' + str(w))

            df.index = df.date
            series = df.groupby(['pixel']).rolling(w)[column].sum()  # no year, so it rolls over Dec-Jan
            series = series.reset_index(level=[0, 1])
            series.rename(index=str, columns={column: column + 'sum' + str(w)}, inplace=True)
            # df = df.merge(series, on=['pixel', 'date'])  # unnecessary
            df = df.merge(series)

            print('successful merge')

        return df

And lo and behold, it works with one dataframe (1 GB) and it kind of works on another. They are merging on the exact same columns: an object 'date' (e.g. 2012-06) and a float64 'pixel' (e.g. 1000.1).

The semi-failing case eats up RAM profusely after the second or third column, as if it is accumulating a bunch of unnecessary memory somehow. That df is only 100 MB, and I have tried downsizing it to 10 MB; it still does the same thing, just making it to the fourth column or so.

Any ideas on how to troubleshoot this thing?

Bstampe
