pipe several pandas dataframes

Question

I am trying to run several data frames through a pipeline to permanently alter each one, but the changes do not persist outside the for loop. What is the correct syntax for this? Every edit function assigns and returns a data frame, as edit_1g() does. Thank you.

# create pipeline to preprocess the data:
def pipeline_1(df):
    df1 = (df.pipe(edit_2a)
             .pipe(edit_2b)
             .pipe(edit_2d)
             .pipe(edit_2e)
             .pipe(edit_1f)
             .pipe(edit_1j)
             .pipe(edit_1g)
             .pipe(edit_2h)
          )
    return df1

# list the data frames we want to run through our pipeline:
dfs = {'df_orders':df_orders, 'df_accts_summary':df_accts_summary, 'df_accts1':df_accts1, 
       'df_traders_summary':df_traders_summary, 'df_traders1':df_traders1,
       'df_tag76_summary':df_tag76_summary, 'df_tag761':df_tag761}

print('data frames altered via pipeline_1: \n')
for key, values in dfs.items():
    values = pipeline_1(values)       # changes aren't persisting outside of the loop
    print(key + ' ' + str(values.shape))

# round the decimals of columns:
def edit_1g(df):
    d = {'icpwp10bp':0, 'icpwp2bp':0, 'icslippagebpbp':0, 'participationrate':0, 'adv':1, 'twodprioris':0,
         'twodpostis':0, 'orderval':0, 'valuedark':0, 'mktvalflt':0, 'numberoffills': 0, 'size':0,
         'lmtadjintvwap':0, 'fivedsprd':0, 'tendvol':0
        }
    df = df.round(d)
    return df

Can you also list an example of one of your edit functions? Do they take and return a dataframe? — 9769953, Jan 11 '19 at 09:03
See marked duplicates, `values = pipeline_1(values)` just updates a variable `values`. It doesn't update your dictionary. — jpp, Jan 11 '19 at 09:25

score 2 · Accepted Answer · answered Jan 11 '19 at 09:03

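The problem is in the last part. `values = pipeline_1(values)` only rebinds the loop variable `values`; the result is never stored back in `dfs`: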

for key, values in dfs.items():
    values = pipeline_1(values)      
    print(key + ' ' + str(values.shape))

You should store the result in a dictionary:

new_dfs = dict()
for key, values in dfs.items():
    values = pipeline_1(values)    
    new_dfs[key] = values 
    print(key + ' ' + str(values.shape))

Now new_dfs contains the new data frames. However, this approach duplicates your data, because dfs still holds the originals. You can assign back into dfs instead:

for key, values in dfs.items():
    values = pipeline_1(values)    
    dfs[key] = values 
    print(key + ' ' + str(values.shape))
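The difference between rebinding the loop variable and assigning into the dictionary can be checked with a small sketch. The frame `df_a` and the one-step `pipeline_1` below are hypothetical stand-ins, not the asker's real data or edit functions:

```python
import pandas as pd

def pipeline_1(df):
    # Hypothetical stand-in for the real pipeline: returns a NEW, rounded frame
    return df.round(0)

df_a = pd.DataFrame({'x': [1.4, 2.6]})
dfs = {'df_a': df_a}

# Rebinding the loop variable: the dictionary entry is left untouched
for key, values in dfs.items():
    values = pipeline_1(values)
unchanged = dfs['df_a']['x'].tolist()

# Assigning by key: the dictionary entry now points at the processed frame
for key in dfs:
    dfs[key] = pipeline_1(dfs[key])
updated = dfs['df_a']['x'].tolist()

# The module-level name df_a still refers to the original frame
original = df_a['x'].tolist()

print(unchanged, updated, original)
```

Note that the name `df_a` still refers to the unprocessed frame afterwards. To use the processed one under the old name, it has to be re-assigned explicitly, for example `df_a = dfs['df_a']`.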

Mikhail Stepanov


Thank you very much. If I do the above, new_dfs['df_tag761'] has the desired results but df_tag761 does not. That is fine, but ideally I'd like to edit df_tag761 directly. – newbie78634 Jan 11 '19 at 09:31
In that case, you should turn the functions in the pipe (`edit_1g`, `edit_1f`, ...) into `inplace` functions (i.e. each function modifies the dataframe in place, then returns it). But IMO that is less clear, and it is better to return and re-assign. – Mikhail Stepanov Jan 11 '19 at 09:51
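A minimal sketch of what such an in-place edit function might look like. The name `edit_1g_inplace`, the sample frame, and the shortened column list are assumptions for illustration only. The function assigns rounded columns back onto the frame it receives instead of building a new one with `df.round(d)`:

```python
import pandas as pd

def edit_1g_inplace(df):
    # Hypothetical in-place variant of edit_1g (shortened column list)
    decimals = {'adv': 1, 'size': 0}
    for col, n in decimals.items():
        if col in df.columns:
            # Column assignment mutates the frame that was passed in
            df[col] = df[col].round(n)
    return df  # returned so the function still works with .pipe()

# Made-up sample data standing in for the real df_tag761
df_tag761 = pd.DataFrame({'adv': [1.234, 5.678], 'size': [10.4, 20.6]})
dfs = {'df_tag761': df_tag761}

for key, frame in dfs.items():
    frame.pipe(edit_1g_inplace)  # no re-assignment needed

print(df_tag761)
```

This only works if every function in the pipe mutates its argument. A single step that returns a new frame (as `df.round(d)` does) breaks the chain, and later steps then modify a copy rather than the original.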
