Applying methods to multiple datasets in pandas

Question

I would like to apply the .assign method, with several lambda functions, to multiple datasets. So far, I've tried a for loop without success:

import numpy as np
import pandas as pd

a = pd.DataFrame({'a': np.arange(5),
                  'b': np.arange(5)})

b = pd.DataFrame({'a': np.arange(5,10),
                  'b': np.arange(5,10)})

for data in [a,b]:
    data.assign(c = lambda x: x.a+x.b,
                d = lambda x: x.a^x.b)

Edit:

The following doesn't work either:

for data in [a,b]:
    data = data.assign(c = lambda x: x.a+x.b,
                d = lambda x: x.a^x.b)

That doesn't work because `assign` doesn't modify the existing dataframe in place, but instead returns a new dataframe object. — cglacet, Mar 22 '19 at 15:10
I guess that in practice you want a solution that works for any number of dataframes? — cglacet, Mar 22 '19 at 15:13
Check out this answer https://stackoverflow.com/questions/38297292/apply-a-for-loop-to-multiple-dataframes-in-pandas — pistolpete, Mar 22 '19 at 15:16

cglacet · Accepted Answer · 2019-03-22T17:29:46.467

The main reason why this doesn't work is that assign doesn't modify the existing dataframe in place, but instead returns a new dataframe object.
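To make this concrete, here is a minimal sketch using the first dataframe from the question. The name a_with_c is only illustrative and is not part of the original code.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'a': np.arange(5),
                  'b': np.arange(5)})

# assign builds and returns a new dataframe; the result is discarded here
a.assign(c=lambda x: x.a + x.b)
print(list(a.columns))         # ['a', 'b'], a itself is unchanged

# Keeping the returned object is what makes the new column available
a_with_c = a.assign(c=lambda x: x.a + x.b)
print(list(a_with_c.columns))  # ['a', 'b', 'c']
```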

What you want to do is apply the same function to several objects, which is exactly what the map function is made for:

def assign(df):
    return df.assign(c = lambda x: x.a+x.b,
                     d = lambda x: x.a^x.b)

(a, b) = map(assign, (a,b))

A more general solution is the following:

# Imagine we don't have control over the following line of code:
dataframes = (a, b)

# We can still use the same solution: 
dataframes = tuple(map(assign, dataframes))
print(dataframes[0])
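If you would rather keep each dataframe reachable by name instead of by position, the same idea could be written with a dict comprehension. This is only a sketch: the frames dictionary and its keys are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

def assign(df):
    return df.assign(c=lambda x: x.a + x.b,
                     d=lambda x: x.a ^ x.b)

# Hypothetical container: the dataframes are stored under explicit names
frames = {
    'a': pd.DataFrame({'a': np.arange(5), 'b': np.arange(5)}),
    'b': pd.DataFrame({'a': np.arange(5, 10), 'b': np.arange(5, 10)}),
}

# Rebuild the dict with the transformed dataframes
frames = {name: assign(df) for name, df in frames.items()}
print(frames['b'])
```

Like the map version, this replaces the stored objects with the new dataframes rather than trying to modify the originals.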

Concerning your edit, the reason why this doesn't work is a bit more interesting. It may not seem obvious in your code, but it will be in this one:

a = [1, 2, 3]
data = a
data = [4, 5, 6]
print(data)

Here it is clear that this outputs [4, 5, 6] and not [1, 2, 3].

What happens in both your code and this last one is the same:

data = a: data is bound to the same object as a (resp. b).
data = ...: creates a new binding for data, leaving the existing binding of a untouched (data was only bound to the same object as a; data never was a).

In the end, for data in [a, b]: doesn't mean that data will be an alias for a (resp. b) during the next iteration (which is what you may expect when writing this). Instead, for data in [a, b]: is simply equivalent to:

data = a
# 1st iteration
data = b
# 2nd iteration
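The difference between rebinding a name and mutating an object can be checked directly. The sketch below first repeats the loop from the edit, then shows one possible alternative that does work inside a loop: item assignment, which modifies the dataframe object itself. The variable original is only there for the identity check and is not part of the question's code.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5)})
b = pd.DataFrame({'a': np.arange(5, 10), 'b': np.arange(5, 10)})

# Rebinding: data ends up pointing at a new object, a and b are untouched
for data in [a, b]:
    original = data
    data = data.assign(c=lambda x: x.a + x.b)
    assert data is not original

print('c' in a.columns)  # False

# Mutation: item assignment changes the object that a (resp. b) refers to
for data in [a, b]:
    data['c'] = data.a + data.b
    data['d'] = data.a ^ data.b

print(list(a.columns))   # ['a', 'b', 'c', 'd']
```

Whether to mutate in place or to rebuild the collection with map is a design choice; the map version leaves the original dataframes unmodified.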

thanks! I edited the question because I forgot to put a `data = data.assign...` — jcp, Mar 22 '19 at 16:12
You should leave your code as it was, otherwise people reading the question and the answer won't understand what is going on ^^ — cglacet, Mar 22 '19 at 16:14
I edited too, to have a full answer to this question. I hope it makes sense; let me know if you have any questions, as I think this is an interesting question/answer to get right and clear. — cglacet, Mar 22 '19 at 16:49
