3

Say I have one dataframe

import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

I also have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form, so I'm not looking for answers that use agg to reimplement the f below.

def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?

output_df = pd.concat([f(row) for _, row in input_df.iterrows()])

I thought I should be able to use apply or similar for this purpose but nothing seemed to work.

   x  y
0  2  1
1  3  4
0  6  4
1  5  9
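For reference, a minimal sketch of two variants of the `pd.concat` line above. `ignore_index=True` renumbers the result from 0, while `keys=` records which input row each block came from. Both are standard `pd.concat` parameters; the names `pieces`, `flat` and `keyed` are made up for illustration.

```python
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

# One small dataframe per input row, exactly as in the list comprehension above.
pieces = [f(row) for _, row in input_df.iterrows()]

# Variant 1: discard the per-piece indexes and renumber 0..n-1.
flat = pd.concat(pieces, ignore_index=True)

# Variant 2: keep the source row label as the outer level of a MultiIndex.
keyed = pd.concat(pieces, keys=input_df.index)
```

With `keyed`, something like `keyed.loc[1]` returns just the rows produced from input row 1.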
Epimetheus

2 Answers

1

You can use DataFrame.agg to calculate the product and sum, numpy.ndarray.reshape to flatten the results, and df.pow(2) or np.square to calculate the squares.

import numpy as np

df = input_df  # the dataframe from the question
out = pd.DataFrame({'x': df.agg(['prod', 'sum'], axis=1).to_numpy().reshape(-1),
                    'y': np.square(df).to_numpy().reshape(-1)})
out

   x  y
0  2  1
1  3  4
2  6  4
3  5  9
Ch3steR
  • thanks for the answer. This doesn't look like it is going to work for more complex `f`. Is there an idiomatic way for more general functions? I'll clarify in the question. – Epimetheus Jul 26 '20 at 15:49
0

You should avoid iterating over rows (see "How to iterate over rows in a DataFrame in Pandas").

Instead try:

df = df.assign(product=df.a*df.b, sum=df.sum(axis=1), 
    asq=df.a**2, bsq=df.b**2)

Then:

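# Select only the derived columns; df still contains a and b as well.
df = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df[['product', 'sum', 'asq', 'bsq']].to_numpy()]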
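The nested list above still has to be turned into the two-column frame the question asks for. One possible way, sketched here rather than taken from the answer, is to reshape the NumPy array directly. The names `values`, `pairs` and `out` are made up for illustration, and the sum is written as `df.a + df.b` to keep the example explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
df = df.assign(product=df.a * df.b, sum=df.a + df.b,
               asq=df.a**2, bsq=df.b**2)

# Shape (n_rows, 4): the columns are product, sum, asq, bsq.
values = df[['product', 'sum', 'asq', 'bsq']].to_numpy()

# (n, 4) -> (n, 2, 2) groups [[product, sum], [asq, bsq]] per input row;
# swapping the last two axes pairs them as [product, asq] and [sum, bsq],
# and the final reshape stacks those pairs as output rows.
pairs = values.reshape(-1, 2, 2).transpose(0, 2, 1).reshape(-1, 2)

out = pd.DataFrame(pairs, columns=['x', 'y'])
```

This only works because every input row yields the same number of output rows; for a general `f` that assumption may not hold.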
RichieV
  • thanks for the answer. This doesn't look like it is going to work for more complex `f`. Is there an idiomatic way for more general functions? I'll clarify in the question – Epimetheus Jul 26 '20 at 15:49
  • 1
    Here is an excellent answer https://stackoverflow.com/a/54028200/6692898 – RichieV Jul 26 '20 at 16:02
  • I've had a look at that and it does cover a lot of interesting topics. However the main focus seems to be performance. A lot of pandas stackoverflow seems to obsess over performance. I believe writing simple understandable idiomatic code should be the primary aim. If performance is unacceptable then by all means optimise but it shouldn't be the primary objective. However I guess you are pointing out that the list comprehension and `pd.concat` is not so bad and might even be idiomatic. Thanks. – Epimetheus Jul 26 '20 at 16:31
  • 1
    If you want to use different and dynamically selected functions then np.vectorize is your solution. Performance is just as important as readability in many cases, why would you choose the slower option if it works just as well? – RichieV Jul 26 '20 at 16:36
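To make the last comment concrete, here is a minimal sketch of `np.vectorize` applied to a hypothetical scalar function `g` (not from the thread). Note that the NumPy documentation describes `vectorize` as a convenience whose implementation is essentially a for loop, so it should not be assumed to outperform a list comprehension.

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def g(a, b):
    # Hypothetical scalar function with branching logic that does not
    # map onto a single built-in aggregation.
    return a * b if a % 2 else a + b

# np.vectorize wraps g so it accepts arrays and is called element by element.
vg = np.vectorize(g)
result = vg(input_df['a'].to_numpy(), input_df['b'].to_numpy())
```

Here `g` returns one scalar per row; a function like `f` that returns a whole dataframe per row does not fit this pattern directly.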