df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

Question

I have a dataframe that looks like this:

  a b c
0 x x x
1 y y y
2 z z z

I would like to apply a function to each row of the dataframe. The function builds a new dataframe with multiple rows from each input row and returns it. Here is my_func:

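import pandas as pd

def my_func(df):
    # Number of copies to make of this row
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        # df.apply(axis=1) passes each row as a Series; turn it into a one-row frame
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        # groupby().apply() passes each group as a DataFrame already
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded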

The final dataframe should look something like this:

  a b c
0 x x x
1 x x x
2 y y y
3 y y y
4 y y y 
5 z z z
6 z z z

So I did:

df_expanded = df.apply(my_func, axis=1)

I inserted breakpoints inside the function and for each row, the created dataframe from my_func is correct. However, at the end, when the last row returns, I get an error stating that:

ValueError: cannot copy sequence with size XX to array axis with dimension YY

It looks as if apply is trying to build a Series rather than collect the DataFrames that the function created.

So instead of df.apply I did:

df_expanded = df.groupby(df.index).apply(my_func)

This just creates single-row groups and applies the same function to each of them. This, on the other hand, works.

Why?
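
One way to see the difference is to check what each approach actually hands to the function. A minimal sketch, assuming hypothetical numeric values for a, b and c (the x/y/z in the tables above are only placeholders):

```python
import pandas as pd

# Hypothetical numeric values; the question's x/y/z are placeholders.
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30], "c": [3, 4, 4]})

# df.apply(..., axis=1) calls the function once per row, passing a Series.
row_types = df.apply(lambda row: type(row).__name__, axis=1).tolist()

# Grouping on the index yields one single-row DataFrame per index label.
group_types = [type(g).__name__ for _, g in df.groupby(df.index)]
group_shapes = [g.shape for _, g in df.groupby(df.index)]

print(row_types)
print(group_types)
print(group_shapes)
```

So the same my_func receives a Series in one case and a one-row DataFrame in the other, which is why it needs the isinstance check.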

Yes, because `groupby().apply()` is built to handle that situation; whereas `.apply` really expects `scalar/pd.Series`, not `pd.DataFrame`. — Quang Hoang, Sep 19 '22 at 18:41
@QuangHoang could you please elaborate on that? When you say expects, the input 'is' a series since I passed axis=1. The problem is that it seems that the output cannot be a dataframe when we use df.apply(axis=1). Is this true? — Maz, Sep 19 '22 at 18:50
@Maz yes, but the input for `groupby().apply()` is a dataframe, not a series. — Quang Hoang, Sep 19 '22 at 18:55
If you must really loop, can't you use `pd.concat([my_func(r) for _,r in df.iterrows()])`? Although groupby on the index might still be better... — mozway, Sep 19 '22 at 18:58
You can compare the timings, I would expect the groupby to be faster... (let us know if you do) — mozway, Sep 19 '22 at 19:16
Not sure if you only wanted to know why it worked with groupby and not with apply on the df itself, but you can have a look [here](https://stackoverflow.com/questions/50788508/how-can-i-replicate-rows-in-pandas) at how to duplicate rows of your df. Instead of a fixed value, you could compute the difference of the two columns beforehand, like `df['rep']=df.c - df.a`, and use that as the repetition argument. — Rabinzel, Sep 19 '22 at 19:52
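
The row-repetition idea from the last comment could be sketched as follows. This is only an illustration under assumptions: the numeric values are hypothetical (chosen so that c - a gives 2, 3 and 2 copies, matching the desired output in the question), and the names `reps` and `expanded` are made up for the example.

```python
import pandas as pd

# Hypothetical numeric values; the question's x/y/z are placeholders.
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30], "c": [3, 4, 4]})

# Number of copies wanted for each row, as in my_func (c - a).
reps = df["c"] - df["a"]

# Repeat each index label `reps` times, select those rows,
# then renumber the result from 0.
expanded = df.loc[df.index.repeat(reps)].reset_index(drop=True)

print(expanded)
```

This avoids calling a Python function per row altogether, so neither apply nor groupby is needed for the duplication itself.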

score 0 · Answer 1 · answered Sep 19 '22 at 22:04

Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.

Given:

Doing:

new_df = (df.apply(lambda x: [x.tolist()]*(x.c-x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)

Output:
