I have a dataframe that looks like this:

  a b c
0 x x x
1 y y y
2 z z z 

I would like to apply a function to each row of the dataframe. The function creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:

def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()]*dup_num, 
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)]*dup_num, 
                                ignore_index=True)
    return df_expanded

The final dataframe should look something like this:

  a b c
0 x x x
1 x x x
2 y y y
3 y y y
4 y y y 
5 z z z
6 z z z

So I did:

df_expanded = df.apply(my_func, axis=1)

I inserted breakpoints inside the function and for each row, the created dataframe from my_func is correct. However, at the end, when the last row returns, I get an error stating that:

ValueError: cannot copy sequence with size XX to array axis with dimension YY

It is as if apply is trying to return a Series rather than the group of DataFrames that the function created.

So instead of df.apply I did:

df_expanded = df.groupby(df.index).apply(my_func)

This just creates single-row groups and applies the same function to each one. This, on the other hand, works.

Why?

Maz
  • Could you please add the function `my_func`? – Rabinzel Sep 19 '22 at 18:37
  • Yes, because `groupby().apply()` is built to handle that situation, whereas `.apply` really expects `scalar/pd.Series`, not `pd.DataFrame`. – Quang Hoang Sep 19 '22 at 18:41
  • @Rabinzel I added an example. – Maz Sep 19 '22 at 18:49
  • @QuangHoang could you please elaborate on that? When you say expects, the input 'is' a series since I passed axis=1. The problem is that it seems that the output cannot be a dataframe when we use df.apply(axis=1). Is this true? – Maz Sep 19 '22 at 18:50
  • Could you please add the function `my_func`? – Rabinzel Sep 19 '22 at 18:53
  • @Maz yes, but the input for `groupby().apply()` is a dataframe, not a series. – Quang Hoang Sep 19 '22 at 18:55
  • If you must really loop, can't you use `pd.concat([my_func(r) for _,r in df.iterrows()])`? Although groupby on the index might still be better... – mozway Sep 19 '22 at 18:58
  • @mozway that worked too (as expected) – Maz Sep 19 '22 at 19:06
  • You can compare the timings, I would expect the groupby to be faster... (let us know if you do) – mozway Sep 19 '22 at 19:16
  • Not sure if you only wanted to know why it worked with groupby and not with apply on the df itself, but you can have a look [here](https://stackoverflow.com/questions/50788508/how-can-i-replicate-rows-in-pandas) at how to duplicate rows of your df. Instead of a fixed value, you could compute the difference of the two columns beforehand, like `df['rep']=df.c - df.a`, and use that as the repetition argument. – Rabinzel Sep 19 '22 at 19:52
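
The point made in the comments above, that `df.apply(..., axis=1)` hands the function a `pd.Series` while `groupby().apply()` hands it a `pd.DataFrame`, can be checked with a small sketch. The numeric frame and the `record` helper below are illustrative assumptions, not part of the original question; the helper only logs the type of each object it receives and returns it unchanged.

```python
import pandas as pd

# Illustrative numeric frame (an assumption; the question uses placeholder values).
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 4, 4]})

seen_by_apply = []
seen_by_groupby = []


def record(target):
    """Hypothetical helper: build a function that logs the type of its input."""
    def inner(obj):
        target.append(type(obj).__name__)
        return obj  # return the input unchanged so both calls succeed
    return inner


# Row-wise apply: each call receives one row as a Series.
df.apply(record(seen_by_apply), axis=1)

# Grouping on the index: each call receives a one-row DataFrame.
df.groupby(df.index).apply(record(seen_by_groupby))

print(set(seen_by_apply))
print(set(seen_by_groupby))
```

This only shows what each method passes in. How each one combines the returned objects is a separate matter, and that combining step is where the two approaches in the question diverge.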

1 Answer

Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.

Given:

   a  b  c
0  1  1  4
1  2  2  4
2  3  3  4

Doing:

new_df = (df.apply(lambda x: [x.tolist()]*(x.c-x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)

Output:

   a  b  c
0  1  1  4
1  1  1  4
2  1  1  4
3  2  2  4
4  2  2  4
5  3  3  4
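
As an alternative, the row-repetition idea suggested in the comments can be sketched without `apply` at all. This is a minimal sketch, assuming the same numeric frame as above and that `c - a` is a non-negative integer for every row; the names `reps` and `new_df` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 4, 4]})

# Number of copies wanted for each row (assumed non-negative integers).
reps = df["c"] - df["a"]

# Repeat each index label reps[i] times, select those rows, then renumber.
new_df = df.loc[df.index.repeat(reps)].reset_index(drop=True)
print(new_df)
```

Because the repetition is done on the index in one step, no per-row Python function is called, which may matter for larger frames.
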
BeRT2me