EDIT: Based on the comments, I have clarified the examples to depict a more realistic use case.

I want to call a function with df.apply. This function returns multiple DataFrames, and I want to combine these DataFrames into logical groups. I have not been able to do that without a for loop (which defeats the purpose of calling the function with apply).

I have tried calling the function for each row of the DataFrame, and that is slower than apply. However, with apply, combining the results slows things down again.

Any tips?

import numpy as np
import pandas as pd

# input data frame
data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in = pd.DataFrame(data)
print(df_in)

Output>

  Name  Age  Score
0  Ani   15     93
1  Bob   12     98
2  Cal   13     95
3  Dom   14     99

Function to be applied>

def func1(name, age):
    # Stand-in for a 3rd-party call: returns a random number of rows (possibly zero)
    num_rows = np.random.randint(int(age/3))
    age_mul_1 = np.random.randint(low=1, high=age, size=num_rows)
    age_mul_2 = np.random.randint(low=1, high=age, size=num_rows)
    data = {'Name': [name]*num_rows, 'Age_Mul_1': age_mul_1, 'Age_Mul_2': age_mul_2}
    return pd.DataFrame(data)

def func2(name, age, score, other_params):
    # Stand-in for a second 3rd-party call; other_params is unused in this simplified example
    num_rows = np.random.randint(int(score/10))
    score_mul_1 = np.random.randint(low=age, high=score, size=num_rows)
    data2 = {'Name': [name]*num_rows, 'score_Mul_1': score_mul_1}
    return pd.DataFrame(data2)

def ret_mul_df(row):
    # Called once per row of df_in; returns one DataFrame per logical group
    df_A = func1(row['Name'], row['Age'])
    df_B = func2(row['Name'], row['Age'], row['Score'], 1)
    return df_A, df_B

What I essentially want to create is two DataFrames, df_A_combined and df_B_combined.

However, this is how I am currently combining them:

df_out = df_in.apply(lambda row: ret_mul_df(row), axis=1)
df_A_combined = pd.DataFrame()
df_B_combined = pd.DataFrame()
for ser in df_out:
    df_A_combined = df_A_combined.append(ser[0], ignore_index=True)
    df_B_combined = df_B_combined.append(ser[1], ignore_index=True)
print(df_A_combined)

Output>

  Name  Age_Mul_1  Age_Mul_2
0  Ani          7          8
1  Ani          1          4
2  Ani          1          8
3  Ani         12          6
4  Bob          9          8
5  Cal          8          7
6  Cal          8          1
7  Cal          4          8

print(df_B_combined)

Output>

   Name  score_Mul_1
0   Ani           28
1   Ani           29
2   Ani           50
3   Ani           35
4   Ani           84
5   Ani           24
6   Ani           51
7   Ani           28
8   Bob           32
9   Cal           26
10  Cal           70
11  Dom           56
12  Dom           53

How can I avoid the iteration?

func1 and func2 are calls to 3rd-party libraries (which are very computation-intensive), and several such calls are made. Also, the DataFrames df_A_combined and df_B_combined cannot be combined with each other.

Note: This is a much simplified example, and splitting the function would lead to a lot of redundancy.

Ani
  • Can you post what the two final dataframes would look like? It's not clear that you need apply() here. – Jonathan Leon Dec 04 '20 at 20:24
  • Why do you need multiple dataframes? – Paul H Dec 04 '20 at 22:51
  • @JonathanLeon, please see my enhanced example with two final dataframes below. I want to avoid combining dataframes in a for loop (as below) as it's very heavy – Ani Dec 04 '20 at 22:52
  • You should include that information in the original question. Not as an "answer" – Paul H Dec 04 '20 at 22:54
  • @PaulH, I need multiple dataframes as they are sent downstream for different processing. – Ani Dec 05 '20 at 00:34
  • @PaulH, should I delete my 'answer' and edit original question even now? – Ani Dec 05 '20 at 00:35
  • yes. most definitely. your "answer" is not an answer – Paul H Dec 05 '20 at 00:36
  • After looking at your edits, I don't see a way to avoid iterating through something. See https://stackoverflow.com/questions/24029659/python-pandas-replicate-rows-in-dataframe for adding duplicate rows; then maybe you can apply all your multipliers? But when your random number of rows is zero, without iterating and telling the code to delete that row from the dataframe, I'm not sure how you'd solve your issue. I looked at grouping (which also loops) and it was slower than your solution. Hopefully someone else can offer better advice. – Jonathan Leon Dec 05 '20 at 02:58
  • @PaulH, deleted my 'non-answer' reply after editing the original question – Ani Dec 05 '20 at 09:51
  • @JonathanLeon, ok, thank you for your reply. Combining the output dataframes after the apply call is still somewhat faster than iterating through df_in, calling ret_mul_df, and combining the results during each iteration – Ani Dec 05 '20 at 09:55
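
A possible way to reduce the cost of the combining step discussed in these comments, offered as a sketch rather than something tested against the real workload: the loop in the question calls DataFrame.append once per row, and each call copies everything accumulated so far (the method was also removed in pandas 2.0). Collecting the per-row frames and calling pd.concat once per group avoids that repeated copying. Here combine_frames is a hypothetical helper name, and the small hand-built frames are stand-ins for the real outputs of func1 and func2.

```python
import pandas as pd


def combine_frames(pairs):
    """Concatenate an iterable of (df_A, df_B) tuples into two DataFrames.

    `pairs` is assumed to be non-empty, e.g. the Series of tuples that
    df_in.apply(ret_mul_df, axis=1) produces in the question.
    """
    # zip(*pairs) regroups [(a0, b0), (a1, b1), ...] into (a0, a1, ...) and (b0, b1, ...)
    frames_a, frames_b = zip(*pairs)
    # One concat per group instead of one append per row
    df_a = pd.concat(frames_a, ignore_index=True)
    df_b = pd.concat(frames_b, ignore_index=True)
    return df_a, df_b


# Stand-in data: two rows' worth of (df_A, df_B) results
pairs = [
    (pd.DataFrame({'Name': ['Ani', 'Ani'], 'Age_Mul_1': [7, 1], 'Age_Mul_2': [8, 4]}),
     pd.DataFrame({'Name': ['Ani'], 'score_Mul_1': [28]})),
    (pd.DataFrame({'Name': ['Bob'], 'Age_Mul_1': [9], 'Age_Mul_2': [8]}),
     pd.DataFrame({'Name': ['Bob', 'Bob'], 'score_Mul_1': [32, 40]})),
]
df_A_combined, df_B_combined = combine_frames(pairs)
print(df_A_combined)
print(df_B_combined)
```

With the question's code this would presumably be called as combine_frames(df_in.apply(ret_mul_df, axis=1)). The per-row calls still run one at a time, so this only addresses the combining cost, not the cost of func1 and func2 themselves.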

1 Answer

If this isn't what you want, I'll update if you can post what the two dataframes should look like.

import pandas as pd

data = {'Name':['Ani','Bob','Cal','Dom'], 'Age': [15,12,13,14], 'Score': [93,98,95,99]}
df_in = pd.DataFrame(data)
print(df_in)

# .copy() avoids a SettingWithCopyWarning when adding the new column
df_A = df_in[['Name','Age']].copy()
df_A['Age_Multiplier'] = df_A['Age'] * 3
print(df_A)

  Name  Age  Age_Multiplier
0  Ani   15              45
1  Bob   12              36
2  Cal   13              39
3  Dom   14              42

# .copy() avoids a SettingWithCopyWarning when adding the new column
df_B = df_in[['Name','Score']].copy()
df_B['Score_Multiplier'] = df_B['Score'] * 2
print(df_B)

  Name  Score  Score_Multiplier
0  Ani     93               186
1  Bob     98               196
2  Cal     95               190
3  Dom     99               198
Jonathan Leon
  • Yes, I need these df_A and df_B. As I mentioned, this is a very simplified example of the actual use case. In the 'real' use case the only shared column is, for example, 'Name'. The remaining structure looks very different (including multiple rows for the same Name in df_B) – Ani Dec 04 '20 at 21:27
  • If you can post your actual structure (with dummy data if need be), we can help, but it's really not clear what your df outputs are supposed to be. – Jonathan Leon Dec 04 '20 at 22:00
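
On the question raised above of whether apply() is needed at all: df.apply(..., axis=1) is itself a row-by-row Python loop that builds a Series for every row, so a plain loop over itertuples() is one alternative that may be worth timing. The sketch below assumes only 'Name' is shared between the two outputs and that df_B can hold several rows per name, as described in the comment above. make_a and make_b are hypothetical, deterministic stand-ins for func1 and func2; whether this is any faster with the real third-party calls is untested.

```python
import pandas as pd


def make_a(name, age):
    # Hypothetical stand-in for func1: one row per call
    return pd.DataFrame({'Name': [name], 'Age_Mul_1': [age * 2]})


def make_b(name, age, score):
    # Hypothetical stand-in for func2: two rows per call
    return pd.DataFrame({'Name': [name] * 2, 'score_Mul_1': [score - age, score + age]})


df_in = pd.DataFrame({'Name': ['Ani', 'Bob'], 'Age': [15, 12], 'Score': [93, 98]})

frames_a, frames_b = [], []
# itertuples yields lightweight namedtuples instead of a Series per row
for row in df_in.itertuples(index=False):
    frames_a.append(make_a(row.Name, row.Age))
    frames_b.append(make_b(row.Name, row.Age, row.Score))

# Combine each group once, after all rows have been processed
df_A_combined = pd.concat(frames_a, ignore_index=True)
df_B_combined = pd.concat(frames_b, ignore_index=True)
print(df_A_combined)
print(df_B_combined)
```

If the third-party calls dominate the runtime, the choice between apply and itertuples is unlikely to matter much, and the single concat at the end is probably where most of the saving comes from.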