I have the following dataframe, groupby objects, and functions.

import pandas as pd

df = pd.DataFrame({
    'A': 'a a b b b'.split(),
    'P': 'p p p q q'.split(),
    'B': [1, 2, 3, 4, 5],
    'C': [4, 6, 5, 7, 8],
    'D': [9, 10, 11, 12, 13]})

g1 = df.groupby('A')

g2 = df.groupby('P')

def f1(x, y):
    return sum(x) + sum(y)

def f2(x, y):
    return sum(x) - sum(y)

def f3(x, y):
    return x * y

For g1, I want to

  • apply f1 to columns B and C
  • apply f2 to columns C and D.

For g2, I want to

  • apply f2 to columns B and C
  • apply f3 to columns C and D

To me, the difficulty lies in the functions, which operate on multiple columns. I also need the functions to work for any arbitrary set of columns; notice how f2 is used for ['B', 'C'] and ['C', 'D']. I'm struggling with the syntax to deal with this.

How do I use Pandas to do all of these things in Python?

Iterator516
  • Does this answer your question? [Apply multiple functions to multiple groupby columns](https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns) – Amit Vikram Singh Apr 18 '21 at 17:08
  • Can you share your expected output? – Psidom Apr 18 '21 at 17:24
  • This is a good example of how to provide useful test data. All too often people do things like "Here's some code that loads a CSV from my hard drive", and there's no way for people trying to answer the question to test their proposed code. – Acccumulation Apr 18 '21 at 18:00
  • @AmitVikramSingh No, it does not. My functions involve operations between any 2 possible columns. That thread uses functions that involve only 1 column at a time. – Iterator516 Apr 18 '21 at 18:47
  • @Iterator516 If you look at the section `Using apply and returning a Series` in @TedPetrou's answer, he uses multiple columns there. – Amit Vikram Singh Apr 18 '21 at 18:55
  • @Iterator516 What do you mean by `add 'E' to the grouped dataframe from g2`? Can you post the expected output? – Amit Vikram Singh Apr 18 '21 at 18:56
  • @AmitVikramSingh Thanks for clarifying. To elaborate, my functions need to work for any arbitrary set of columns. Notice that I need to use f2 for both ['B', 'C'] and ['C', 'D']. That example from TedPetrou does not show what I want to do. – Iterator516 Apr 18 '21 at 19:05
  • That answer has `d['c_d_prodsum'] = (x['c'] * x['d']).sum()` using two columns `c` and `d`. – Amit Vikram Singh Apr 18 '21 at 19:11
  • @AmitVikramSingh Yes, but that function is specific to the columns 'c' and 'd' of that particular dataframe, d. How do I write a function that works for any 2 arbitrary columns? – Iterator516 Apr 18 '21 at 19:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231287/discussion-between-iterator516-and-amit-vikram-singh). – Iterator516 Apr 18 '21 at 19:19

1 Answer

I don't know if there's a simpler way to do it, but one way is to use currying: write a function that takes the column names and returns a function of the group. I wasn't able to find a way to add a column through the groupby structure itself (the structures involved are designed around immutable data), so for the product I just worked with each group's data directly. See whether the following code does what you want:

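def sum_curry(x, y):
    # Build a function that adds the totals of columns x and y for one group
    return lambda df: df[x].sum() + df[y].sum()

def diff_curry(x, y):
    # Build a function that subtracts the total of column y from that of column x
    return lambda df: df[x].sum() - df[y].sum()

def append_prod(df):
    # Return a copy with E = C * D (row-wise) instead of mutating the group
    return df.assign(E=df['C'] * df['D'])

g1_sums = g1.apply(sum_curry('B', 'C'))
g1_diffs = g1.apply(diff_curry('C', 'D'))
g2_diffs = g2.apply(diff_curry('B', 'C'))
# Iterating over a groupby yields (key, sub-DataFrame) pairs
g2_with_prod = [(key, append_prod(group)) for key, group in g2]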
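
A minimal sketch of the same idea that reuses f1 and f2 from the question unchanged, so that one helper covers any function and any pair of columns. The helper name `on_columns` is hypothetical, and selecting `['B', 'C', 'D']` before calling `apply` is an assumption made here to keep the grouping column out of the groups being passed in.

```python
import pandas as pd

df = pd.DataFrame({
    'A': 'a a b b b'.split(),
    'P': 'p p p q q'.split(),
    'B': [1, 2, 3, 4, 5],
    'C': [4, 6, 5, 7, 8],
    'D': [9, 10, 11, 12, 13]})

def f1(x, y):
    return sum(x) + sum(y)

def f2(x, y):
    return sum(x) - sum(y)

def on_columns(func, x, y):
    # Hypothetical helper: bind a two-argument function to two column names,
    # returning a function that takes one group (a DataFrame)
    return lambda group: func(group[x], group[y])

value_cols = ['B', 'C', 'D']
g1 = df.groupby('A')[value_cols]
g2 = df.groupby('P')[value_cols]

g1_f1 = g1.apply(on_columns(f1, 'B', 'C'))  # one value per group in A
g1_f2 = g1.apply(on_columns(f2, 'C', 'D'))
g2_f2 = g2.apply(on_columns(f2, 'B', 'C'))  # same f2, different columns
```

Because the column names are bound at call time, f2 needs no changes to serve both `['C', 'D']` on g1 and `['B', 'C']` on g2.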
Acccumulation
  • Thanks for your detailed reply, but g2_with_prod differs from what I expect. I edited my question to include my expected output above. What is the source of our disagreement? – Iterator516 Apr 18 '21 at 22:00
  • @Iterator516 In your "desired output" screenshot, you have D as having 30 and 25, but I don't see those numbers in the example data that you're using. – Acccumulation Apr 18 '21 at 22:07
  • Here is how I arrived at those numbers: For G2, I'm grouping by the column "P", so I'm adding the first 3 numbers for "p" and the last 2 numbers for "q". 9 + 10 + 11 = 30. 12 + 13 = 25. – Iterator516 Apr 18 '21 at 22:16
  • @Iterator516 In Pandas, `groupby` creates an object used for aggregation. It is not aggregation itself. If you want it aggregated by sum, you have to tell Pandas that. It sounds like `aggregated_df = df.groupby('P').sum()` and then `aggregated_df['E'] = aggregated_df['C']*aggregated_df['D']` gets what you want. – Acccumulation Apr 18 '21 at 22:26
  • @Acccumulation Ah - OK. Thanks for correcting my understanding! – Iterator516 Apr 19 '21 at 00:01
  • For the g2 and f3 part of my original question, could you please tell me how you interpreted that operation in plain English? – Iterator516 Apr 19 '21 at 01:10
  • I have finally understood why I was wrong and what groupby truly does. I posted this thread and got some good answers. https://stackoverflow.com/questions/67162749/confused-about-meaning-of-groupby-operation-with-multiple-columns-with-pandas-in/67164242 – Iterator516 Apr 19 '21 at 16:23
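
The aggregate-then-multiply approach described in the comments can be sketched as follows, assuming the goal is one row per value of P with the column sums and their product. Selecting `['B', 'C', 'D']` before summing is an assumption made here so that the string column A is left out of the sum; the name `totals` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': 'a a b b b'.split(),
    'P': 'p p p q q'.split(),
    'B': [1, 2, 3, 4, 5],
    'C': [4, 6, 5, 7, 8],
    'D': [9, 10, 11, 12, 13]})

def f3(x, y):
    return x * y

# Aggregate first: one row per group in P, holding the sum of each column
totals = df.groupby('P')[['B', 'C', 'D']].sum()

# Then apply f3 to the aggregated columns, element-wise across groups
totals['E'] = f3(totals['C'], totals['D'])

print(totals)
```

For group p this gives D = 9 + 10 + 11 = 30, and for group q it gives D = 12 + 13 = 25, matching the numbers worked out in the comments.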