
I have this minimal sample data:

import numpy as np
import pandas as pd
from pandas import Timestamp

data = pd.DataFrame({'Client': {0: "Client_1",  1: "Client_2",  2: "Client_2",  3: "Client_3",  4: "Client_3",  5: "Client_3",  6: "Client_4",  7: "Client_4"},
 'Id_Card': {0: 1,  1: 2,  2: 3,  3: 4,  4: 5,  5: 6,  6: 7,  7: 8},
 'Type': {0: 'A',  1: 'B',  2: 'C',  3: np.nan,  4: 'A',  5: 'B',  6: np.nan,  7: 'B'},
 'Loc': {0: 'ADW',  1: 'ZCW',  2: 'EWC',  3: "VWQ",  4: "OKS",  5: 'EQW',  6: "PKA",  7: 'CSA'},
 'Amount': {0: 10.0,  1: 15.0,  2: 17.0,  3: 32.0,  4: np.nan,  5: 51.0,  6: 38.0,  7: -20.0},
 'Net': {0: 30.0,  1: 42.0,  2: -10.0,  3: 15.0,  4: 98,  5: np.nan,  6: 23.0,  7: -10.0},
 'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data


I'm trying to aggregate this data, grouping by the Client column: counting the Id_Card values per client, concatenating Type and Loc within each column separated by ; (e.g. A;B and ZCW;EWC for Client_2, NOT A;ZCW B;EWC), summing Amount and Net per client, and taking the minimum Date per client. However, I'm facing some problems:

  1. These functions work perfectly individually, but I can't find a way to mix the aggregate function and the apply function:

Code example:

data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
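For what it's worth, `';'.join` is an ordinary callable, so it can sit next to string aggregations like "count" in a single agg dict, at least for columns without missing values. A minimal sketch on a toy frame (the data here is illustrative, not the question's):

```python
import pandas as pd

df = pd.DataFrame({"Client": ["c1", "c1", "c2"],
                   "Id_Card": [1, 2, 3],
                   "Loc": ["ADW", "ZCW", "EWC"]})

# ";".join is a plain callable, so it can live in the same dict as "count"
out = df.groupby("Client").agg({"Id_Card": "count", "Loc": ";".join})
```

This only works cleanly when the joined column has no NaN, which is exactly the second problem below.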
  2. The apply function doesn't work for columns with missing values:

Code example:

data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
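One way around that TypeError, assuming missing entries should simply be skipped, is to drop them inside the callable (toy data again):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Client": ["c1", "c1", "c2"],
                   "Type": ["A", np.nan, "B"]})

# dropna() removes the float NaN before join ever sees it
out = df.groupby("Client")["Type"].apply(lambda s: ";".join(s.dropna()))
```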
  3. The aggregate and apply functions don't allow me to pass multiple columns for one transformation:

Code example:

cols_to_sum = ["Amount", "Net"]
# fails: a list is not hashable, so it can't be used as a dict key
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum: "sum", "Date": "min"})

cols_to_join = ["Type", "Loc"]
# doesn't error, but joins the column names rather than the values
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()

In (3) I only used Amount and Net, and I could list them separately in the aggregate dictionary, but I'm looking for a more efficient way, as I'm working with plenty of columns.
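One workaround for the list-as-key problem is to expand the column lists into one key per column with a dict comprehension (sketch on toy data):

```python
import pandas as pd

df = pd.DataFrame({"Client": ["c1", "c1"],
                   "Id_Card": [1, 2],
                   "Amount": [10.0, 15.0],
                   "Net": [30.0, 42.0]})

cols_to_sum = ["Amount", "Net"]
# expand the list into one entry per column instead of using the list itself as a key
spec = {"Id_Card": "count", **{c: "sum" for c in cols_to_sum}}
out = df.groupby("Client").agg(spec)
```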

The expected output is the same dataframe, but aggregated with the conditions outlined at the beginning.


2 Answers


Go step by step, and prepare three different data frames to merge later. The first dataframe is for simple functions like count, sum, and min:

df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum", "Net": "sum", "Date": "min"}).reset_index()

Next you deal with the Type and Loc joins; fillna('') handles the NaN values:

df2=data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(
    ';'.join).reset_index()
df3=data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(
    ';'.join).reset_index()

And finally you merge the results together:

data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
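On a toy frame (not the full question data), the merge approach looks like this end to end; note that fillna('') keeps empty fragments in the joined string, which is one trade-off of this answer:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Client": ["c1", "c1", "c2"],
                     "Id_Card": [1, 2, 3],
                     "Type": ["A", np.nan, "B"],
                     "Amount": [10.0, 15.0, 17.0]})

df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum"}).reset_index()
# fillna('') leaves an empty fragment where the NaN was, e.g. "A;" for c1
df2 = data[["Client", "Type"]].fillna("").groupby("Client")["Type"].apply(";".join).reset_index()
data_new = df1.merge(df2, on="Client")
```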

data_new output: one aggregated row per Client, with the joined Type and Loc strings alongside the counted, summed, and minimum-date columns.


For doing a join, you have to filter out the NaN values. Since the join has to be applied in two places, I have created a separate function:

def join_non_nan_values(elements):
    return ";".join([elem for elem in elements if elem == elem])  # NaN != NaN, so this comparison filters out missing values

data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
                            "Loc": join_non_nan_values, "Amount":"sum", "Net": "sum", "Date": "min"})
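If many columns need the same treatment, the agg dict can also be built programmatically, which addresses the question's efficiency concern. A sketch on toy data (the column lists are illustrative):

```python
import numpy as np
import pandas as pd

def join_non_nan_values(elements):
    # NaN != NaN, so the comprehension filters missing values out
    return ";".join([e for e in elements if e == e])

data = pd.DataFrame({"Client": ["c1", "c1"],
                     "Id_Card": [1, 2],
                     "Type": ["A", np.nan],
                     "Amount": [10.0, 15.0]})

join_cols, sum_cols = ["Type"], ["Amount"]
# one dict entry per column, generated from the lists
spec = {"Id_Card": "count",
        **{c: join_non_nan_values for c in join_cols},
        **{c: "sum" for c in sum_cols}}
out = data.groupby("Client").agg(spec)
```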

