
I have been looking for an answer without success (1,2,3). Most of the questions I have found about string aggregation involve only columns that are all strings; this is a mixed aggregation with some specific details.

The df:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group': ['Group_1', 'Group_1','Group_1', 'Group_1', 'Group_2', 'Group_2', 'Group_2', 'Group_2', 'Group_2', 'Group_2'],
    'Col1': ['A','A','B',np.nan,'B','B','C','C','C','C'],
    'Col2': [1,2,3,3,5,5,5,7,np.nan,7],
    'Col3': [np.nan, np.nan, np.nan,np.nan,3,2,3,4,5,5],
    'Col4_to_Col99': ['some value','some value','some value','some value','some value','some value','some value','some value','some value','some value']
})


The function used for the output (source):

def join_non_nan_values(elements):
    # NaN != NaN, so `elem == elem` is False for NaN and filters it out
    return ";".join([elem for elem in elements if elem == elem])

The output:

df.groupby('Group')[['Col1', 'Col2', 'Col3']].agg({'Col1': join_non_nan_values, 'Col2': 'count', 'Col3':'mean'})

                Col1  Col2      Col3
Group
Group_1        A;A;B     4       NaN
Group_2  B;B;C;C;C;C     5  3.666667

The expected output: for Col1 and Col2, a value count per group, with the value on the left and its count on the right (e.g. A:2;B:1).

PS: If you know a more efficient way to implement the join_non_nan_values function, you are welcome! (It currently takes a while to run.) Just remember that it needs to skip missing values.
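On the efficiency question: one option is to drop the NaNs with Series.dropna() before joining, which avoids the per-element elem == elem test. A sketch on the sample frame, reduced to the relevant columns:

```python
import numpy as np
import pandas as pd

# sample frame reduced to the relevant columns
df = pd.DataFrame({
    'Group': ['Group_1'] * 4 + ['Group_2'] * 6,
    'Col1': ['A', 'A', 'B', np.nan, 'B', 'B', 'C', 'C', 'C', 'C'],
})

# dropna() strips the missing values before the join, so no
# per-element NaN check is needed inside the list comprehension
joined = df.groupby('Group')['Col1'].agg(lambda s: ';'.join(s.dropna()))
```

This gives 'A;A;B' for Group_1 and 'B;B;C;C;C;C' for Group_2, the same result as join_non_nan_values.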


2 Answers


You can try calling value_counts() inside groupby().apply() and converting the outcome into strings with the str.join() method. To have a DataFrame (not a Series) returned as the output, pass the as_index=False parameter to groupby().

def func(g):
    """
    (i)  Count the values in Col1 and Col2 columns by calling value_counts() on each column 
         and convert the output into strings via join() method
    (ii) Calculate mean of Col3
    """
    
    col1 = ';'.join([f'{k}:{v}' for k,v in g['Col1'].value_counts(sort=False).items()])
    col2 = ';'.join([f'{int(k)}:{v}' for k,v in g['Col2'].value_counts(sort=False).items()])
    col3 = g['Col3'].mean()
    return col1, col2, col3

# group by Group and apply func to specific columns
result = df.groupby('Group', as_index=False)[['Col1','Col2','Col3']].apply(func)
result



You can try this:

def f(x):
    c = x.value_counts().sort_index()
    return ";".join(f"{k}:{v}" for (k, v) in c.items())

df["Col2"] = df["Col2"].astype('Int64')  # nullable integer dtype, so keys print as 5:3 rather than 5.0:3
df.groupby("Group")[["Col1", "Col2", "Col3"]].agg({
    "Col1": f,
    "Col2": f,
    "Col3": 'mean'
})

It gives:

         Col1         Col2      Col3
Group                                  
Group_1  A:2;B:1  1:1;2:1;3:2       NaN
Group_2  B:2;C:4      5:3;7:2  3.666667
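The astype('Int64') cast matters because the NaN in Col2 forces the column into float64, which would make the value_counts keys render as floats. A minimal sketch of the difference:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
# NaN forces the column to float64, so the keys would print as 1.0, 2.0, ...
assert s.dtype == 'float64'

# the nullable Int64 dtype keeps whole numbers while still allowing missing
# values; value_counts() drops the missing entry by default
s_int = s.astype('Int64')
labels = [f"{k}:{v}" for k, v in s_int.value_counts().sort_index().items()]
```

Here labels comes out as ['1:1', '2:1'] instead of ['1.0:1', '2.0:1'].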