I can't take credit for this, but I'm not sure how to present it outside of a separate answer.
I am also new at Python. Using padu's logic code, I've adapted it to your original post's data.
I also substituted ['B']
with .reset_index()
to keep the original columns, as the sums were previously merged with the newly grouped values in a single column. I added also added numeric_only=True
because without specifying a column to sum, it would normally assume to stick with numeric columns, but the assumption is now a deprecated feature, as per this message I got:
The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum().reset_index()
You may also want to add df =
at the beginning of the last line, as below, to keep the result in the original variable name. Missing this step caused me to lose lots of time while learning to debug.
import pandas as pd
df = pd.DataFrame([
['Direct, Organic Search',171],
['Direct, Organic Search, Not Found',4],
['Direct, Organic Search, Not Found, Referral',12],
['Direct, Organic Search, Not Found, Referral, Email',2],
['Organic Search, Direct',389],
['Organic Search, Direct, Not Found',3],
['Organic Search, Direct, Referral',76],
['Organic Search, Direct, Referral, Social Network',1]],
columns =['MCF Channel Grouping Path_1', 'Conversions'])
def sort_groupings(row_group):
row_group = row_group.split(', ') # Converts string into list using ', ' as the separator
row_group.sort() # Sorts newly, made list in ascending order
return tuple(row_group) # Converts list to a tuple (required by the 'apply' function), and returns it
df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum(numeric_only=True).reset_index()
# groupby: column in which to group common values
# apply: call a function to run on values specified
# reset_index: reinstates the first column in the final df