Merge rows of a dataframe according to values in a specific column by concatenating the strings in the other columns into one

Question

I have a dataframe that looks like this:

 df1 = pd.DataFrame({
                   "Business_Process_Activity" : ["SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
                   "Case":[1,1,2,2,2,3,4],
                   "Application":["MicrosoftWord", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftOutlook", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"], 
                   "Activity_of_the_User":["SavingADocument", "SendingAnEmail", "SavingADocument", "SendingAnEmail", "SendingAnEmail", "SavingADocument", "SavingADocument"],
                   "Receiver_email_root":["None", "idatta91 adarandall larryjacob", "None", "idatta91 larryjacob"," vanessaHudgens prithakaur", "None", "None"],
                   "Receiever_email_domains":["None", "gmail yahoo", "None", "gmail", "gmail yahoo", "None", "None"],
                   "Receiver_email_count_Catg":["None", "Few", "None", "Double", "Double", "None", "None"],
                   "Subject":["None","Activity Report", "None", "Project Progress Report", "Project Progress Report 2", "None", "None"]
                   })

I want to merge the rows of the dataframe according to the column Case. So, if the number in the Case column of two or more rows is the same then the strings of the other columns of those rows are concatenated into one row.

Also the the values in the column Business_Process_Activity is the same for the cases of the same number. For that column, I do not want to concatenate the Business_Process_Activity values but keep only one of them since that column needs to be categorical. I want the final dataframe to look like this:

df2 = pd.DataFrame({"Case":[1,2,3,4],
               "Business_Process_Activity" : ["SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
               "Application":["MicrosoftWord MicrosoftOutlook", "MicrosoftWord MicrosoftOutlook MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"], 
               "Activity_of_the_User":["SavingADocument SendingAnEmail","SavingADocument SendingAnEmail SendingAnEmail", "SavingADocument", "SavingADocument"],
               "Receiver_email_root":["idatta91 adarandall larryjacob", "idatta91 larryjacob vanessaHudgens prithakaur", "None", "None"],
               "Receiever_email_domains":["gmail yahoo","gmail gmail yahoo", "None", "None"],
               "Receiver_email_count_Catg":["Few", "Double Double", "None", "None"],
               "Subject":["Activity Report", "Project Progress Report Project Progress Report 2", "None", "None"]
               })

If strings are merged with a "None" column, the "None" string should be deleted since that value is not empty anymore. The duplicate number of the case columns should be deleted as the rows are merged into one.

How do I do this? Thanks in advance!

Some thing like this, https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby — sushanth, Jun 07 '20 at 11:55
Not a `machine-learning` or `nlp` question, kindly do not spam irrelevant tags (removed). — desertnaut, Jun 07 '20 at 12:23

jezrael · Accepted Answer · 2020-06-07T12:17:36.367

1

Idea is remove None values and also None strings per groups, join together and last replace empty strings to None:

df = (df1.groupby('Case')
         .agg(lambda x: ' '.join(x[x.ne('None') & x.notna()]))
         .where(lambda x: x.astype(bool), None)
         .reset_index())

Another solution with custom function:

def f(x):
   y = x[x.ne('None') & x.notna()]
   return None if y.empty else ' '.join(y)

df = df1.groupby('Case').agg(f).reset_index()

edited Jun 07 '20 at 12:17

answered Jun 07 '20 at 11:58

jezrael

822,522
95
1,334
1,252

Hi! Your response was helpful! But I forgot to put one important column in the example dataframe called "Business_Process_Activity". For that column I want it to happen differently. I have updated the question again. Can you please review it one more time? Thanks in advance! :) – Indy Jun 07 '20 at 13:13
@Indy - You can change `df1.groupby('Case')` to `df1.groupby(['Business_Process_Activity','Case'])` – jezrael Jun 08 '20 at 05:08

Shubham Sharma · Answer 2 · 2020-06-07T13:56:16.653

1

Use:

g = df1.groupby('Case')
df2 = g.agg(lambda s: ' '.join(s[s.ne('None')] if s.ne('None').any() else ['None']))
df2['Business_Process_Activity'] = g['Business_Process_Activity'].first()
df2 = df2.reset_index()

# print(df2)



   Case  Business_Process_Activity  ... Receiver_email_count_Catg                                            Subject
0     1     SendingReportToManager  ...                       Few                                    Activity Report
1     2     SendingReportToManager  ...             Double Double  Project Progress Report Project Progress Report 2
2     3  PreparingAndSendingAgenda  ...                      None                                               None
3     4  PreparingAndSendingAgenda  ...                      None                                               None

edited Jun 07 '20 at 13:56

answered Jun 07 '20 at 12:06

Shubham Sharma

68,127
6
24
53

Hi! Your response was helpful! But I forgot to put one important column in the example dataframe called "Business_Process_Activity". For that column I want it to happen differently. I have updated the question again. Can you please review it one more time? Thanks in advance! :) – Indy Jun 07 '20 at 12:54
The problem is your answer creates multiple "None" values in the columns where all the values of the rows merged have "None" values. – Indy Jun 07 '20 at 13:35
@Indy I guess you want string repr of None value, edited the answer. – Shubham Sharma Jun 07 '20 at 13:56

Merge rows of a dataframe according to values in a specific column by concatenating the strings in the other columns into one

2 Answers2