0

I have a dataframe that looks like this:

 df1 = pd.DataFrame({
                   "Business_Process_Activity" : ["SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
                   "Case":[1,1,2,2,2,3,4],
                   "Application":["MicrosoftWord", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftOutlook", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"], 
                   "Activity_of_the_User":["SavingADocument", "SendingAnEmail", "SavingADocument", "SendingAnEmail", "SendingAnEmail", "SavingADocument", "SavingADocument"],
                   "Receiver_email_root":["None", "idatta91 adarandall larryjacob", "None", "idatta91 larryjacob"," vanessaHudgens prithakaur", "None", "None"],
                   "Receiever_email_domains":["None", "gmail yahoo", "None", "gmail", "gmail yahoo", "None", "None"],
                   "Receiver_email_count_Catg":["None", "Few", "None", "Double", "Double", "None", "None"],
                   "Subject":["None","Activity Report", "None", "Project Progress Report", "Project Progress Report 2", "None", "None"]
                   })

I want to merge the rows of the dataframe according to the column Case. So, if the number in the Case column of two or more rows is the same then the strings of the other columns of those rows are concatenated into one row.

Also the the values in the column Business_Process_Activity is the same for the cases of the same number. For that column, I do not want to concatenate the Business_Process_Activity values but keep only one of them since that column needs to be categorical. I want the final dataframe to look like this:

df2 = pd.DataFrame({"Case":[1,2,3,4],
               "Business_Process_Activity" : ["SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
               "Application":["MicrosoftWord MicrosoftOutlook", "MicrosoftWord MicrosoftOutlook MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"], 
               "Activity_of_the_User":["SavingADocument SendingAnEmail","SavingADocument SendingAnEmail SendingAnEmail", "SavingADocument", "SavingADocument"],
               "Receiver_email_root":["idatta91 adarandall larryjacob", "idatta91 larryjacob vanessaHudgens prithakaur", "None", "None"],
               "Receiever_email_domains":["gmail yahoo","gmail gmail yahoo", "None", "None"],
               "Receiver_email_count_Catg":["Few", "Double Double", "None", "None"],
               "Subject":["Activity Report", "Project Progress Report Project Progress Report 2", "None", "None"]
               })

If strings are merged with a "None" column, the "None" string should be deleted since that value is not empty anymore. The duplicate number of the case columns should be deleted as the rows are merged into one.

How do I do this? Thanks in advance!

Indy
  • 135
  • 1
  • 8
  • Some thing like this, https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby – sushanth Jun 07 '20 at 11:55
  • Not a `machine-learning` or `nlp` question, kindly do not spam irrelevant tags (removed). – desertnaut Jun 07 '20 at 12:23

2 Answers2

1

Idea is remove None values and also None strings per groups, join together and last replace empty strings to None:

df = (df1.groupby('Case')
         .agg(lambda x: ' '.join(x[x.ne('None') & x.notna()]))
         .where(lambda x: x.astype(bool), None)
         .reset_index())

Another solution with custom function:

def f(x):
   y = x[x.ne('None') & x.notna()]
   return None if y.empty else ' '.join(y)

df = df1.groupby('Case').agg(f).reset_index()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi! Your response was helpful! But I forgot to put one important column in the example dataframe called "Business_Process_Activity". For that column I want it to happen differently. I have updated the question again. Can you please review it one more time? Thanks in advance! :) – Indy Jun 07 '20 at 13:13
  • @Indy - You can change `df1.groupby('Case')` to `df1.groupby(['Business_Process_Activity','Case'])` – jezrael Jun 08 '20 at 05:08
1

Use:

g = df1.groupby('Case')
df2 = g.agg(lambda s: ' '.join(s[s.ne('None')] if s.ne('None').any() else ['None']))
df2['Business_Process_Activity'] = g['Business_Process_Activity'].first()
df2 = df2.reset_index()

# print(df2)



   Case  Business_Process_Activity  ... Receiver_email_count_Catg                                            Subject
0     1     SendingReportToManager  ...                       Few                                    Activity Report
1     2     SendingReportToManager  ...             Double Double  Project Progress Report Project Progress Report 2
2     3  PreparingAndSendingAgenda  ...                      None                                               None
3     4  PreparingAndSendingAgenda  ...                      None                                               None
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • Hi! Your response was helpful! But I forgot to put one important column in the example dataframe called "Business_Process_Activity". For that column I want it to happen differently. I have updated the question again. Can you please review it one more time? Thanks in advance! :) – Indy Jun 07 '20 at 12:54
  • The problem is your answer creates multiple "None" values in the columns where all the values of the rows merged have "None" values. – Indy Jun 07 '20 at 13:35
  • @Indy I guess you want string repr of None value, edited the answer. – Shubham Sharma Jun 07 '20 at 13:56