
I have a dataframe like:

Task  ID  Value  Sno
 A    A1    5     1
 A    A1    2     2
 B    A1    4     3
 A    A1    1     4
 A    B1    10    5
 C    B1    3     6
 D    B1    5     7
 D    B1    2     8
 D    B1    12    9
 E    C1    25    10

And the expected output should look like

Task  ID  Value  Sno
 A    A1    7    1,2
 B    A1    4     3
 A    A1    1     4
 A    B1    10    5
 C    B1    3     6
 D    B1    19   7,8,9
 E    C1    25    10

So, for each record where the ID is the same, consecutive rows with the same 'Task' have to be merged and their 'Value' summed. My approach is listed below, but it does not handle cases where the same 'Task' occurs more than twice in a row (the 'D' task in the example). Also, I am not able to drop the rows after merging, and I don't know why.

    for j in range(len(df)-1):
        if df.loc[j, 'Task'] == df.loc[j+1, 'Task'] and df.loc[j, 'ID'] == df.loc[j+1, 'ID']:
            df.loc[j, 'Value'] = df.loc[j, 'Value'] + df.loc[j+1, 'Value']
            df.loc[j, 'ID'] = str(df.loc[j, 'ID']) + "," + str(df.loc[j+1, 'ID'])
            df.drop([j+1])
  • Duplicate Task A in a row is a typo? – crissal May 16 '21 at 13:49
  • Does this answer your question? [How to group dataframe rows into list in pandas groupby](https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby) – roganjosh May 16 '21 at 13:50
  • @crissal Nope, the data follows the same format – Xavier May 16 '21 at 14:04
  • @roganjosh I have seen that post and similar ones too, but there the grouping is done across all instances; what I want is grouping only for consecutive records, as shown – Xavier May 16 '21 at 14:07

1 Answer


This is pretty similar to a groupby-and-aggregate with a join. However, you first have to create a helper column that identifies the consecutive groups:

# Flag rows where Task or ID differs from the previous row; the cumulative
# sum then gives each run of consecutive (Task, ID) pairs its own label
u = df[['Task','ID']]
g = u.ne(u.shift()).any(axis=1).cumsum()

# Sum Value and join the Sno values of each run as a comma-separated string
d = {"Value":"sum","Sno":lambda x: ','.join(x.astype(str))}
#d = {"Value":"sum","Sno":lambda x: ','.join([str(i) for i in x])}

out = df.groupby(['Task','ID',g]).agg(d).droplevel(-1).reset_index()

print(out)

  Task  ID  Value    Sno
0    A  A1      7    1,2
1    A  A1      1      4
2    A  B1     10      5
3    B  A1      4      3
4    C  B1      3      6
5    D  B1     19  7,8,9
6    E  C1     25     10
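To make the grouping key concrete, here is a quick sketch that rebuilds the question's sample data and shows what the helper series `g` evaluates to (each run of consecutive identical `(Task, ID)` pairs gets one label):

```python
import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    'Task':  ['A', 'A', 'B', 'A', 'A', 'C', 'D', 'D', 'D', 'E'],
    'ID':    ['A1', 'A1', 'A1', 'A1', 'B1', 'B1', 'B1', 'B1', 'B1', 'C1'],
    'Value': [5, 2, 4, 1, 10, 3, 5, 2, 12, 25],
    'Sno':   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

u = df[['Task', 'ID']]
# True wherever Task or ID changes from the previous row; cumsum turns
# each run of unchanged rows into a single group label
g = u.ne(u.shift()).any(axis=1).cumsum()
print(g.tolist())  # [1, 1, 2, 3, 4, 5, 6, 6, 6, 7]
```

Note that rows 1 and 3 (both `A`/`A1`) get different labels (1 and 3) because they are not consecutive, which is exactly the distinction a plain `groupby(['Task','ID'])` would miss.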
  • this answers the question perfectly, but is there a way the order of the data can be preserved, as it is of importance? Currently the output is sorted based on "Task". Also, how to include other columns apart from those mentioned (their values are the same for the rows being merged)? – Xavier May 16 '21 at 14:56
  • @Xavier you can pass `sort=False` in groupby `df.groupby(['Task','ID',g],sort=False)....` – anky May 16 '21 at 14:57
  • okay got it, and can you guide on how to include other columns in the resultset as well – Xavier May 16 '21 at 15:29
  • @Xavier if they are not a grouper column, you would have to pass some sort of aggregate function for them in the dictionary, e.g. `{...., 'Colx': 'first'}` – anky May 16 '21 at 15:57
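Putting the two follow-up comments together, here is a sketch that preserves the original row order with `sort=False` and carries an extra column through with `'first'`. The `Extra` column is hypothetical, added only to illustrate the aggregation; it is assumed constant within each merged run, as the comment describes:

```python
import pandas as pd

df = pd.DataFrame({
    'Task':  ['A', 'A', 'B', 'A', 'A', 'C', 'D', 'D', 'D', 'E'],
    'ID':    ['A1', 'A1', 'A1', 'A1', 'B1', 'B1', 'B1', 'B1', 'B1', 'C1'],
    'Value': [5, 2, 4, 1, 10, 3, 5, 2, 12, 25],
    'Sno':   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    # Hypothetical extra column, constant within each consecutive run
    'Extra': ['x', 'x', 'y', 'z', 'w', 'v', 'u', 'u', 'u', 't'],
})

u = df[['Task', 'ID']]
g = u.ne(u.shift()).any(axis=1).cumsum()

# sort=False keeps groups in order of first appearance instead of
# sorting by Task; 'first' keeps one value of Extra per merged run
d = {'Value': 'sum',
     'Sno': lambda x: ','.join(x.astype(str)),
     'Extra': 'first'}
out = df.groupby(['Task', 'ID', g], sort=False).agg(d).droplevel(-1).reset_index()
print(out)
```

With `sort=False` the result rows come out in the same order as the expected output in the question (A, B, A, A, C, D, E) rather than alphabetically by Task.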