How can I perform the following operations on a pandas dataframe?
- combine text from one column, multiple rows into one row
- remove duplicates in the "one row"
- repeat 1 & 2 for multiple columns
Based on the following Stack Overflow questions and answers, I have made the attempted code below. The last attempt is close, but I do not know how to convert the set back into a string (i.e., remove the braces) and roll it into a lambda function that I can use applymap() for multiple columns.
- How to combine multiple rows into a single row with pandas [duplicate]
- Concatenate strings from several rows using Pandas groupby
- Remove duplicates from rows and columns (cell) in a dataframe, python
Example Dataframe
id = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4]
colA = ['type12', 'type11', 'type11', 'type11', 'type21', 'type21',
'type22', 'type23', 'type23', 'type23', 'type31', 'type31',
'type31', 'type31', 'type41', 'type41', 'type42', 'type41',
'type41', 'type43'
]
colB = ['Set A', 'Set B', 'Set B', 'Set B', 'Set B', 'Set B', 'Set A',
'Set B', 'Set C', 'Set C', 'Set B', 'Set C', 'Set B', 'Set C',
'Set B', 'Set B', 'Set A', 'Set C', 'Set B', 'Set A'
]
colC = ['alpha', 'beta', 'delta', 'charlie', 'beta', 'delta', 'alpha',
'charlie', 'charlie', 'delta', 'delta', 'charlie', 'beta',
'delta', 'beta', 'charlie', 'alpha', 'charlie', 'delta', 'alpha'
]
df = pd.DataFrame(list(zip(id, colA, colB, colC)), columns =['id', 'colA', 'colB', 'colC'])
print(df)
id colA colB colC
0 1 type12 Set A alpha
1 1 type11 Set B beta
2 1 type11 Set B delta
3 1 type11 Set B charlie
4 2 type21 Set B beta
5 2 type21 Set B delta
6 2 type22 Set A alpha
7 2 type23 Set B charlie
8 2 type23 Set C charlie
9 2 type23 Set C delta
10 3 type31 Set B delta
11 3 type31 Set C charlie
12 3 type31 Set B beta
13 3 type31 Set C delta
14 4 type41 Set B beta
15 4 type41 Set B charlie
16 4 type42 Set A alpha
17 4 type41 Set C charlie
18 4 type41 Set B delta
19 4 type43 Set A alpha
Desired Output
id colA colB colC
1 type11 Set B beta, delta, charlie
1 type12 Set A alpha
2 type21 Set B beta, delta
2 type22 Set A alpha
2 type23 Set B, Set C charlie, delta
3 type31 Set B, Set C beta, delta, charlie
4 type41 Set B, Set C beta, delta, charlie
4 type42 Set A alpha
4 type43 Set A alpha
Attempt for one column only that returns a string but still has duplicates
df2 = df.groupby(['id', 'colA'])['colB'].apply(', '.join).reset_index()
print(df2)
id colA colB
0 1 type11 Set B, Set B, Set B
1 1 type12 Set A
2 2 type21 Set B, Set B
3 2 type22 Set A
4 2 type23 Set B, Set C, Set C
5 3 type31 Set B, Set C, Set B, Set C
6 4 type41 Set B, Set B, Set C, Set B
7 4 type42 Set A
8 4 type43 Set A
Attempt for one column only that removes duplicates but returns a set
df2 = df.groupby(['id', 'colA'])['colB'].apply(list).apply(set).reset_index()
print(df2)
id colA colB
0 1 type11 {Set B}
1 1 type12 {Set A}
2 2 type21 {Set B}
3 2 type22 {Set A}
4 2 type23 {Set B, Set C}
5 3 type31 {Set B, Set C}
6 4 type41 {Set B, Set C}
7 4 type42 {Set A}
8 4 type43 {Set A}