1

I have the following Pandas dataframe:

A        B         C
A        A         Test1
A        A         Test2
A        A         XYZ
A        B         BA
A        B         AB
B        A         AA

I want to group this dataset twice: First by A and B to concate the group within C and afterwards only on A to get the groups defined solely by column A. The result looks like this:

A        A        Test1,Test2,XYZ
A        B        AB, BA
B        A        AA

And the final result should be:

A        A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B        B,A:(AA)

Concatenating itself works, however the sorting does not seem work.

Can anyone help me with this problem?

Kind regards.

bublitz
  • 888
  • 2
  • 11
  • 21

2 Answers2

3

Using groupby + join

s1=df.groupby(['A','B']).C.apply(','.join)

s1
Out[421]: 
A  B
A  A    Test1,Test2,XYZ
   B              BA,AB
B  A                 AA
Name: C, dtype: object

s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]: 
A
A    {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B                                   {('B', 'A'): 'AA'}
dtype: object
BENY
  • 317,841
  • 20
  • 164
  • 234
  • The problem is, that the concatenated list should be sorted, i.e.: first 'A,A' and then 'A,B' for row A , but also first 'AB' and then 'BA' (concatenated list after A,B) – bublitz Aug 09 '18 at 20:29
  • @bublitz do sort before gourpby ,df=df.sort_values('C') – BENY Aug 09 '18 at 20:44
1

First sort_values by 3 columns, then groupby with join first, then join A with B columns and last groupby for dictionary per groups:

df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if only 3 columns DataFrame
#df1 = df.sort_values().groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
   A  B                C    D
0  A  A  Test1,Test2,XYZ  A,A
1  A  B            AB,BA  A,B
2  B  A               AA  B,A

s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
   A                                         val
0  A  {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1  B                               {'B,A': 'AA'}

If need tuples only change first part of code:

df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
   A  B                    C    D
0  A  A  (Test1, Test2, XYZ)  A,A
1  A  B             (AB, BA)  A,B
2  B  A                (AA,)  B,A

s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
   A                                                val
0  A  {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1  B                                   {'B,A': ('AA',)}
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Downvoter, if there's something wrong with my answer, please let me know, so I can correct it. Thanks. – jezrael Aug 09 '18 at 15:55
  • The problem is, that the concatenated list should be sorted, i.e.: first 'A,A' and then 'A,B' for row A , but also first 'AB' and then 'BA' (concatenated list after A,B) – bublitz Aug 09 '18 at 20:27
  • You can use `df.sort_values(['A','B','C'])`, check edited answer. – jezrael Aug 10 '18 at 04:14