Group lists of different rows by multiple columns values with Pandas

Question

I have a dataframe df1 like this:

import pandas as pd
dic = {'A':[0,0,2,2,2,1,5,5],'B':[[1,5,3,8],[1,8,7,5],[7,8,9,5],[3],[1,5,9,3],[0,3,5],[],[4,2,3,1]],'C':['a','b','c','c','d','e','f','f'],'D':['0','8','7','6','4','5','2','2']}
df1 = pd.DataFrame(dic)

and looks like this:

#Initial dataframe 
   A             B  C  D
0  0  [1, 5, 3, 8]  a  0
1  0  [1, 8, 7, 5]  b  8
2  2  [7, 8, 9, 5]  c  7
3  2           [3]  c  6
4  2  [1, 5, 9, 3]  d  4
5  1     [0, 3, 5]  e  5
6  5            []  f  2
7  5  [4, 2, 3, 1]  f  2

My goal is to group rows that have the same values in column A and C and merge the content of column B in such a way that the result looks like this:

#My GOAL
   A                B  C
0  0     [1, 5, 3, 8]  a
1  0     [1, 8, 7, 5]  b
2  2  [3, 7, 8, 9, 5]  c
3  2     [1, 5, 9, 3]  d
4  1        [0, 3, 5]  e
5  5     [4, 2, 3, 1]  f

As you can see, rows having the same items in column A and C are merged while if at least one is different they are left as is. My idea was to use the groupby and sum functions like this:

df1.groupby(by=['A','C'],as_index=False,sort=True).sum()

but Python returns an error message: Function does not reduce

Could you please tell me what is wrong with my line of code? What should I write in order to achieve my goal?

Note: I do not care about what happens to column D which can be discarted.

Nickil Maveli · Accepted Answer · 2017-02-01T17:43:41.877

One of the possibilities would be to flatten the list of lists until it gets exhausted with the help of itertools.chain(*iterables)

import itertools
df1.groupby(['A', 'C'])['B'].apply(lambda x: list(itertools.chain(*x))).reset_index()

(Or)

Use sum with lambda:

df1.groupby(by=['A','C'])['B'].apply(lambda x: x.sum()).reset_index()

Both yield:

By default, groupby().sum() looks for numeric types (scalar) values to perform aggregation and not a collection of elements like list for example.

score 1 · Answer 2 · edited May 23 '17 at 12:01

1

Another possibility:

df1.groupby(by=['A','C'],as_index=False,sort=True).agg({'B': lambda x: tuple(sum(x, []))})

Result:

   A  C                B
0  0  a     (1, 5, 3, 8)
1  0  b     (1, 8, 7, 5)
2  1  e        (0, 3, 5)
3  2  c  (7, 8, 9, 5, 3)
4  2  d     (1, 5, 9, 3)
5  5  f     (4, 2, 3, 1)

Based in this answer (it seems that lists do not work too well with aggregation).

edited May 23 '17 at 12:01

Community

1
1

answered Feb 01 '17 at 17:23

jdehesa

58,456
7
77
121

Group lists of different rows by multiple columns values with Pandas

2 Answers2