pandas groupby and join lists

Question

I have a dataframe df, with two columns, I want to groupby one column and join the lists belongs to same group, example:

column_a, column_b
1,         [1,2,3]
1,         [2,5]
2,         [5,6]

after the process:

column_a, column_b
1,         [1,2,3,2,5]
2,         [5,6]

I want to keep all the duplicates. I have the following questions:

The dtypes of the dataframe are object(s). convert_objects() doesn't convert column_b to list automatically. How can I do this?
what does the function in df.groupby(...).apply(lambda x: ...) apply to ? what is the form of x ? list?
the solution to my main problem?

Thanks in advance.

score 91 · Accepted Answer · edited Apr 09 '21 at 14:52

91

object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta. So it is storing them as a list. convert_objects tries to convert a column to one of those dtypes.

You want

In [63]: df
Out[63]: 
   a          b    c
0  1  [1, 2, 3]  foo
1  1     [2, 5]  bar
2  2     [5, 6]  baz


In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]: 
         c                b
a                          
1  foo bar  [1, 2, 3, 2, 5]
2      baz           [5, 6]

This groups the data frame by the values in column a. Read more about groupby.

This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5] with the result [1, 2, 3, 2, 5]

edited Apr 09 '21 at 14:52

Partha D.

244
5
19

answered May 21 '14 at 21:38

TomAugspurger

28,234
8
86
69

1

thanks, I have another question: what if I have a third column that is string type, and I want to ' '.join them by column one's group. What should I do? – fast tooth May 22 '14 at 01:52
Strings are like lists of characters. So `.sum()` should work there as well. – TomAugspurger May 22 '14 at 01:54
Yes, but I want a ' ' (whitespace) between each concatenated string. – fast tooth May 22 '14 at 01:57
I hope i can upvote more, thank! that lambda function is applied on lists? [ 'foo', 'bar'], and ['baz'] ? – fast tooth May 22 '14 at 02:33
4

why not just `'c': ' '.join` – modesitt Mar 03 '20 at 00:48

score 24 · Answer 2 · answered May 21 '14 at 21:40

24

df.groupby('column_a').agg(sum)

This works because of operator overloading sum concatenates the lists together. The index of the resulting df will be the values from column_a:

answered May 21 '14 at 21:40

qwwqwwq

6,999
2
26
49

2

Nice solution (realizing after 7 years). A corollary observation: when `sum`ming up lists _and_ numbers at the same time, the single `.agg(sum)` only sums up the numeric cols. In that case the dictionary `{'a': sum, 'b': sum}` --thanks @TomAugspurger-- works perfect ! – Partha D. Apr 09 '21 at 07:46

mozway · Answer 3 · 2022-09-15T13:46:46.253

The accepted answer suggests to use groupby.sum, which is working fine with small number of lists, however using sum to concatenate lists is quadratic.

For a larger number of lists, a much faster option would be to use itertools.chain or a list comprehension:

df = pd.DataFrame({'column_a': ['1', '1', '2'],
                   'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})

itertools.chain:

from itertools import chain
out = (df.groupby('column_a', as_index=False)['column_b']
         .agg(lambda x: list(chain.from_iterable(x)))
       )

list comprehension:

out = (df.groupby('column_a', as_index=False, sort=False)['column_b']
         .agg(lambda x: [e for l in x for e in l])
      )

output:

  column_a         column_b
0        1  [1, 2, 3, 2, 5]
1        2           [5, 6]

Comparison of speed

Using n repeats of the example to show the impact of the number of lists to merge:

test_df = pd.concat([df]*n, ignore_index=True)

NB. also comparing the numpy approach (agg(lambda x: np.concatenate(x.to_numpy()).tolist())).

score 3 · Answer 4 · answered May 12 '21 at 15:02

The approach proposed above using df.groupby('column_a').agg(sum) definetly works. However, you have to make sure that your list only contains integers, otherwise the output will not be the same.

If you want to convert all of the lists items into integers, you can use:

df['column_a'] = df['column_a'].apply(lambda x: list(map(int, x)))

score 0 · Answer 5 · answered Jul 10 '20 at 08:23

Use numpy and simple "for" or "map":

import numpy as np

u_clm = np.unique(df.column_a.values)
all_lists = []

for clm in u_clm:
    df_process = df.query('column_a == @clm')
    list_ = np.concatenate(df.column_b.values)
    all_lists.append((clm, list_.tolist()))

df_sum_lists = pd.DataFrame(all_lists)

It's faster in 350 times than a simple "groupby-agg-sum" approach for huge datasets.

score -1 · Answer 6 · answered Dec 01 '22 at 19:06

-1

Thanks, helped me merge.fillna("", inplace = True) new_merge = merge.groupby(['id']).agg({ 'q1':lambda x: ','.join(x), 'q2':lambda x: ','.join(x),'q2_bookcode':lambda x: ','.join(x), 'q1_bookcode':lambda x: ','.join(x)})

answered Dec 01 '22 at 19:06

Deeksha Shandilya

15
2

pandas groupby and join lists

6 Answers6

Comparison of speed

Linked

Related