
I have a pandas DataFrame `df` like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get the second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?

Trenton McKinney
Abhishek Thakur

16 Answers


You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: import pandas as pd
        df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'], 'b': [1, 2, 5, 5, 4, 6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]
EdChum
  • This takes a lot of time if the dataset is huge, say 10 million rows. Is there any faster way to do this? The number of uniques in 'a' is, however, around 500k – Abhishek Thakur Mar 06 '14 at 11:12
  • groupby is notoriously slow and memory hungry; what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe, which would be faster I think – EdChum Mar 06 '14 at 11:32
  • When I tried this solution with my problem (having multiple columns to group by and to group), it didn't work - pandas raised 'Function does not reduce'. Then I used `tuple` following the second answer here: http://stackoverflow.com/questions/19530568/can-pandas-groupby-aggregate-into-a-list-rather-than-sum-mean-etc . See the second answer in https://stackoverflow.com/questions/27439023/pandas-groupby-agg-function-does-not-reduce/37955931#37955931 for an explanation. – Andarin Jun 24 '16 at 10:54
  • This solution is good, but is there a way to store a set of the list, meaning can I remove the duplicates and then store it? – Sriram Arvind Lakshmanakumar Jan 18 '19 at 10:59
  • You mean `df.groupby('a')['b'].apply(lambda x: list(set(x)))` – EdChum Jan 18 '19 at 11:02
  • You don't need to use groupby. Just take a set of column 'a', and do a subset of the dataframe for 'A', 'B', etc. Then fetch column 'b' in the subset and put those values in a list. – Catiger3331 Apr 18 '19 at 14:58
  • But how would the code be written if you had another column `c` which also had numbers that had to be put in a list? – Outcast Jun 07 '19 at 15:19
  • @PoeteMaudit Sorry, I don't understand what you're asking, and asking questions in comments is bad form on SO. Are you asking how to concatenate multiple columns into a single list? – EdChum Jun 07 '19 at 15:31
  • No worries, I was asking for this basically: https://stackoverflow.com/a/53088007/9024698 – Outcast Jun 07 '19 at 15:32
  • In pandas 0.23.x, apply does not work. I needed to use the 'agg' function. – Edward Aung Jul 08 '19 at 00:35
  • @EdwardAung this still works for me using pandas version `'0.24.2'`; you'd have to post an example where this fails – EdChum Jul 08 '19 at 11:06
  • If you make it `df.groupby('a')['b'].apply(list).apply(pd.Series)` you get columns with one entry each instead of one column with lists, which can be very useful. – ic_fl2 Oct 22 '19 at 06:59
  • Empirically, I found `.apply(np.array)` to be slightly faster on my 25K dataset. – Dave Liu Nov 12 '19 at 00:39
  • I understand how groupby and apply work in this example. You get a SeriesGroupBy object. But can someone explain why calling reset_index on this magically creates a dataframe? – rioZg Feb 07 '21 at 19:11
  • Is there a way to do it for multiple columns at a time? `df1 = df.groupby('a')['b','c'].apply(list).reset_index(name='new')` [see the sketch after these comments] – Modem Rakesh goud Feb 10 '21 at 18:29
  • How can I select a subset of the elements of rows? For example, I have some `nan` values in cells and I don't want `.apply(list)` to add all elements including `nan` [see the sketch after these comments]. – Prof.Plague Jun 19 '21 at 18:02
  • Is there a way to do the opposite, where you have the dataframe grouped and you want to ungroup it? – Angel Ortiz May 21 '22 at 14:43
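Two recurring questions in the comments above are how to list-ify several columns at once and how to skip NaN values. A minimal sketch of both, assuming a hypothetical extra column 'c' and a NaN planted in 'b' (the dropna lambda is just one option):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, np.nan, 5, 4, 6],
                   'c': [7, 8, 9, 10, 11, 12]})

# Several columns at once: select them as a list of columns, then agg(list).
# (The df.groupby('a')['b','c'] form from the comment is deprecated.)
multi = df.groupby('a')[['b', 'c']].agg(list)

# Skipping NaN per group: drop it inside the aggregation function.
no_nan = df.groupby('a')['b'].agg(lambda s: s.dropna().tolist())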

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
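The dict form of agg also lets you mix a different aggregation per column in one pass. A minimal sketch, assuming a hypothetical second numeric column 'c':

import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [10, 20, 30, 40, 50, 60]})

# List-ify 'b' while summing 'c' in the same groupby.
out = df.groupby('a').agg({'b': lambda x: list(x), 'c': 'sum'})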

Anamika Modi
  • `lambda args: f(args)` is equivalent to `f` – BallpointBen Oct 11 '18 at 17:43
  • Actually, just `agg(list)` is enough. Also see [here](https://stackoverflow.com/a/55839464/4909087). – cs95 Jun 07 '19 at 15:31
  • !! I was just googling for some syntax and realised my own notebook was referenced for the solution lol. Thanks for linking this. Just to add, since 'list' is not a series function, you will have to either use it with apply `df.groupby('a').apply(list)` or use it with agg as part of a dict `df.groupby('a').agg({'b':list})`. You could also use it with a lambda (which I recommend), since you can do so much more with it. Example: `df.groupby('a').agg({'c':'first', 'b': lambda x: x.unique().tolist()})`, which lets you apply a series function to col c and a unique-then-list function to col b. – Akshay Sehgal Apr 08 '20 at 01:11

If performance is important, go down to the numpy level:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    # Sort by the key column so each group's rows are contiguous.
    keys, values = df.sort_values('a').values.T
    # np.unique with return_index=True gives the first position of each key.
    ukeys, index = np.unique(keys, True)
    # Split the values at those positions, one subarray per group.
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
B. M.
  • How could we use this if we are grouping by two or more keys, e.g. with `.groupby([df.index.month, df.index.day])` instead of just `.groupby('a')`? – ru111 Mar 12 '19 at 17:35
  • @ru111 I have added an [answer](https://stackoverflow.com/a/66018377/5105118) below which you might want to check out. It also handles grouping with multiple columns – v.tralala Feb 02 '21 at 22:18

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6], 'c': [3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired by Anamika Modi's answer. Thank you!

Markus Dutschke

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object
cs95
  • Are the methods above **guaranteed** to preserve order? Meaning that elements from the same row (but different columns, _b_ and _c_ in your code above) will have the same index in the resulting lists? – Kai May 02 '19 at 15:51
  • @Kai oh, good question. Yes and no. GroupBy sorts the output by the grouper key values. However, the sort is generally stable, so the relative ordering per group is preserved. To disable the sorting behavior entirely, use `groupby(..., sort=False)` [see the sketch after these comments]. Here, it'd make no difference since I'm grouping on column A, which is already sorted. – cs95 May 02 '19 at 16:37
  • I'm sorry, I don't understand your answer. Can you explain in more detail? I think this deserves its own question. – Kai May 02 '19 at 17:28
  • This is a very good answer! Is there also a way to make the values of the list unique? Something like `.agg(pd.Series.tolist.unique)` maybe? – Federico Gentile Dec 05 '19 at 12:24
  • @FedericoGentile you can use a lambda. Here's one way: `df.groupby('a')['b'].agg(lambda x: list(set(x)))` – cs95 Dec 05 '19 at 14:48
  • @cs95 Hi ColdSpeed! Is there a way to aggregate the list values of all the columns into one column? Instead of separate `b` and `c` columns, just create a column that has all the values. Thank you. – Moondra Jun 30 '20 at 22:44
  • @Moondra Not sure, perhaps you want `df.groupby('a').agg(lambda x: x.to_numpy().ravel().tolist())` – cs95 Jun 30 '20 at 23:07
  • @FedericoGentile `df.groupby('a')['b'].agg("unique")` – 00schneider Jul 26 '21 at 17:01
  • This is the best answer. – kimonili Apr 20 '22 at 11:07
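A minimal sketch of the `sort=False` behavior described in the comments, using a small frame of my own (chosen so the keys are not already sorted):

df2 = pd.DataFrame({'a': ['C', 'A', 'C', 'B'], 'b': [1, 2, 3, 4]})

# With sort=False the group keys appear in first-seen order instead of sorted order.
df2.groupby('a', sort=False)['b'].agg(list)

a
C    [1, 3]
A       [2]
B       [4]
Name: b, dtype: object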

As you were saying, the groupby method of a pd.DataFrame object can do the job.

Example

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(zip(L,N),columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4
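To get from here to the lists the question asks for, one option (a sketch built on the groups mapping shown above) is:

 # groups.groups maps each label to the row index of its group
 pd.Series({key: df.loc[idx, 'N'].tolist() for key, idx in groups.groups.items()})

 A       [1, 2]
 B    [5, 5, 4]
 C          [6]
 dtype: object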
Acorbe

It is time to use agg instead of apply.

When

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

If you want multiple columns stacked into lists, the result is a pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

If you want a single column as a list, the result is a pd.Series:

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

Note: producing a pd.DataFrame result is about 10x slower than producing a pd.Series result when you only aggregate a single column, so reserve it for the multi-column case.

Mithril

Just a supplement. pandas.pivot_table is more universal and seems more convenient:

"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
                    'b':[1,2,5,5,4,6],
                    'c':[1,2,1,1,1,6]})
print(df)

   a  b  c
0  A  1  1
1  A  2  2
2  B  5  1
3  B  5  1
4  B  4  1
5  C  6  6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
           b       c
a                   
A     [1, 2]  {1, 2}
B  [5, 5, 4]     {1}
C        [6]     {6}
Sean.H

If you are looking for unique lists while grouping multiple columns, this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Vanshika

Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1). This solution can also deal with multi-indices.

However, this is not heavily tested; use with caution.

If performance is important, go down to the numpy level:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})


def f_multi(df,col_names):
    if not isinstance(col_names,list):
        col_names = [col_names]
        
    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (= keys) and data columns (= vals)
    keys = values[col_idcs,:]
    vals = values[other_col_idcs,:]
    
    # list of tuple of key pairs
    multikeys = list(zip(*keys))
    
    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)
    
    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #    rows = number of non-grouped data columns
    #    cols = number of data points grouped into that unique key pair
    
    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names) 

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1] # first entries are the subarrays from above 
        col_name = tup[-1]  # last entry is data-column name
        
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2

Tests:

In [227]: %timeit f_multi(df, ['a','d'])

2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df.groupby(['a','d']).agg(list)

4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Results:

For random seed 0, one gets a dataframe with an ('a', 'd') MultiIndex and one list-valued column per non-grouped column. (Screenshot of the result omitted.)

v.tralala

The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for named aggregation.

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
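Note that 'unique' deduplicates and returns numpy arrays rather than lists. If you want plain lists with duplicates preserved, the same named-aggregation syntax also accepts list (a sketch, assuming the sample frame with columns b and c):

df.groupby('a').agg(b=('b', list), c=('c', list))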
Metrd

Let us use df.groupby with tolist and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object
BENY

Sorting consumes O(n log(n)) time, which is the most time-consuming operation in the solutions suggested above.

For a simple solution (a single column), pd.Series.to_list would work and can be considered more efficient, unless other frameworks are considered.

e.g.

import pandas as pd
from string import ascii_lowercase
import random

def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])

df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})


%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})

For 20 million records, it takes about 17.2 seconds, compared to about 19.2 seconds for apply(list) and about 20.6 seconds for a lambda function.

Shashank

Here I have joined the grouped elements with "|" as a separator:

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0    A         1
    1    A         2
    2    B         5
    3    B         5
    4    B         4
    5    C         6

    df.dropna(inplace=True)
    df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
    print(df.columns)
    # join() needs strings, so cast the column before concatenating
    df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
         Keywords
    Area
    a         1|2
    b       5|5|4
    c           6
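If you later need actual lists rather than delimited strings, a quick sketch of splitting the joined column back out (assuming the df_op frame above):

    # str.split reverses the join, giving lists like ['1', '2']
    df_op['Keywords'] = df_op['Keywords'].str.split('|')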
Ganesh Kharad

Answer based on @EdChum's comment on his answer. The comment is this:

groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think 

Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million rows, as mentioned in the question.

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column 
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))

# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']

# Now create final list_b column, using min and max indexes for each category of a and filtering list of b. 
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)

print(gp_df.shape)
gp_df.head()

The above code takes 2 minutes for 20 million rows and 500k categories in the first column.

Abhilash Awasthi

Just to add to the previous answers: in my case, I want the list and other functions like min and max. The way to do that is:

df = pd.DataFrame({
    'a':['A','A','B','B','B','C'], 
    'b':[1,2,5,5,4,6]
})

df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})

# then flatten and rename if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'}, inplace=True)
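As a sketch of an alternative, named aggregation (pandas >= 0.25) avoids the flatten-and-rename step entirely; the output column names below are my own choice:

df.groupby('a').agg(b_min=('b', 'min'), b_max=('b', 'max'), b_list=('b', list))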
Niyaz