Group a dataframe by a column and concatenate strings in another

Question

I know this should be easy but it's driving me mad...

I am trying to turn a dataframe into a grouped dataframe.

df outputs:

    Postcode    Borough             Neighbourhood
0   M3A         North York          Parkwoods
1   M4A         North York          Victoria Village
2   M5A         Downtown Toronto    Harbourfront
3   M5A         Downtown Toronto    Regent Park
4   M6A         North York          Lawrence Heights
5   M6A         North York          Lawrence Manor
6   M7A         Queen's Park        Not assigned
7   M9A         Etobicoke           Islington Avenue
8   M1B         Scarborough         Rouge
9   M1B         Scarborough         Malvern
10  M3B         North York          Don Mills North
...

I want a grouped dataframe with one row per Postcode, where all the Neighbourhood values for that Postcode are joined into a single comma-separated string, something like:

    Postcode    Borough             Neighbourhood
0   M3A         North York          Parkwoods
1   M4A         North York          Victoria Village
2   M5A         Downtown Toronto    Harbourfront, Regent Park
...

I am trying to use:

df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))

But this does not seem to change anything: when I look at df after running it, I still get the original dataframe.

If I use:

df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))

it turns df into some other object instead of a dataframe?

https://stackoverflow.com/questions/18138693/replicating-group-concat-for-pandas-dataframe — Matthew Barlowe, May 30 '19 at 17:45
Thanks, looks like I'm on the right track but I still can't get the dataframe to appear correct. `df.groupby('Postcode').agg({'Neighbourhood':lambda x:', '.join(x)})` and then `df` still returns an ungrouped dataframe... — M A, May 30 '19 at 18:40
if you don't assign the new dataframe to a new variable it won't. I'm pretty sure group by isn't done in place — Matthew Barlowe, May 30 '19 at 18:41
So it looks like all that will do is create a new dataframe with Postcode as the index but the Neighbourhood looks correct.. need to figure out how to get it back into the original dataframe now.. — M A, May 30 '19 at 18:56
add `.reset_index()` to the end of your chain. Docs can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) — Matthew Barlowe, May 30 '19 at 18:57

Matthew Barlowe · Accepted Answer · 2019-05-30T19:09:30.333

Use this code:

new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()

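reset_index() moves the group-by columns out of the index and back into the dataframe as regular columns, and creates a new default integer index. groupby does not modify df in place, so the result has to be assigned to a variable (new_df here).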
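
To make this easy to try, here is a minimal sketch. It assumes a small sample dataframe typed in by hand from the first few rows in the question, and the name as_series is only illustrative. It shows that selecting a single column after groupby gives back a Series (probably the "object" seen in the question), while grouping on both columns and calling reset_index() gives a regular dataframe:

```python
import pandas as pd

# Sample data copied by hand from the first rows shown in the question
df = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights", "Lawrence Manor"],
})

# Selecting one column and applying a function returns a Series
# indexed by Postcode, not a dataframe
as_series = df.groupby(["Postcode"])["Neighbourhood"].apply(lambda strs: ", ".join(strs))
print(type(as_series))

# Grouping on both columns keeps Borough; reset_index() turns the
# group keys back into ordinary columns with a fresh integer index
new_df = (
    df.groupby(["Postcode", "Borough"])
      .agg({"Neighbourhood": lambda x: ", ".join(x)})
      .reset_index()
)
print(new_df)

# The original df is untouched, because groupby is not done in place
print(len(df), len(new_df))
```

Grouping by both Postcode and Borough only works cleanly here because each Postcode in the sample maps to a single Borough; if a Postcode could span several Boroughs, it would be split into one row per pair.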

edited May 30 '19 at 19:09

answered May 30 '19 at 19:00

Thanks! how would I keep the "Borough" column as well? – M A May 30 '19 at 19:08
Edited answer to reflect that – Matthew Barlowe May 30 '19 at 19:09
