1

I currently have a dataset that I want to squash. For example, imagine that one column is mother and the other is child. Mothers might have more than one child, but each child only has one mother. So in my ideal output, there would be one row per mother, with the mother as one column and a list of children as the other. My current data instead has one row per child, with the mother duplicated.

I can do this in R, based on this very useful blog post. So for example:

library(data.table)
dt <- data.table(mother = c("Penny", "Penny", "Anya", "Sam", "Sam", "Sam"), 
                 child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"))
dt[, children := .(list(unique(child))), by = mother]
dt[, child := NULL]
dt <- unique(dt, by = "mother")

My output looks like this:

   mother        children
1:  Penny        Violet,Prudence
2:  Anya         Erika
3:  Sam          Jake,Wolf,Red

However, I am working in Python with pandas and I can't figure out something equivalent. I can see that it is possible to have a list in a pandas dataframe cell, but I'm not sure how to manipulate the existing data to produce this. Perhaps pandas is not the right thing to use but it seems like it would be.

Here's my dataframe as things stand:

import pandas as pd
data = {'mother':["Penny", "Penny", "Anya", "Sam", "Sam", "Sam"],'child':["Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"]}
df = pd.DataFrame(data)

Apologies if this is basic - I am new to Python and can't find the answer on here.

M--
  • 25,431
  • 8
  • 61
  • 93
Jaccar
  • 1,720
  • 17
  • 46

2 Answers2

3

You can GroupBy the column mother and aggregate with a join using ',' as a separator:

df.groupby('mother').agg(','.join).reset_index()

   mother        child
0   Anya            Erika
1  Penny  Violet,Prudence
2    Sam    Jake,Wolf,Red

For lists instead of strings you can do:

df.groupby('mother').child.apply(list).reset_index()

   mother           child
0   Anya             [Erika]
1  Penny  [Violet, Prudence]
2    Sam   [Jake, Wolf, Red] 
yatu
  • 86,083
  • 12
  • 84
  • 139
  • Thanks! Will values in the `child` column be treated as a list? It looks to me like this would combine them into a string, so the type is an object rather than a list. I still want to be able to manipulate the children-data in a list-like way, so via indexing for example...does that make sense? – Jaccar Mar 15 '19 at 16:20
0

You can also try this pivot solution. This will create a dataframe with the name of children in different column

df = pd.DataFrame({'mother':("Penny", "Penny", "Anya", "Sam", "Sam", "Sam"),'child': ("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red")}) 


df_group1 = df.pivot(index  = 'mother', values = 'child', columns = 'child').reset_index()
df_group1.columns = range(df_group1.shape[1])
Devarshi Mandal
  • 703
  • 8
  • 16