I currently have a dataset that I want to squash. For example, imagine that one column is mother
and the other is child
. Mothers might have more than one child, but each child only has one mother. So in my ideal output, there would be one row per mother, with the mother as one column and a list of children as the other. My current data instead has one row per child, with the mother duplicated.
I can do this in R, based on this very useful blog post. So for example:
library(data.table)
dt <- data.table(mother = c("Penny", "Penny", "Anya", "Sam", "Sam", "Sam"),
child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"))
dt[, children := .(list(unique(child))), by = mother]
dt[, child := NULL]
dt <- unique(dt, by = "mother")
My output looks like this:
mother children
1: Penny Violet,Prudence
2: Anya Erika
3: Sam Jake,Wolf,Red
However, I am working in Python with pandas and I can't figure out something equivalent. I can see that it is possible to have a list in a pandas dataframe cell, but I'm not sure how to manipulate the existing data to produce this. Perhaps pandas is not the right thing to use but it seems like it would be.
Here's my dataframe as things stand:
import pandas as pd
data = {'mother':["Penny", "Penny", "Anya", "Sam", "Sam", "Sam"],'child':["Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"]}
df = pd.DataFrame(data)
Apologies if this is basic - I am new to Python and can't find the answer on here.