1

I have a pandas dataframe

Type Name
Cheese Mozarella
Cheese Cheddar
Cheese Pepper Jack
Bread Sourdough
Bread Rye
Pepper Green

I am trying to group by the Type column, collect the list of Names and split that into at most 2 columns. So the output would be:

Type Name_primary Name_alternative
Cheese Mozarella Cheddar
Bread Sourdough Rye
Pepper Green
  • If a type has more than 2 Names, fill the first 2 for Name_primary and Name_alternative
  • If a type has just one Name, the Name_alternative will be null
  • Assumption : The first table has no duplicates and is already ordered in some way.

I got to the point of collecting list df.groupby("Type").Name.apply(list).to_frame() or df.pivot_table(index="Type", values="Name", aggfunc=list)

I could do a lambda apply on a user function that checks the length, and gets (val1, val2) or (val1, None) and create Name_Primary and Name_Alternative columns. But is there a simpler way to do this ?

newpyspark
  • 91
  • 6
  • Does this answer your question? Use the linked answer before your current code. [How to keep first two duplicates in a pandas dataframe?](https://stackoverflow.com/questions/32530856/how-to-keep-first-two-duplicates-in-a-pandas-dataframe) – G. Anderson Mar 01 '23 at 17:33
  • @G.Anderson Not entirely. I am able to get the top 2 values by doing ```df.groupby("Type").Name.nth[:2].to_frame()``` Its the next part to make those possible 2 values in a group into 2 columns that I am trying to be smart about – newpyspark Mar 01 '23 at 17:45

1 Answers1

2

Lets group the dataframe by type and create records corresponding to top two names inside a comprehension

c = ['Name_primary', 'Name_alternative']
df1 = pd.DataFrame({'Type': k, **dict(zip(c, g['Name']))} 
                   for k, g in df.groupby('type', sort=False))

Alternatively you can reshape the dataframe with pivot after selecting top two rows per Type

# create seq counter to identify unique rows per type
df['idx'] = df.groupby('type').cumcount()

# Filter rows where counter < 2 and pivot to reshape
df1 = df.query('idx < 2').pivot(index='type', columns='idx', values='Name')
df1.columns = ['Name_primary', 'Name_alternative']

Result

     Type Name_primary Name_alternative
0  Cheese    Mozarella          Cheddar
1   Bread    Sourdough              Rye
2  Pepper        Green              NaN
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53