Assuming we have this table:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 4, 2, 6, 1],
                   'Name': ['led', 'peter', 'james', 'ellie', 'mako', 'levi', 'kent'],
                   'food': ['apples', 'oranges', 'banana', 'carrots', 'carrots', 'mango', 'banana'],
                   'color': ['red', 'blue', 'pink', 'red', 'red', 'purple', 'orange']})
+----+-------+---------+--------+
| id | name | food | color |
+----+-------+---------+--------+
| 1 | led | apples | red |
| 2 | peter | oranges | blue |
| 1 | james | banana | pink |
| 4 | ellie | carrots | red |
| 2 | mako | carrots | red |
| 6 | levi | mango | purple |
| 1 | kent | banana | orange |
+----+-------+---------+--------+
The goal is to group by id and append each duplicate row as a new set of columns on the first row for that id. The expected output looks like this:
+----+-------+---------+--------+-------+---------+--------+-------+--------+--------+
| id | name | food | color | name2 | food2 | color2 | name3 | food3 | color3 |
+----+-------+---------+--------+-------+---------+--------+-------+--------+--------+
| 1 | led | apples | red | james | banana | pink | kent | banana | orange |
| 2 | peter | oranges | blue | mako | carrots | red | | | |
| 4 | ellie | carrots | red | | | | | | |
| 6 | levi | mango | purple | | | | | | |
+----+-------+---------+--------+-------+---------+--------+-------+--------+--------+
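One way to produce exactly this layout is to number each row within its ID group with cumcount and unstack on that counter; a minimal sketch (the occurrence/wide names and the explicit column reordering are mine, not from the original code):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 4, 2, 6, 1],
                   'Name': ['led', 'peter', 'james', 'ellie', 'mako', 'levi', 'kent'],
                   'food': ['apples', 'oranges', 'banana', 'carrots', 'carrots', 'mango', 'banana'],
                   'color': ['red', 'blue', 'pink', 'red', 'red', 'purple', 'orange']})

# Number each row within its ID group: 0 for the first occurrence, then 1, 2, ...
occurrence = df.groupby('ID').cumcount()

# Pivot so each occurrence becomes its own block of columns
wide = df.set_index(['ID', occurrence]).unstack()

# Reorder occurrence-major (Name, food, color, Name2, food2, color2, ...)
cols = ['Name', 'food', 'color']
order = [(c, i) for i in range(occurrence.max() + 1) for c in cols]
wide = wide[order]

# Flatten the MultiIndex columns: ('Name', 0) -> 'Name', ('Name', 1) -> 'Name2'
wide.columns = [c if i == 0 else f'{c}{i + 1}' for c, i in wide.columns]
wide = wide.reset_index()
```

Because this never stacks the values, missing cells simply stay NaN in their own column instead of shifting neighbouring values around.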
There is existing logic for this, but it gets messed up when some of the columns in a duplicate row are missing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 4, 2, 6, 1],
                   'Name': ['led', 'peter', np.nan, np.nan, 'mako', 'levi', 'kent'],
                   'food': [np.nan, 'oranges', 'banana', 'carrots', 'carrots', 'mango', 'banana'],
                   'color': ['red', 'blue', 'pink', 'red', np.nan, 'purple', 'orange']})

# long format: one row per (ID, value); stack() silently drops NaN cells
transformed_df = df.set_index('ID').stack().droplevel(1)
# positional counter within each ID group
counter = transformed_df.groupby('ID').cumcount().to_numpy()
transformed_df.index = [transformed_df.index, counter]
# back to wide format: one generically named column per position
transformed_df = transformed_df.unstack().add_prefix('Col').reset_index()
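The breakage comes from stack() dropping the NaN cells, so a row with a missing value contributes fewer entries and every later value shifts into the wrong Col position. A version-independent way to keep the existing logic intact is to fill NaN with a sentinel before stacking and restore it afterwards; a sketch (the '__missing__' sentinel is my own placeholder, any string not present in the data works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 4, 2, 6, 1],
                   'Name': ['led', 'peter', np.nan, np.nan, 'mako', 'levi', 'kent'],
                   'food': [np.nan, 'oranges', 'banana', 'carrots', 'carrots', 'mango', 'banana'],
                   'color': ['red', 'blue', 'pink', 'red', np.nan, 'purple', 'orange']})

# Fill NaN with a sentinel so stack() has nothing to drop and
# every row contributes exactly three values, keeping positions aligned
transformed = df.set_index('ID').fillna('__missing__').stack().droplevel(1)
counter = transformed.groupby('ID').cumcount().to_numpy()
transformed.index = [transformed.index, counter]
out = transformed.unstack().add_prefix('Col').reset_index()

# Restore the real NaN values
out = out.replace('__missing__', np.nan)
```

With alignment preserved, each ID's values stay in row-major order (Name, food, color of the first occurrence, then the second, and so on), with NaN sitting in its own Col slot.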