First idea is use GroupBy.cumcount
for counter of duplicated values of col1
for new columns names and reshape by Series.unstack
:
df = (df.set_index(['col1',df.groupby('col1').cumcount()])['col2']
.unstack(fill_value='')
.reset_index())
df.columns = [f'col{x}' for x in range(1, len(df.columns) + 1)]
print (df)
col1 col2 col3 col4 col5
0 file1 text_0 text_1 text_2
1 file2 text_0 text_1 text_2 text_3
2 file3 text_0
Or create Series
of list and avoid use apply(pd.Series)
, because slow, better is use DataFrame constructor:
s = df.groupby('col1')['col2'].apply(list)
df = pd.DataFrame(s.tolist(), index=s.index).reset_index().fillna('')
df.columns = [f'col{x}' for x in range(1, len(df.columns) + 1)]
print (df)
col1 col2 col3 col4 col5
0 file1 text_0 text_1 text_2
1 file2 text_0 text_1 text_2 text_3
2 file3 text_0
Alternative:
s = df.groupby('col1')['col2'].apply(list)
L = [[k] + v for k, v in s.items()]
df = pd.DataFrame(L).fillna('').rename(columns=lambda x: f'col{x+1}')
print (df)
col1 col2 col3 col4 col5
0 file1 text_0 text_1 text_2
1 file2 text_0 text_1 text_2 text_3
2 file3 text_0