I have csv file like this:
A,B,C,X
a,a,a,1.0
a,a,a,2.1
a,b,b,1.2
a,b,b,2.4
a,b,b,3.6
b,c,c,1.1
b,c,d,1.0
(A, B, C)
is a "primary key" in this dataset, that means this set of columns should be unique. What I need to do is to find duplicates and present associated values (X
column) in separate columns, like this:
A,B,C,X1,X2,X3
a,a,a,1.0,2.1,
a,b,b,1.2,2.4,3.6
I somehow know how to find duplicates and aggregate X
values into tuples:
df = data.groupby(['A', 'B', 'C']).filter(lambda group: len(group) > 1).groupby(['A', 'B', 'C']).aggregate(tuple)
This is basically what I need, but I struggle with transforming it further.
I don't know how many duplicates for a given key I have in my data, so I need to find some max value and compute columns:
df['items'] = df['X'].apply(lambda x: len(x))
columns = [f'x_{i}' for i in range(1, df['X'].max() + 1)]
and then create new dataframe with new columns:
df2 = pd.DataFrame(df['RATE'].tolist(), columns=columns)
But at this point I lost index :shrug:
This page on Pandas docs suggests I should use something like this:
df.pivot(columns=columns, values=['X'])
because df
already contains an index, but I get this (confusing) error:
KeyError: "None of [Index(['x_1', 'x_2'], dtype='object')] are in the [columns]"
What am I missing here?