
I need your help with a complicated (for me) groupby in pandas. I have the following dataframe:

id        target  feature_id    feature_value
6f281111  0       right-acx_02  5.433747
6f281111  0       right-min     0.097000
a1111628  1       left_ss_01    -0.650741
a1111628  1       right-wcs     3.359375
6f282222  0       left_l_tf     0.000000
...       ...     ...           ...

For each "id" I have around 400 rows, each representing a different feature. I want to reshape the dataframe into something like this:

index     target  right-acx_02  right-min  left_ss_01  right-wcs  left_l_tf  ...
6f281111  0       5.433747      0.097000   NA          NA         NA         ...
a1111628  1       NA            NA         -0.650741   3.359375   NA         ...
6f282222  0       NA            NA         NA          NA         0.000000   ...
...       ...     ...           ...        ...         ...        ...        ...

This is what I have done so far. It works, but I know it is far from optimal.

import numpy as np
import pandas as pd
from tqdm import tqdm

# Build an empty frame with one row per evaluation_uuid and one column
# per feature_id (plus "user_id"), then fill it cell by cell.
unique_uuids = list(raw_df["evaluation_uuid"].unique())
columns = np.insert(raw_df["feature_id"].unique(), 0, ["user_id"])
dict_feature = dict.fromkeys(columns)
pre_df = {uuid: dict_feature for uuid in unique_uuids}
df_transformed = pd.DataFrame.from_dict(pre_df, orient="index")

# Slow part: one .loc assignment per row of the raw dataframe
for index, row in tqdm(raw_df.iterrows()):
    df_transformed.loc[row["evaluation_uuid"], row["feature_id"]] = row["feature_value"]
    df_transformed.loc[row["evaluation_uuid"], "user_id"] = row["user_id"]

df_transformed = df_transformed.fillna(0)

I imagine there is a vectorized solution to this problem; if you need any details, let me know.

Thank you in advance.

Lord exec
  • `df.pivot(index=['id','target'], columns='feature_id',values='feature_value')` – Chris Jan 03 '21 at 15:12
  • Hi Chris, thank you for the answer; unfortunately, it doesn't work :( I get the following error: ValueError: Index contains duplicate entries, cannot reshape – Lord exec Jan 03 '21 at 15:43
  • Does this answer your question? [Unmelt Pandas DataFrame](https://stackoverflow.com/questions/31306741/unmelt-pandas-dataframe) – Prayson W. Daniel Jan 03 '21 at 16:08

2 Answers

  • pivot() is the simplest way (see the sketch after the output below)
  • if you do not want MultiIndex columns, set_index() and unstack() give you full control
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""id  target  feature_id  feature_value
6f281111    0   right-acx_02    5.433747
6f281111    0   right-min   0.097000
a1111628    1   left_ss_01  -0.650741
a1111628    1   right-wcs   3.359375
6f282222    0   left_l_tf   0.000000"""), sep=r"\s+")

# wide format: one column per feature_id, NaN where a feature is missing
df = df.set_index(["id", "target", "feature_id"]).unstack("feature_id")
# columns are now a MultiIndex ("feature_value", <feature_id>); drop the top level
df.columns = df.columns.droplevel()
df = df.reset_index()

Output:

       id  target  left_l_tf  left_ss_01  right-acx_02  right-min  right-wcs
 6f281111       0        NaN         NaN      5.433747      0.097        NaN
 6f282222       0        0.0         NaN           NaN        NaN        NaN
 a1111628       1        NaN   -0.650741           NaN        NaN   3.359375
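For completeness, the pivot() route from the first bullet is a one-liner applied to the original long-format frame, i.e. before the set_index/unstack above. A minimal sketch (the wide name is just illustrative; pivot() with a list of index columns needs pandas >= 1.1, and it raises ValueError when ('id', 'target', 'feature_id') contains duplicates):

# equivalent one-liner on the long-format frame; raises ValueError
# if ('id', 'target', 'feature_id') combinations are duplicated
wide = df.pivot(index=["id", "target"], columns="feature_id",
                values="feature_value").reset_index()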
Rob Raymond

Since, from the comments, it sounds like @Chris's canonical solution isn't working for you, it may be that you have duplicate rows (based on ('id', 'target', 'feature_id')).

To find them, try:

dupes = df.loc[df.duplicated(subset=['id', 'target', 'feature_id'])]

If dupes is not empty, then you know you have a problem with the data.
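By default, duplicated() only marks the second and later occurrences of each key. To eyeball every row involved in a duplicate group, a small variant sketch using keep=False (dupe_groups is an illustrative name):

# keep=False flags all members of each duplicate group, not just the
# later occurrences, so the conflicting values can be inspected side by side
dupe_groups = df[df.duplicated(subset=["id", "target", "feature_id"], keep=False)]
print(dupe_groups.sort_values(["id", "target", "feature_id"]))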

You can decide how to handle that problem (raise an error? drop all but the first row of each duplicate group? take the mean?). Here is an example where we simply take the mean feature_value for duplicates:

# average feature_value over duplicates, then reshape to wide format
df2 = df.groupby(['id', 'target', 'feature_id']).mean().squeeze().unstack('feature_id')

This should work even in the presence of dupes.
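As a variant, pivot_table() collapses the groupby and unstack into a single call, and its default aggregation is already the mean over duplicates. A minimal sketch, assuming the same column names:

# pivot_table aggregates duplicated (id, target, feature_id) entries
# (here with the mean) instead of raising like pivot() does
df2 = df.pivot_table(index=["id", "target"], columns="feature_id",
                     values="feature_value", aggfunc="mean")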

Pierre D