
I need your help with a complicated (for me) groupby in pandas. I have the following dataframe:

id        target  feature_id    feature_value
6f281111  0       right-acx_02  5.433747
6f281111  0       right-min     0.097000
a1111628  1       left_ss_01    -0.650741
a1111628  1       right-wcs     3.359375
6f282222  0       left_l_tf     0.000000
...       ...     ...           ...

For each "id" I have around 400 rows, each representing a different feature. I want to reshape the dataframe into something like this:

index     target  right-acx_02  right-min  left_ss_01  right-wcs  left_l_tf  ...
6f281111  0       5.433747      0.097000   NA          NA         NA         ...
a1111628  1       NA            NA         -0.650741   3.359375   NA         ...
6f282222  0       NA            NA         NA          NA         0.000000   ...
...       ...     ...           ...        ...         ...        ...        ...

This is what I have done so far. It works, but I know it is far from optimal.

import numpy as np
import pandas as pd
from tqdm import tqdm

# Build an empty frame with one row per evaluation_uuid and one column
# per feature_id (plus "user_id"), then fill it cell by cell.
unique_uuids = list(raw_df["evaluation_uuid"].unique())
columns = np.insert(raw_df["feature_id"].unique(), 0, ["user_id"])
dict_feature = dict.fromkeys(columns)
pre_df = {uuid: dict_feature for uuid in unique_uuids}
df_transformed = pd.DataFrame.from_dict(pre_df, orient="index")

# Slow part: one .loc assignment per row of the raw dataframe
for index, row in tqdm(raw_df.iterrows()):
    df_transformed.loc[row["evaluation_uuid"], row["feature_id"]] = row["feature_value"]
    df_transformed.loc[row["evaluation_uuid"], "user_id"] = row["user_id"]

df_transformed = df_transformed.fillna(0)

I imagine there is a vectorized solution to this problem; if you need any details, let me know.

Thank you in advance.

Lord exec
  • `df.pivot(index=['id','target'], columns='feature_id',values='feature_value')` – Chris Jan 03 '21 at 15:12
  • Hi Chris, thank you for the answer; unfortunately, it doesn't work :( I get the following error: ValueError: Index contains duplicate entries, cannot reshape – Lord exec Jan 03 '21 at 15:43
  • Does this answer your question? [Unmelt Pandas DataFrame](https://stackoverflow.com/questions/31306741/unmelt-pandas-dataframe) – Prayson W. Daniel Jan 03 '21 at 16:08

2 Answers

  • pivot() is the simplest way (see the sketch after the output below)
  • if you do not want MultiIndex columns, set_index() and unstack() give you full control
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""id  target  feature_id  feature_value
6f281111    0   right-acx_02    5.433747
6f281111    0   right-min   0.097000
a1111628    1   left_ss_01  -0.650741
a1111628    1   right-wcs   3.359375
6f282222    0   left_l_tf   0.000000"""), sep=r"\s+")

# wide format: one column per feature_id, NaN where a feature is missing
df = df.set_index(["id", "target", "feature_id"]).unstack("feature_id")
# columns are now a MultiIndex ("feature_value", <feature_id>); drop the top level
df.columns = df.columns.droplevel()
df = df.reset_index()

Output:

       id  target  left_l_tf  left_ss_01  right-acx_02  right-min  right-wcs
 6f281111       0        NaN         NaN      5.433747      0.097        NaN
 6f282222       0        0.0         NaN           NaN        NaN        NaN
 a1111628       1        NaN   -0.650741           NaN        NaN   3.359375
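For completeness, the pivot() route from the first bullet is a one-liner applied to the original long-format frame, i.e. before the set_index/unstack above. A minimal sketch (the wide name is just illustrative; pivot() with a list of index columns needs pandas >= 1.1, and it raises ValueError when ('id', 'target', 'feature_id') contains duplicates):

# equivalent one-liner on the long-format frame; raises ValueError
# if ('id', 'target', 'feature_id') combinations are duplicated
wide = df.pivot(index=["id", "target"], columns="feature_id",
                values="feature_value").reset_index()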
Rob Raymond

Since, from the comments, it sounds like @Chris's canonical solution isn't working for you, it may be that you have duplicate rows (based on ('id', 'target', 'feature_id')).

To find them, try:

dupes = df.loc[df.duplicated(subset=['id', 'target', 'feature_id'])]

If dupes is not empty, then you know you have a problem with the data.
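By default, duplicated() only marks the second and later occurrences of each key. To eyeball every row involved in a duplicate group, a small variant sketch using keep=False (dupe_groups is an illustrative name):

# keep=False flags all members of each duplicate group, not just the
# later occurrences, so the conflicting values can be inspected side by side
dupe_groups = df[df.duplicated(subset=["id", "target", "feature_id"], keep=False)]
print(dupe_groups.sort_values(["id", "target", "feature_id"]))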

You can decide how to handle that problem (raise an error? drop all but the first row of each duplicate group? take the mean?). Here is an example where we simply take the mean feature_value for duplicates:

# average feature_value over duplicates, then reshape to wide format
df2 = df.groupby(['id', 'target', 'feature_id']).mean().squeeze().unstack('feature_id')

This should work even in the presence of dupes.
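As a variant, pivot_table() collapses the groupby and unstack into a single call, and its default aggregation is already the mean over duplicates. A minimal sketch, assuming the same column names:

# pivot_table aggregates duplicated (id, target, feature_id) entries
# (here with the mean) instead of raising like pivot() does
df2 = df.pivot_table(index=["id", "target"], columns="feature_id",
                     values="feature_value", aggfunc="mean")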

Pierre D