0

I have a question that concerns the use of the pivot function in pandas. I have a table (df_init) with a bunch of customer Ids (7000 different Ids) and the product codes they purchased

CST_ID PROD_CODE
11111 1234
11111 2345
11111 5425
11111 9875
22222 2345
22222 9251
22222 1234
33333 6542
33333 7498

Each Id can be repeated at most 4 time in the table, but can appear less than 4 times (e,g, 22222 and 33333). I want to reorganize that table as follows (df_fin)

CST_ID PROD_1 PROD_2 PROD_3 PROD_4
11111 1234 2345 5425 9875
22222 2345 9251 1234 NaN
33333 6542 7498 NaN NaN

Good news is, I have found a way to do so. Bad news I am not satisfied as it loops over the Customer Ids nd takes a while. Namely I count the occurrences of a certain Id while looping over the column and add that to a list, then append this list as a new variable to df_init

to_append = []
for index in range(len(df_init)):
    temp = df_init.iloc[:index+1]['CST_ID'] == df_init.iloc[index]['CST_ID'] # ['CST_ID']== df_init.iloc[index]['CST_ID']]
    counter = sum(list(temp))
    to_append.append(counter)

df_init['Product_number'] = to_append

Afterwards I pivot and rename the columns

df_fin = df_init.pivot(index='CST_ID', columns='Product_number', values='PROD_CODE').rename_axis(None).reset_index()
df_fin.columns=['CST_ID', 'pdt1', 'pdt2', 'pdt3', 'pdt4']

Of course this solution works just fine, but looping in order to create the column which I use for the columns specification of the Pivot takes time. Hence I was wondering if there was a better solution (perhapes embedded already in Pandas or in the Pivot method) to do so.

Thanks to anyone who is willing to participate

Best

JacquesLeen
  • 109
  • 7

2 Answers2

2

You can vectorize the part creating the pivoting column as below. groupby + cumcount generates the increasing number by the CST_ID.

df_fin = df_init.assign(key="PROD_" + (df_init.groupby("CST_ID").cumcount()+1).astype(str))
df_fin = df_fin.pivot(index="CST_ID", columns="key", values="PROD_CODE")
df_fin

#key    PROD_1  PROD_2  PROD_3  PROD_4
#CST_ID             
#11111  1234.0  2345.0  5425.0  9875.0
#22222  2345.0  9251.0  1234.0  NaN
#33333  6542.0  7498.0  NaN     NaN
Kota Mori
  • 6,510
  • 1
  • 21
  • 25
  • Thanks a lot. I kept on thinking on how to solve it within the Pivot function and completely misregarded the possibility of grouping. – JacquesLeen Jul 12 '22 at 08:27
0

For large dataframes i would have done, but above solution works nice

import pandas

df = pandas.DataFrame(
    {
        "CST_ID": [11111, 11111, 11111, 11111, 22222, 22222, 22222, 22222, 33333, 33333, 33333, 33333],
        "PROD_CODE": [random.randint(1, 100) for _ in range(12)]
    }
)
df["Product_number"] = df.groupby(['CST_ID']).cumcount() + 1
df = df.pivot(index='CST_ID', columns='Product_number', values='PROD_CODE')
df.columns = ["PROD_%s" % _ for _ in df.columns]
#   PROD_1  PROD_2  PROD_3  PROD_4
#CST_ID             
#11111  98  11  13  38
#22222  33  13  3   61
#33333  86  35  93  23
Devyl
  • 565
  • 3
  • 8