Simplify pivot in pandas

Question

I have a question that concerns the use of the pivot function in pandas. I have a table (df_init) with a bunch of customer Ids (7000 different Ids) and the product codes they purchased

CST_ID	PROD_CODE
11111	1234
11111	2345
11111	5425
11111	9875
22222	2345
22222	9251
22222	1234
33333	6542
33333	7498

Each Id can be repeated at most 4 time in the table, but can appear less than 4 times (e,g, 22222 and 33333). I want to reorganize that table as follows (df_fin)

CST_ID	PROD_1	PROD_2	PROD_3	PROD_4
11111	1234	2345	5425	9875
22222	2345	9251	1234	NaN
33333	6542	7498	NaN	NaN

Good news is, I have found a way to do so. Bad news I am not satisfied as it loops over the Customer Ids nd takes a while. Namely I count the occurrences of a certain Id while looping over the column and add that to a list, then append this list as a new variable to df_init

to_append = []
for index in range(len(df_init)):
    temp = df_init.iloc[:index+1]['CST_ID'] == df_init.iloc[index]['CST_ID'] # ['CST_ID']== df_init.iloc[index]['CST_ID']]
    counter = sum(list(temp))
    to_append.append(counter)

df_init['Product_number'] = to_append

Afterwards I pivot and rename the columns

df_fin = df_init.pivot(index='CST_ID', columns='Product_number', values='PROD_CODE').rename_axis(None).reset_index()
df_fin.columns=['CST_ID', 'pdt1', 'pdt2', 'pdt3', 'pdt4']

Of course this solution works just fine, but looping in order to create the column which I use for the columns specification of the Pivot takes time. Hence I was wondering if there was a better solution (perhapes embedded already in Pandas or in the Pivot method) to do so.

Thanks to anyone who is willing to participate

Best

Q/A #10 in the dup link. – Quang Hoang Jul 12 '22 at 09:04 — Quang Hoang, Jul 12 '22 at 09:04

score 2 · Answer 1 · answered Jul 12 '22 at 08:22

You can vectorize the part creating the pivoting column as below. groupby + cumcount generates the increasing number by the CST_ID.

df_fin = df_init.assign(key="PROD_" + (df_init.groupby("CST_ID").cumcount()+1).astype(str))
df_fin = df_fin.pivot(index="CST_ID", columns="key", values="PROD_CODE")
df_fin

#key    PROD_1  PROD_2  PROD_3  PROD_4
#CST_ID             
#11111  1234.0  2345.0  5425.0  9875.0
#22222  2345.0  9251.0  1234.0  NaN
#33333  6542.0  7498.0  NaN     NaN

Thanks a lot. I kept on thinking on how to solve it within the Pivot function and completely misregarded the possibility of grouping. — JacquesLeen, Jul 12 '22 at 08:27

score 0 · Answer 2 · answered Jul 12 '22 at 08:39

For large dataframes i would have done, but above solution works nice

import pandas

df = pandas.DataFrame(
    {
        "CST_ID": [11111, 11111, 11111, 11111, 22222, 22222, 22222, 22222, 33333, 33333, 33333, 33333],
        "PROD_CODE": [random.randint(1, 100) for _ in range(12)]
    }
)
df["Product_number"] = df.groupby(['CST_ID']).cumcount() + 1
df = df.pivot(index='CST_ID', columns='Product_number', values='PROD_CODE')
df.columns = ["PROD_%s" % _ for _ in df.columns]
#   PROD_1  PROD_2  PROD_3  PROD_4
#CST_ID             
#11111  98  11  13  38
#22222  33  13  3   61
#33333  86  35  93  23

Simplify pivot in pandas

2 Answers2