4

hi I have a huge data set as below

sample data:-

customerId  products
0            20
1           2|2|23|
0           111|29|11|11|33|11|33
3           164|227
1           2|2

Now I want to convert this data set as below

    customerId  products        purchase_count
     0              20           1
     0              111          1
     0              29           1
     0              11           3
     0              33           2
     1              2            4
     1              23           1
     3              164          1
     3              227          1

Please help me with this

Rahul Varma
  • 550
  • 5
  • 23

3 Answers3

3

you can use as below:

df.products=df.products.str.split("|")
df_new=pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns).\
                                   replace('',np.nan).dropna()
print(df_new.groupby(['customerId','products'],as_index=False)['products'].\
                   apply(lambda x: x.count()).reset_index(name='purchase_count'))

   customerId products  purchase_count
0           0       11               3
1           0      111               1
2           0       20               1
3           0       29               1
4           0       33               2
5           1        2               4
6           1       23               1
7           3      164               1
8           3      227               1
anky
  • 74,114
  • 11
  • 41
  • 70
3

This is unnesting problem

df['products']=df.products.str.split('|')
s=unnesting(df,['products'])
s.groupby(s.columns.tolist()).size()
products  customerId
11        0             3
111       0             1
164       3             1
2         1             4
20        0             1
227       3             1
23        1             1
29        0             1
33        0             2
dtype: int64

def unnesting(df, explode):
    idx=df.index.repeat(df[explode[0]].str.len())
    df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
    df1.index=idx
    return df1.join(df.drop(explode,1),how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
0

Using str.split with unnest and groupby.count:

df['products'] = df['products'].str.rsplit('|')

def wen2(df):
    return pd.DataFrame({'customerId':df['customerId'].repeat(df['products'].str.len()),
                         'products':np.concatenate(df['products'].values)})

df = wen2(df)
df = df.loc[df['products'].ne(''), :]
df = (df.groupby(['customerId','products'])['products'].count()
        .reset_index(name='purchase_count'))

print(df)

   customerId products  purchase_count
0           0       11               3
1           0      111               1
2           0       20               1
3           0       29               1
4           0       33               2
5           1        2               4
6           1       23               1
7           3      164               1
8           3      227               1
Space Impact
  • 13,085
  • 23
  • 48