2

I have a super-large dataframe of customers, item categories and their price. I would like to do some initial investigations:

  • Identify the top e.g n=5 customers based on their TOTAL spending.
  • for each of those customer, identify the top categories that they spend.
  • Then possibly make a plot on descending order showing the top customer with their name as X and their spending as Y. For each, how to show their shopping categories?

this would require to pivot and sort. This is a sample-data generator, thanks to here .

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed(42)
n = 20

cols = np.array(['cust', 'cat'])
arr1 = (np.random.randint(5, size=(n, 2)) // [2, 1]).astype(str)
df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 1).round(2)).add_prefix('val')
)
print(df)

df.pivot_table(index=['cust'],values=['val0'],aggfunc=[np.sum])

df.pivot_table(index=['cust','cat'],values=['val0'],aggfunc=[np.size,np.sum])

# the order according the previous line should be cust1,cust0,cust2. How to do? The following is the desired output in this case.

                size sum
                val0 val0
cust    cat
cust1   cat4    6.0  4.27
        cat3    2.0  1.07
        cat2    2.0  0.98
        cat0    2.0  0.44
        cat1    2.0  0.43

cust0   cat1    1.0  0.94
        cat4    1.0  0.91
        cat2    1.0  0.66
        cat3    1.0  0.03

cust2   cat1    2.0  1.25

Thank you very much!

physiker
  • 889
  • 3
  • 16
  • 30

2 Answers2

1

Here is better aggregate sum for avoid MultiIndex in columns.

First aggregate sum:

s = df.groupby('cust')['val0'].sum()
print (s)
cust
cust0    2.54
cust1    7.19
cust2    1.25
Name: val0, dtype: float64

Then get top values by Series.nlargest:

top5 = s.nlargest(5)
print (top5)
cust
cust1    7.19
cust0    2.54
cust2    1.25
Name: val0, dtype: float64

If necessary filter only top5 values by boolean indexing and isin:

df1 = df[df['cust'].isin(top5.index)].copy()
#print(df1)

For correct ordering cust create ordered categoricals and aggregate by both filtered columns, last sort by first level cust with column size:

df1['cust'] = pd.Categorical(df1['cust'], ordered=True, categories=top5.index)
df2 = (df1.groupby(['cust','cat'])['val0'].agg([np.size,np.sum])
         .sort_values(['cust','size'], ascending=[True, False])
         .reset_index())
print (df2)
    cust   cat  size   sum
0  cust1  cat4   6.0  4.27
1  cust1  cat0   2.0  0.44
2  cust1  cat1   2.0  0.43
3  cust1  cat2   2.0  0.98
4  cust1  cat3   2.0  1.07
5  cust0  cat1   1.0  0.94
6  cust0  cat2   1.0  0.66
7  cust0  cat3   1.0  0.03
8  cust0  cat4   1.0  0.91
9  cust2  cat1   2.0  1.25

Last pivot and plot by DataFrame.plot.bar:

df2.pivot('cust','cat','size').plot.bar()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks @jezrael , however you put by hand the order of categories which are here just 3. How to do automatically when we want e.g top n=100 customers? – physiker Feb 13 '19 at 14:27
  • @physiker - oops, give em some time – jezrael Feb 13 '19 at 14:29
  • 1
    Fantastic @jezrael !! Highly appreciate your time and effort. BTW, I really like your profile code to reveal your email :) – physiker Feb 13 '19 at 15:07
0

Not sure if I understood the exact overview you want, but this is also possible by using groupby and then the agg method like the following:

overview = df.groupby('cust').agg({'val0':'sum',
                                   'cat':'max'}).reset_index().sort_values('val0', ascending=False)

    cust    val0    cat
1   cust1   7.19    cat4
0   cust0   2.54    cat4
2   cust2   1.25    cat1

So this aggregates on customer level and sums their total spending and shows the category which they spent on the most.

You dont have to use the reset_index() function, in that case it yields to the following:

overview = df.groupby('cust').agg({'val0':'sum',
                                   'cat':'max'}).sort_values('val0', ascending=False)


        val0    cat
cust        
cust1   7.19    cat4
cust0   2.54    cat4
cust2   1.25    cat1

And finally if you want the top 5 customers, you can simply use .head(5) since the values are already sorted descending:

overview = df.groupby('cust').agg({'val0':'sum',
                                   'cat':'max'}).sort_values('val0', ascending=False).head(5)
Erfan
  • 40,971
  • 8
  • 66
  • 78