I have a very large dataframe of customers, item categories, and prices. I would like to do some initial investigation:
- Identify the top n (e.g. n=5) customers by their TOTAL spending.
- For each of those customers, identify the categories they spend the most on.
- Then possibly make a plot, in descending order, showing the top customers with their names on the X axis and their spending on the Y axis. For each customer, how can I also show their shopping categories?
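The three steps above can be sketched roughly as follows. This is only a minimal illustration on hypothetical toy data (the values and customer/category names are made up; the column names `cust`, `cat`, `val0` match the sample generator in this post), not necessarily the best approach for a very large frame.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from this post.
df = pd.DataFrame({
    "cust": ["a", "a", "b", "b", "c", "a", "c"],
    "cat":  ["x", "y", "x", "z", "y", "x", "z"],
    "val0": [5.0, 1.0, 7.0, 2.5, 0.5, 3.0, 0.2],
})

n = 2  # number of top customers to keep

# Step 1: total spending per customer, largest first.
top_customers = df.groupby("cust")["val0"].sum().nlargest(n)

# Step 2: per-category totals, restricted to those customers.
top_df = df[df["cust"].isin(top_customers.index)]
by_cat = top_df.groupby(["cust", "cat"])["val0"].sum()

# Step 3: wide table (customers as rows, categories as columns),
# with rows ordered by total spending, descending.
wide = by_cat.unstack(fill_value=0).loc[top_customers.index]
print(wide)
```

With matplotlib installed, `wide.plot.bar(stacked=True)` should then draw one bar per customer, split by category, which is one possible way to show the categories in the same plot.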
This would require pivoting and sorting. Below is a sample-data generator, borrowed from another post.
import numpy as np
import pandas as pd
add = np.char.add  # np.char is the public path; importing from numpy.core is deprecated in NumPy 2.0
np.random.seed(42)
n = 20
cols = np.array(['cust', 'cat'])
arr1 = (np.random.randint(5, size=(n, 2)) // [2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 1).round(2)).add_prefix('val')
)
print(df)
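# Total spending per customer.
df.pivot_table(index=['cust'], values=['val0'], aggfunc=[np.sum])
# Number of purchases and total spending per (customer, category).
df.pivot_table(index=['cust', 'cat'], values=['val0'], aggfunc=[np.size, np.sum])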
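# Based on the previous line, the customers should be ordered cust1, cust0, cust2 (descending by total spending), with the categories of each customer also in descending order of spending. How can I do this? The following is the desired output in this case.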
             size   sum
             val0  val0
cust  cat
cust1 cat4    6.0  4.27
      cat3    2.0  1.07
      cat2    2.0  0.98
      cat0    2.0  0.44
      cat1    2.0  0.43
cust0 cat1    1.0  0.94
      cat4    1.0  0.91
      cat2    1.0  0.66
      cat3    1.0  0.03
cust2 cat1    2.0  1.25
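One possible way to get this kind of ordering is to attach each customer's overall total as a temporary helper column and sort on it. The sketch below uses hypothetical toy data (not the generated frame above) and `groupby(...).agg(...)` instead of `pivot_table`; the helper column name `cust_total` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as in this post.
df = pd.DataFrame({
    "cust": ["cust0", "cust1", "cust1", "cust2", "cust0", "cust1"],
    "cat":  ["cat1", "cat4", "cat3", "cat1", "cat4", "cat4"],
    "val0": [0.9, 2.0, 1.0, 1.2, 0.8, 2.2],
})

# Count and total per (customer, category).
out = df.groupby(["cust", "cat"])["val0"].agg(["size", "sum"])

# Helper column: each customer's overall total, repeated on every row.
out["cust_total"] = out.groupby(level="cust")["sum"].transform("sum")

# Sort by customer total, then by category total, both descending,
# and drop the helper column again.
out = (
    out.sort_values(["cust_total", "sum"], ascending=False)
       .drop(columns="cust_total")
)
print(out)
```

Note that if two customers happen to have exactly the same total, their rows could interleave; adding the `cust` index level as an extra sort key would keep each customer's rows together.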
Thank you very much!