
I have created the following dataset:

import pandas as pd

dataset1 = { 'srid':[1,2,3,1,5],
            'custid':[11,12,43,12,34],
            'orderdate':["1/2/2019","1/2/2019","2/2/2019","1/2/2019","1/2/2019"],
            'Rev':[100,101,102,103,17]
}

df1 = pd.DataFrame(dataset1)

I have to mark every sales representative, for every date present in Dataset 1, as:

- Excellent: his total revenue for the day is 1.5 times the average revenue of all sales representatives for that day, or greater.
- Good: his total revenue for the day is less than 1.5 times and greater than or equal to 1.1 times that day's average.
- Average: less than 1.1 times and greater than or equal to 0.9 times that day's average.
- Poor: less than 0.9 times that day's average.

Output dataset: Sales Representative ID, Order Date, Marking.
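For example, on 1/2/2019 the average revenue across that day's rows is (100 + 101 + 103 + 17) / 4 = 80.25. Sales representative 1 appears twice that day, so his total is 100 + 103 = 203, which is 203 / 80.25 ≈ 2.53 times the average, i.e. Excellent under these rules.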

What I tried is:

g = df1.groupby('orderdate')
ans = {}
for od, od_df in g:
    ans[od] = list()
    x = od_df["Rev"].mean()          # average revenue for the day
    s = set(od_df["srid"].tolist())  # sales reps active that day
    for i in s:
        p = od_df[od_df["srid"] == i]["Rev"].sum()  # rep's total revenue for the day
        val = p / x
        if val >= 1.5:
            ans[od].append([i, od, "Excellent"])
        elif 1.1 <= val < 1.5:
            ans[od].append([i, od, "good"])
        elif 0.9 <= val < 1.1:
            ans[od].append([i, od, "avg"])
        else:
            ans[od].append([i, od, "poor"])

But that is a lot to write, and it will be slow on a big dataset. How can I optimise it further?

yolon bsn

3 Answers


I can't really test the speed of this implementation against yours, since with 5 values a benchmark measures overhead more than anything else. However, bare loops tend to be rather inefficient in pandas. You can get each row's revenue relative to its day's mean like this:

In [15]: df.groupby('orderdate').apply(lambda _df: _df['Rev'] / _df['Rev'].mean())                                                                                                                         
Out[15]: 
orderdate   
1/2/2019   0    1.2461
           1    1.2586
           3    1.2835
           4    0.2118
2/2/2019   2    1.0000
Name: Rev, dtype: float64

and use pd.cut to transform it to an ordinal scale:

In [28]: df['RevMark'] = pd.cut(df.groupby('orderdate').apply(lambda _df: _df['Rev'] / _df['Rev'].mean()).sort_index(level=1).values, [0,0.9,1.1,1.5,np.inf], labels=['poor', 'avg', 'good', 'excellent']) 

In [29]: df                                                                                                                                                                                                
Out[29]: 
   srid  custid orderdate  Rev RevMark
0     1      11  1/2/2019  100    good
1     2      12  1/2/2019  101    good
2     3      43  2/2/2019  102     avg
3     1      12  1/2/2019  103    good
4     5      34  1/2/2019   17    poor

The sort_index is necessary because after the grouping the values are ordered by date, so assigning them back directly would pair them with the wrong rows. The second argument of cut gives the borders of your desired intervals, and the labels are what you call Markings.
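To make that concrete, here is a quick check (hypothetical prompt numbers, same df as above). Without the sort, the values come out grouped by date rather than in the original row order:

In [16]: ratios = df.groupby('orderdate').apply(lambda _df: _df['Rev'] / _df['Rev'].mean())

In [17]: ratios.index.get_level_values(1).tolist()
Out[17]: [0, 1, 3, 4, 2]

In [18]: ratios.sort_index(level=1).index.get_level_values(1).tolist()
Out[18]: [0, 1, 2, 3, 4]

Assigning the unsorted .values back to df would therefore give row 2 a ratio that belongs to row 3.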

In terms of time, you would need to measure that on a sufficiently large sample yourself.
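As a starting point for such a measurement, a minimal sketch (the sizes and value ranges here are arbitrary stand-ins, not taken from the question):

In [30]: import numpy as np

In [31]: n = 100_000

In [32]: big = pd.DataFrame({'srid': np.random.randint(1, 500, n),
    ...:                     'orderdate': np.random.randint(0, 100, n),
    ...:                     'Rev': np.random.randint(1, 200, n)})

In [33]: %timeit big.groupby('orderdate').apply(lambda _df: _df['Rev'] / _df['Rev'].mean())

In [34]: %timeit big['Rev'] / big.groupby('orderdate')['Rev'].transform('mean')

The transform variant is usually the faster idiom for a per-group mean, but measure on your own data.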

maow

Simply add a calculated column, then apply a marking function to the result:

df1['mean'] = df1.loc[:, ['orderdate', 'Rev']].groupby('orderdate').transform('mean')  # daily mean revenue
df1['Representative'] = df1['Rev'] / df1['mean']  # each row's revenue relative to the daily mean

def rep(x):
    if x >= 1.5:
        return 'Excellent'
    elif 1.1<=x<1.5:
        return 'good'
    elif 0.9<=x<1.1:
        return 'avg'
    else:
        return 'poor'

df1['Marking'] = df1['Representative'].apply(rep)
df1
   srid  custid orderdate  Rev    mean  Representative Marking
0     1      11  1/2/2019  100   80.25        1.246106    good
1     2      12  1/2/2019  101   80.25        1.258567    good
2     3      43  2/2/2019  102  102.00        1.000000     avg
3     1      12  1/2/2019  103   80.25        1.283489    good
4     5      34  1/2/2019   17   80.25        0.211838    poor
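As a variation, this transform-based ratio pairs naturally with pd.cut from the first answer, replacing the Python-level apply with one vectorized binning. A sketch (note right=False, so values exactly on the 0.9/1.1/1.5 borders land in the upper bin, matching the >= conditions in the question; assumes numpy is imported as np):

ratio = df1['Rev'] / df1.groupby('orderdate')['Rev'].transform('mean')
df1['Marking'] = pd.cut(ratio, [0, 0.9, 1.1, 1.5, np.inf], right=False,
                        labels=['poor', 'avg', 'good', 'Excellent'])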
r-beginners
  • Thanks, that is really helpful. Can you suggest some good resources for learning to write complex queries using groupby and transform? – yolon bsn Jun 25 '20 at 08:08
  • I don't know of any site that explains it in detail either, but I just looked it up and found [SO](https://stackoverflow.com/questions/33730092/what-is-the-sql-equivalent-to-pandas-transform), [SO](https://stackoverflow.com/questions/22105452/what-is-the-equivalent-of-sql-group-by-having-on-pandas), [TDS](https://towardsdatascience.com/group-by-in-pandas-and-sql-94777ce48658), etc. – r-beginners Jun 25 '20 at 08:19

I would do something like this:

list_df = []
for date in df.orderdate.unique():
    df_cur = df[df.orderdate == date].copy()  # copy to avoid SettingWithCopyWarning
    mean = df_cur.Rev.mean()                  # that day's average revenue

    def get_rank(rev):
        if rev >= 1.5 * mean:
            return "Excellent"
        elif rev >= 1.1 * mean:
            return "good"
        elif rev >= 0.9 * mean:
            return "avg"
        else:
            return "poor"

    df_cur["Ranking"] = df_cur.Rev.apply(get_rank)
    list_df.append(df_cur)
df_final = pd.concat(list_df)

I mean, you may need to tweak it, but I think it should work.

For each date, I take the corresponding dataframe and compute the ranking for that day. At the end, I concatenate all the dataframes to cover all the dates.

xzelda
  • Why are you defining a function in a loop? Why call `apply` in a loop when it applies to a whole Series? – roganjosh Jun 25 '20 at 08:00
  • The rank and the ranking depend on the date, so the function that gives the rank changes with every date. – xzelda Jun 25 '20 at 08:03