I want to sort a dataframe based on the difference of two rows of a single column

Question

I have a dataframe.

      Item Type            Year_Month   Total Cost
        Cereal             Jul-2017     6000
        Cereal             Jun-2017     5000
     Baby Food             Jul-2017     3000
    Baby Food              Jun-2017     2900 
        Snacks             Jul-2017     4500
       Snacks              Jun-2017     4000

I wnat to sort the dataframe according to the difference of two rows of a single column. For example For Cereal the difference is 6000-5000 =1000 and for Snacks the difference is 4500-4000 = 500 and for baby food the difference is 3000- 2900 = 100

So the output should be like

  Item Type            Year_Month   Total Cost
    Cereal             Jul-2017     6000
    Cereal             Jun-2017     5000
    Snacks             Jul-2017     4500
   Snacks              Jun-2017     4000
 Baby Food             Jul-2017     3000
Baby Food              Jun-2017     2900

Use `df.sort_values(by=['Cost'], ascending=False)` – Space Impact Jun 11 '19 at 06:40 — Space Impact, Jun 11 '19 at 06:40
you may introduce more complex example. – Mohamed Thasin ah Jun 11 '19 at 07:46 — Mohamed Thasin ah, Jun 11 '19 at 07:46

score 3 · Answer 1 · 2019-06-12T21:21:37.000

First you need to calculate the differences for each item type. One of the ways, how to do this with pandas would be to use pivot_tables. Here you tell it which dataframe (df), based on which columns to calculate (values="TotalCost"), what function to use to calculate it (aggfunc=np.diff) and how to group them (index=["ItemType"]).

diff = pandas.pivot_table(df, values="TotalCost", index=["ItemType"], aggfunc=np.diff)

You case above only have 2 possible months. If you had more than two, then np.diff would give you values in a list. In this case you have two options. Either you filter the data frame, so there are only two months in it. This can be done like this:

df = df[[a or b for a, b in zip(df["Year_Month"] == "Jul-2017", df["Year_Month"] == "Jun-2017")]]

The other option is that you calculate the mean difference in months. This can be done with the following function, which you would then replace np.diff with:

def mean_diff(l):
    return np.mean(np.diff(l))

Then you can use this to calculate the difference for each element:

df["Diff"] = [float(diff.loc[d]) for d in df["ItemType"]]

After that, you just sort by the difference (and then by item, in case there are multiple items with the same difference)

df.sort_values(by=["Diff", "ItemType", "Year_Month"]).drop(columns = 'Diff')

IT is throwing one error i.e. Must produce aggregated value in case of using np.diff. If I replace it with np.mean or np.sum it wont throw any error? — Kanika Singhal, Jun 12 '19 at 04:39
Do you have multiple months in the table (more than two)? In this case the np.diff would give you the list of values, while the np.mean and np.sum would give you one element. — , Jun 12 '19 at 21:16

Raymond Toh · Answer 2 · 2019-06-11T07:51:10.150

2

You can try this code.

df['diff'] = df.groupby('Item_Type').diff(periods=-1)

df.loc[df['diff'].isnull(),'diff'] = df['diff'].shift(1)

df = df.sort_values(by = ['diff'], ascending = False)

df = df.drop(columns = 'diff')

df

edited Jun 11 '19 at 07:51

answered Jun 11 '19 at 07:45

Raymond Toh

779
1
8
27

Why we are shifting the dataframe? – Kanika Singhal Jun 12 '19 at 04:38
Basically, i am copying the value to replace the NaN value. – Raymond Toh Jun 12 '19 at 10:12

I want to sort a dataframe based on the difference of two rows of a single column

2 Answers2