Create a column based on computation of a another column

Question

I would like to create another column based on the sales for the previous week. Here is the sample input:

df = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
                   'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
                   'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
                   'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
                  'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20]})
print(df)

Based on this, i would like to create another column which is nothing but the sales of the previous week. Here is the sample of the desired output

    df_output = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
                       'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
                       'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
                       'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
                      'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20],
                      'Sales_previous_week':[0,0,100,200,300,400,100,200,300,400,0,0,100,200,300,400,100,200,300,400]})
print(df_output)

Am finding it hard to create what would be a self join. The previous week should only be influenced by sales file and i should be able to retain the "sales_others" column

--Edit Adding original code

CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)

Renaming columns

CR_UK_NL_Weeklevel.columns.values[4] = 'CURRENT_WEEK'
CR_UK_NL_Weeklevel.columns.values[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns.values

Trying to implement solution:

CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)

[78]:

CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)

--Error

--------------------------------------------------------------------------- KeyError Traceback (most recent call last) in ----> 1 CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0) 2 print(CR_UK_NL_Weeklevel) ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in getitem(self, key) 273 else: 274 if key not in self.obj: --> 275 raise KeyError("Column not found: {key}".format(key=key)) 276 return self._gotitem(key, ndim=1) 277 KeyError: 'Column not found: CURRENT_WEEK'

jezrael · Accepted Answer · 2020-04-27T10:35:07.427

If there are always same categories per week and consecutive weeks use DataFrameGroupBy.shift grouping by Category column:

df['Sales_PREVIOUS'] = df.groupby('Category')['Sales'].shift(fill_value=0)
print (df)
   Week Category  Sales  Sales_PREVIOUS
0     1      Red    100               0
1     1    White    200               0
2     2      Red    300             100
3     2    White    400             200
4     3      Red    100             300
5     3    White    200             400
6     4      Red    300             100
7     4    White    400             200
8     5      Red    100             300
9     5    White    200             400

Another idea with pivoting is use DataFrame.pivot, then DataFrame.shift with DataFrame.stack for Series and last add new column by DataFrame.join:

s = df.pivot('Week','Category','Sales').shift(fill_value=0).stack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['Week','Category'])

EDIT:

With new data add column id:

df['Sales_PREVIOUS'] = df.groupby(['id','Category'])['Sales'].shift(fill_value=0)

And for second solution:

s = df.set_index(['Week','id','Category'])['Sales'].unstack([1,2]).shift(fill_value=0).unstack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['id','Category','Week'])
print (df)
    Week Category  id  Sales  Sales_others  Sales_PREVIOUS WEEK
0      1      Red   1    100            10                    0
1      1    White   1    200            20                    0
2      2      Red   1    300            30                  100
3      2    White   1    400            40                  200
4      3      Red   1    100            10                  300
5      3    White   1    200            20                  400
6      4      Red   1    300            30                  100
7      4    White   1    400            40                  200
8      5      Red   1    100            10                  300
9      5    White   1    200            20                  400
10     1      Red   2    100            10                    0
11     1    White   2    200            20                    0
12     2      Red   2    300            30                  100
13     2    White   2    400            40                  200
14     3      Red   2    100            10                  300
15     3    White   2    200            20                  400
16     4      Red   2    300            30                  100
17     4    White   2    400            40                  200
18     5      Red   2    100            10                  300
19     5    White   2    200            20                  400

EDIT:

Problem is with columns names, use:

cols = CR_UK_NL_Weeklevel.columns.tolist()
cols[4] = 'CURRENT_WEEK'
cols[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns = cols

This works perfectly, thanks again Jezrael, however can i also retain a column from the dataset. Lets say i had a column called ID in the dataset, it doesnot have any relevance to the current exercise, it will be great if i can just retain it. — vevek seetharaman, Apr 27 '20 at 08:32
When i add 1 more column, the script is not working. I have made edit to the question — vevek seetharaman, Apr 27 '20 at 09:48
i used the first! but it will be great if you can help with both. — vevek seetharaman, Apr 27 '20 at 09:54
Both the solution works when i implement it on the sample set. but when i implement it in our original dataset it says "Key Error: Column does not exist[Sales]". Am not sure how to approach this, can data type be a reason? becoz am sure the column is present in the dataset — vevek seetharaman, Apr 27 '20 at 10:05
df.columns.str.replace(' ', ''). i did to make sure there were no spaces in between or after. both the output from print (df.columns.tolist()) and script is exactly the same — vevek seetharaman, Apr 27 '20 at 10:12
['CALWEEK' 'RETAIL_SITE_ID' 'CATEGORY_NAME' 'LAST_YEAR_WEEK' 'CURRENT_WEEK'] — vevek seetharaman, Apr 27 '20 at 10:27
@vevekseetharaman - I think problem is [this](https://stackoverflow.com/questions/43291781/after-rename-column-get-keyerror), edited answer for solution - last pragraph — jezrael, Apr 27 '20 at 10:42
This worked! You are amazing! I didnt know i cannot rename columns like that — vevek seetharaman, Apr 27 '20 at 10:59

Create a column based on computation of a another column

1 Answers1