Adding a column thats result of difference in consecutive rows in pandas

Question

Lets say I have a dataframe like this

0,1,2,3 are times, a, c, e, g is one time series and b, d, f, h is another time series. I need to be able to add two columns to the orignal dataframe which is got by computing the differences of consecutive rows for certain columns.

So i need something like this

    A   B   dA
0   a   b  (a-c)
1   c   d  (c-e)
2   e   f  (e-g)
3   g   h   Nan

I saw something called diff on the dataframe/series but that does it slightly differently as in first element will become Nan.

score 155 · Accepted Answer · edited Mar 29 '21 at 12:25

155

Use shift.

df['dA'] = df['A'] - df['A'].shift(-1)

edited Mar 29 '21 at 12:25

Jan D.M.

2,284
2
20
32

answered Apr 17 '14 at 20:43

exp1orer

11,481
7
38
51

what if 'A' contains list elements? – LNRD.CLL Dec 03 '16 at 06:36
It works as long as that operation makes sense for whatever values are in your dataframe! For example, if each "cell" in your DataFrame is a list, you can do `df['A'] + df['A'].shift()`, but you can't subtract since subtraction isn't defined for lists. – exp1orer Dec 06 '16 at 14:17
1

What can you do about the 'NaN' in the first/last row? – André Fernandes Jan 19 '17 at 20:40
1

@AndréFernandes what would you like to be there? You can always do a `fillna` afterwards, but if you're shifting down, what would be a sensible value for the first row? – exp1orer Jan 20 '17 at 20:18
For example, hourly differences of an cumulative counter. The first row has to be filled with something, either 0 or the value from another of the row's fields. – André Fernandes Jan 20 '17 at 23:07
Okay just use `fillna` on the result then. Eg `df['dA'].fillna(0)` – exp1orer Jan 21 '17 at 18:49
i am getting SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead – Parthiban Rajendran Nov 25 '18 at 14:12
1

@PaariVendhan That's a very common issue with pandas, and has to do with how you've defined your dataframe elsewhere. It's not related to the `shift` method. Here's a very long article https://www.dataquest.io/blog/settingwithcopywarning/ and a shorter stackoverflow answer that may be helpful https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – exp1orer Nov 26 '18 at 21:15
Why not just `df['dA'] = df['A'] - df['B']` ? Any benefit of using this expression? – outsider Jun 15 '20 at 05:48
@outsider you may want to re-read the question. Column `dA` does not depend on column `B`. – exp1orer Jun 16 '20 at 16:15

score 51 · Answer 2 · answered Apr 17 '14 at 20:45

51

You could use diff and pass -1 as the periods argument:

>>> df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
>>> df["dA"] = df["A"].diff(-1)
>>> df
   A   B  dA
0  9  12   5
1  4   7   2
2  2   5   1
3  1   4 NaN

[4 rows x 3 columns]

answered Apr 17 '14 at 20:45

DSM

342,061
65
592
494

1

how can I manage the same problem when 'A' has list elements only instead of scalar ? – LNRD.CLL Dec 03 '16 at 06:37
2

@LeoCella You can manage them by not allowing them to be in a list. – cs95 Jan 28 '19 at 20:48
The accepted answer is interesting, but I think this is the better answer – Marius Apr 28 '19 at 08:19

score 3 · Answer 3 · edited Feb 14 '19 at 18:24

3

When using data in CSV, this would work perfectly:

my_data = pd.read_csv('sale_data.csv')
df = pd.DataFrame(my_data)
df['New_column'] = df['target_column'].diff(1)
print(df) #for the console but not necessary

edited Feb 14 '19 at 18:24

Mark Rotteveel

100,966
191
140
197

answered Feb 14 '19 at 15:31

Seth Okeyo

59
2

Seth Okeyo · Answer 4 · 2019-02-15T12:52:12.523

1

Rolling differences can also be calculated this way:

df=pd.DataFrame(my_data)
my_data = pd.read_csv('sales_data.csv')
i=0
j=1
while j < len(df['Target_column']):
    j=df['Target_column'][i+1] - df['Target_column'][i] #the difference btwn two values in a column.
    i+=1 #move to the next value in the column.
    j+=1 #next value in the new column.
    print(j)

edited Feb 15 '19 at 12:52

answered Feb 14 '19 at 15:52

Seth Okeyo

59
2

this is when you want to calculate the rolling differences in a column in CSV, for example, you want to get the difference between two consecutive values in a column (Target_column) and store the value in a different column(New_column). Rolling differences for example, you have a column called ['Profit'], to get the differences to use in calculating marginal revenue, you will do value2 - value1, value3- Value2, Value 4 - Value 3.... etc – Seth Okeyo Feb 15 '19 at 12:13

Adding a column thats result of difference in consecutive rows in pandas

4 Answers4

Linked

Related