36

Here is a simple example of my problem:

> df <- data.frame(ID=1:10,Score=4*10:1)
> df
       ID Score
    1   1    40
    2   2    36
    3   3    32
    4   4    28
    5   5    24
    6   6    20
    7   7    16
    8   8    12
    9   9     8
    10 10     4
    > diff(df)

Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : 
  non-numeric argument to binary operator

Can anyone tell me why this error occurs?

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
ManInMoon
  • 6,795
  • 15
  • 70
  • 133
  • 1
    note that ``aPaulT`` answered your question and it would help the site if you selected it as correct by clicking on the tick mark. Or any of the other excellent solutions, if you think that would be more helpful. – PatrickT Nov 11 '18 at 16:18

8 Answers8

44

diff wants a matrix or a vector rather than a data frame. Try

data.frame(diff(as.matrix(df)))
aPaulT
  • 593
  • 3
  • 8
  • +1 nice - I suppose it makes sense since a matrix is just a vector with a 2nd dim attribute. – Simon O'Hanlon Apr 25 '13 at 10:20
  • 1
    For reference, some [benchmarks](http://rpubs.com/mrdwab/bmft) on a 1000000 row by 5 column dataset. – A5C1D2H2I1M1N2O1R2T1 Apr 25 '13 at 11:30
  • Note that unlike the ``head/tail`` and ``nrow(df)`` solutions, this will not work on certain types of non-numeric objects, such as POSIXlt. (A ``dplyr`` solution also fails with POSIXlt objects) – PatrickT Nov 11 '18 at 16:13
  • Indeed benchmarks suggest ``diff(as.matrix)`` and ``data.table`` are [fastest](https://gist.github.com/ptoche/7917cfcec9391656d8bdf727f3e1d808). But I'll probably go on using ``head/tail`` or ``seq_along`` for years to come. – PatrickT Nov 12 '18 at 15:20
28

Perhaps you are looking for something like this:

> tail(df, -1) - head(df, -1)
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4

You can subtract or add two data.frames together if they are the same dimensions. So, what we are doing here is subtracting one data.frame that is missing the first row (tail(df, -1)) and one that is missing the last row (head(df, -1)) and subtracting them.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
14

Another option using dplyr would be using mutate_each to loop through all the columns, get the difference of the column (.) with the lag of the column (.) and remove the NA element at the top with na.omit()

library(dplyr)
df %>%
    mutate_each(funs(. - lag(.))) %>%
    na.omit() 

EDIT:

Instead of mutate_each (deprecated - as mentioned by @PatrickT) use mutate_all

df %>%
    mutate_all(funs(. - lag(.))) %>%
    na.omit() 

Or with shift from data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), loop through the columns (lapply(.SD, ..)) and get the difference between the column (x) and thelag(shiftby default gives thelagastype = "lag"`). Remove the first observation i.e. NA element.

library(data.table)
setDT(df)[, lapply(.SD, function(x) (x- shift(x))[-1])]
akrun
  • 874,273
  • 37
  • 540
  • 662
12

Because df works on vector or matrix. You can use apply to apply the function across columns like so:

 apply( df , 2 , diff )
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4

It seems unlikely that you want to calculate the difference in sequential IDs, so you could choose to apply it on all columns except the first like so:

apply( df[-1] , 2 , diff )

Or you could use data.table (not that it adds anything here I just really want to start using it!), and I am again assuming that you do not want to apply diff to the ID column:

DT <- data.table(df)
DT[ , list(ID,Score,Diff=diff(Score))  ]
    ID Score Diff
 1:  1    40   -4
 2:  2    36   -4
 3:  3    32   -4
 4:  4    28   -4
 5:  5    24   -4
 6:  6    20   -4
 7:  7    16   -4
 8:  8    12   -4
 9:  9     8   -4
10: 10     4   -4

And thanks to @AnandaMahto an alternative syntax that gives more flexibility to choose which columns to run it on could be:

DT[, lapply(.SD, diff), .SDcols = 1:2]

Here .SDcols = 1:2 means you want to apply the diff function to columns 1 and 2. If you have 20 columns and didn't want to apply it to ID you could use .SDcols=2:20 as an example.

Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • +1 for the edits. It will make it less confusing for future visitors to this question. – A5C1D2H2I1M1N2O1R2T1 Apr 25 '13 at 10:55
  • I tried this but the last row is 0. how do I do the difference from yesterday - today and put the value in today's row so that the last columns are not 0? +1 – floss Nov 07 '20 at 20:59
5

I would like to show an alternative way for doing such kind of things even often I have the feeling it is not appreciated doing this in that way: using sql.

sqldf(paste("SELECT a.ID,a.Score"
            ,"      , a.Score - (SELECT b.Score"
            ,"                   FROM df b"
            ,"                   WHERE b.ID < a.ID"
            ,"                   ORDER BY b.ID DESC"
            ,"                   ) diff"
            ," FROM df a"
            )
      )

The code seems complicated but it is not and it has some advantage, as you can see at the results:

    ID Score diff
 1   1    40 <NA>
 2   2    36 -4.0
 3   3    32 -4.0
 4   4    28 -4.0
 5   5    24 -4.0
 6   6    20 -4.0
 7   7    16 -4.0
 8   8    12 -4.0
 9   9     8 -4.0
 10 10     4 -4.0

One advantage is that you use the original dataframe (without converting into other classes) and you get a data frame (put it in res <- ....). Another advantage is that you have still all rows. And the third advantage is that you can easily consider grouping factors. For example:

df2 <- data.frame(ID=1:10,grp=rep(c("v","w"), each=5),Score=4*10:1)

sqldf(paste("SELECT a.ID,a.grp,a.Score"
            ,"      , a.Score - (SELECT b.Score"
            ,"                   FROM df2 b"
            ,"                   WHERE b.ID < a.ID"
            ,"                         AND a.grp = b.grp"
            ,"                   ORDER BY b.ID DESC"
            ,"                   ) diff"
     ," FROM df2 a"
     )
)


   ID grp Score diff
1   1   v    40 <NA>
2   2   v    36 -4.0
3   3   v    32 -4.0
4   4   v    28 -4.0
5   5   v    24 -4.0
6   6   w    20 <NA>
7   7   w    16 -4.0
8   8   w    12 -4.0
9   9   w     8 -4.0
10 10   w     4 -4.0
giordano
  • 2,954
  • 7
  • 35
  • 57
5

Adding this a few years later for completeness- you can use a simple [.data.frame subseting in order to achieve this too

df[-1, ] - df[-nrow(df), ]
#    ID Score
# 2   1    -4
# 3   1    -4
# 4   1    -4
# 5   1    -4
# 6   1    -4
# 7   1    -4
# 8   1    -4
# 9   1    -4
# 10  1    -4
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • @akrun, you had some nice stuff there.The dplyr one will bring you a lot of upvotes in the long run- guaranteed. Please post – David Arenburg Aug 05 '16 at 08:35
  • Okay then, though it is against my principles :-) – akrun Aug 05 '16 at 08:35
  • Even later... I am new to R and trying to avoid libraries. Everyone suggests 5 different libs for one task which could be solved with plain R. I don't get this mentality of adding dependencies to solve simple problems. – r2p2 Mar 19 '17 at 10:24
  • 1
    @r2p2 Well, this is the result of constant labeling R as "hard to learn" language vs dplyr/etc. as "very intuitive and easy". A big part of the R community, as it stands right now, is like a sheep blindly following the divine leader Hadley Wickham- and you can't do pretty much nothing about it anymore. – David Arenburg Mar 19 '17 at 11:06
1

Look this answer: Compute difference between rows in R and setting in zero first difference

I think that is the easiest way.

df <- data.frame(ID=1:8, x2=8:1, x3=11:18, x4=c(2,4,10,0,1,1,9,12))
df$vardiff <- c(0, diff(df$x4))
df
walves
  • 2,026
  • 6
  • 28
  • 46
0

My personal easy way is to use pivot_wider to create another column then you can use summarise to subtract the difference :D

It's all from dyplr package

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459