4

I want to identify the way to calculate the difference between every n and n-2 rows, for each group. Let suppose below is my data:

Month, Laptop, Sales
Jan, HP, 1000
Feb, HP, 2000
Mar, HP, 1300
April, HP, 5000
Jan, Samsung, 1200
Feb, Samsung, 2500
Mar, Samsung, 1100
April, Samsung, 4500

Now, I need the output like below, where i am taking the difference between every n and n-2 rows, and finding the result for each group based on its monthly data.

Month, Laptop, Sales, difference
Jan, HP, 1000 , NA
Feb, HP, 2000 , NA
Mar, HP, 1300 , 300
April, HP, 5000 , 3000,
Jan, Samsung, 1200 , NA
Feb, Samsung, 2500 , NA
Mar, Samsung, 1100 , -100
April, Samsung, 4500 , 2000
Jaap
  • 81,064
  • 34
  • 182
  • 193
Jay
  • 835
  • 1
  • 6
  • 11
  • Possible duplicate of [Calculate difference between values in consecutive rows by group](https://stackoverflow.com/questions/14846547/calculate-difference-between-values-in-consecutive-rows-by-group) – denis Apr 29 '18 at 17:47

4 Answers4

5

Using dplyr::lag function, you can do

df <- df %>%
  group_by(Laptop) %>%
  mutate(difference = Sales - lag(Sales, 2))
df

# # A tibble: 8 x 4
# # Groups:   Laptop [2]
#   Month Laptop  Sales difference
#   <chr> <chr>   <int>      <int>
# 1 Jan   HP       1000         NA
# 2 Feb   HP       2000         NA
# 3 Mar   HP       1300        300
# 4 April HP       5000       3000
# 5 Jan   Samsung  1200         NA
# 6 Feb   Samsung  2500         NA
# 7 Mar   Samsung  1100       -100
# 8 April Samsung  4500       2000

Data

t <- "Month, Laptop, Sales
Jan, HP, 1000
Feb, HP, 2000
Mar, HP, 1300
April, HP, 5000
Jan, Samsung, 1200
Feb, Samsung, 2500
Mar, Samsung, 1100
April, Samsung, 4500"

df <- read.table(text = t, header = T, sep = ",", strip.white = TRUE)
hpesoj626
  • 3,529
  • 1
  • 17
  • 25
  • This answer helps me solve a similar issue. But I have a remaining problem. I am doing a sum with the row n and n-1. But the first value is `NA`. How can I obtain an integer instead, like if it was a sum with a 0 for the missing line. In this example, I would run `df <- df %>% group_by(Laptop) %>% mutate(tot = Sales + lag(Sales, 1))` and I would like the result to be 1000 for the first line and 1200 fir the 5th line. I am not 100% sure it is appropriate to add a comment 2 years later, but I don't want to create a duplicate. If someone comes accross, help would be appreciated. Thx – Mata Sep 27 '21 at 15:27
2

One can use diff function along with dplyr to get the desired output.

lag = 2
df %>% group_by(Laptop) %>%
  mutate(difference = c(rep(NA,lag), diff(Sales, lag)))

# # A tibble: 8 x 4
# # Groups: Laptop [2]
#   Month Laptop  Sales difference
#   <chr> <chr>   <int>      <int>
# 1 Jan   HP       1000         NA
# 2 Feb   HP       2000         NA
# 3 Mar   HP       1300        300
# 4 April HP       5000       3000
# 5 Jan   Samsung  1200         NA
# 6 Feb   Samsung  2500         NA
# 7 Mar   Samsung  1100      - 100
# 8 April Samsung  4500       2000

Data:

df <- read.table(text =
"Month, Laptop, Sales
Jan, HP, 1000
Feb, HP, 2000
Mar, HP, 1300
April, HP, 5000
Jan, Samsung, 1200
Feb, Samsung, 2500
Mar, Samsung, 1100
April, Samsung, 4500",
header = TRUE, sep = ",", strip.white = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
0

You can use n() that gives access to the last row number of the group in dplyr, and its equivalent .N in data.table

library(dplyr)    
df %>%
  group_by(Laptop) %>%
  mutate(difference =  c(NA,NA,Sales[3:n()]) - c(NA,NA,Sales[1:(n()-2)]) )

# A tibble: 8 x 4
# Groups:   Laptop [2]
   Month  Laptop Sales difference
  <fctr>  <fctr> <int>      <int>
1    Jan      HP  1000         NA
2    Feb      HP  2000         NA
3    Mar      HP  1300        300
4  April      HP  5000       3000
5    Jan Samsung  1200         NA
6    Feb Samsung  2500         NA
7    Mar Samsung  1100       -100
8  April Samsung  4500       2000

With data.table

library(data.table)

setDT(df)[,.(difference =  c(NA,NA,Sales[3:.N]) - c(NA,NA,Sales[1:(.N-2)]) ), by = Laptop]

    Laptop difference
1:      HP         NA
2:      HP         NA
3:      HP        300
4:      HP       3000
5: Samsung         NA
6: Samsung         NA
7: Samsung       -100
8: Samsung       2000
denis
  • 5,580
  • 1
  • 13
  • 40
0

in base R:

df$difference <- unlist(tapply(df$Sales,df$Laptop,function(x) c(NA,NA,diff(x,2))))

#   Month  Laptop Sales difference
# 1   Jan      HP  1000         NA
# 2   Feb      HP  2000         NA
# 3   Mar      HP  1300        300
# 4 April      HP  5000       3000
# 5   Jan Samsung  1200         NA
# 6   Feb Samsung  2500         NA
# 7   Mar Samsung  1100       -100
# 8 April Samsung  4500       2000
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167