2

Here is a simple example:

> df <- data.frame(sn=rep(c("a","b"), 3), t=c(10,10,20,20,25,25), r=c(7,8,10,15,11,17))
> df
  sn  t  r
1  a 10  7
2  b 10  8
3  a 20 10
4  b 20 15
5  a 25 11
6  b 25 17

Expected result is

 sn  t r
1  a 20 3
2  a 25 1
3  b 20 7
4  b 25 2 

I want to group by a specific column ("sn"), leave some columns unchanged ("t" for this example), and apply diff() to remaining columns ("r" for this example). I explored "dplyr" package to try something like:

df1 %>% group_by(sn) %>% do( ... diff(r)...)

but couldn't figure out correct code. Can anyone recommend me a clean way to get expected result?

dmitriy873
  • 121
  • 1
  • 7
  • Hi @dmitriy873 if any of the answers solves your pb you might click on "accept it" so that other people know that it was solved. thanks. – agenis Sep 06 '17 at 12:42

2 Answers2

1

You can do like this (I don't use directly diff because it returns n-1 values):

library(dplyr)
df %>% arrange(sn) %>% group_by(sn) %>% mutate(r = r-lag(r)) %>% slice(2:n())
####       sn     t     r
####   <fctr> <dbl> <dbl>
#### 1      a    20     3
#### 2      a    25     1
#### 3      b    20     7
#### 4      b    25     2

The slice fonction is here to remove the NA rows created by the differenciation at the beginning of each group. One could also use na.omit instead, but it could also remove other lines unintentionally

agenis
  • 8,069
  • 5
  • 53
  • 102
  • Thanks! This is exactly what I was looking for – dmitriy873 Jul 21 '16 at 13:55
  • Using slice() on dataframe seems to be leading to some problems:df1 <- df %>% arrange(sn) %>% group_by(sn) %>% mutate(r = r-lag(r)) %>% slice(2:n()). Then: df1 %>% mutate(r = ifelse(r < 3, NA, r)) has - Error: corrupt 'grouped_df', contains 4 rows, and 6 rows in groups. However, df1$r[df1$r < 3] <- NA does work as expected. Any suggestions? – dmitriy873 Jul 21 '16 at 15:56
  • Ok, i'm not aware of this. Another way is to use `na.omit` as suggested in my post and by akrun, or to replace the slice call by `filter(row_number()>1)` which does the same. – agenis Jul 21 '16 at 16:01
  • @dmitriy873 I don't get the error you mention. I have a quite recent version of R (3.2.3) and `dplyr` (dplyr_0.5.0), maybe that's the difference. I also saw this thread http://stackoverflow.com/questions/30491740/weird-behavior-in-dplyr-slice-for-r that dates back to 2015 – agenis Jul 21 '16 at 16:04
  • It makes sense. My R is 3.2.2 and dplyr is 0.4.3. I will update R. Thank you! – dmitriy873 Jul 21 '16 at 18:32
0

We can also use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), set the key as 'sn' (it will order it based on 'sn'), grouped by 'sn', get the difference of 'r' with the lag of 'r' (i.e. shift in data.table does that) and remove the NA rows with `na.rows.

library(data.table)
na.omit(setDT(df, key = "sn")[,  r := r-shift(r) , sn])
#   sn  t r
#1:  a 20 3
#2:  a 25 1
#3:  b 20 7 
#4:  b 25 2

Or if we are using diff, then make sure that length are the same as the diff output will be one less than the length of the original vector. So, we can pad with NA and later remove by filter

library(dplyr)
df %>%
   arrange(sn) %>%
   group_by(sn) %>%
   mutate(r = c(NA, diff(r))) %>%
   filter(!is.na(r))
#      sn     t     r 
#    <fctr> <dbl> <dbl>
#1      a    20     3
#2      a    25     1
#3      b    20     7
#4      b    25     2
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! Both methods work well. Right now, my data is not big (~50,000) but it is growing fast. In your opinion, which method will perform better on large data sets? – dmitriy873 Jul 21 '16 at 14:04
  • @dmitriy873 Both are good, but the data.table should be faster. – akrun Jul 21 '16 at 17:32