
I have a dataset with 3 variables. Two are factor variables (Policy_num and presidentnumber). The third variable is a continuous value (pred). I would like to create a new variable that is the first difference of pred for each combination of presidentnumber and Policy_num. The following code runs, but it gives me the first difference of pred by presidentnumber only. The data frame is named dydx. This seems so simple, and yet I'm stumped.

newobject2 = dydx %>%
   group_by(Policy_num,presidentnumber) %>%
   mutate(dydx2 = pred-lag(pred))

produces this:

   ob Policy_num   Pres    pred     dydx2
   1 SocialWelfare Reagan  5.215365  NA
   2 SocialWelfare Reagan  4.373108 -0.8422576
   3 Agriculture   Reagan  5.180910  0.8078020
   4 Agriculture   Reagan  4.338652 -0.8422576
   5 Commerce      Reagan  5.206816  0.8681638
   6 Commerce      Reagan  4.364558 -0.8422576

It should look like this:

ob Policy_num   Pres    pred     dydx2
 1 SocialWelfare Reagan  5.215365  NA
 2 SocialWelfare Reagan  4.373108 -0.8422576
 3 Agriculture   Reagan  5.180910  NA
 4 Agriculture   Reagan  4.338652 -0.8422576
 5 Commerce      Reagan  5.206816  NA
 6 Commerce      Reagan  4.364558 -0.8422576

Here is code for a reproducible example:

 library(dplyr)

 # Example data: 3 presidents x 2 policy areas x 2 observations each
 presidentnumber = c("Reagan", "Reagan", "Reagan", "Reagan", "Bush", "Bush", 
 "Bush", "Bush", "Clinton", "Clinton", "Clinton", "Clinton")
 Policy_num = c("Agriculture", "Agriculture", "Social", "Social", "Agriculture", 
 "Agriculture", "Social", "Social", "Agriculture", "Agriculture", "Social", 
 "Social")
 pred = 1:12
 ND = data.frame(presidentnumber, Policy_num, pred)

 # First difference of pred within each Policy_num/presidentnumber group
 newobject4 = ND %>%
   group_by(Policy_num, presidentnumber) %>%
   mutate(dydx2 = c(NA, diff(pred)))

What this produces is this:

  Obs presidentnum Policy_num pred dydx2
  1   Reagan       Agriculture 1   NA
  2   Reagan       Agriculture 2   1
  3   Reagan       Social      3   1
  4   Reagan       Social      4   1
  5   Bush         Agriculture 5   1
  6   Bush         Agriculture 6   1
  7   Bush         Social      7   1
  8   Bush         Social      8   1
  9   Clinton      Agriculture 9   1
 10   Clinton      Agriculture 10  1
 11   Clinton      Social      11  1
 12   Clinton      Social      12  1

However, every other 1 above should be NA.

Heather Ba
  • 1
    Can you please clarify what you mean by "first difference of pred"? Is it just `pred[2] - pred[1]` in each group? If so, then `dydx %>% group_by( Policy_num, presidentnumber ) %>% summarize( dydx2 = pred[2] - pred[1] )` should work. It would also help to see [exemplar data and expected output](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Artem Sokolov Jul 15 '18 at 22:39
  • Yes, I am looking for the second observation in each group minus the first observation in each group. Your code produced one observation. I just need to generalize it for the rest of the obs. Essentially, the output above should have NA where the positive values are. – Heather Ba Jul 15 '18 at 22:49
  • have a look at the link which @ArtemSokolov provided. Please provide example data. Also, you may want to generalise your question. It helps to understand a question when using simple variable names such as 'a' and 'b' – tjebo Jul 15 '18 at 23:00
  • 1
    You might be running into some masking issues between `stats::lag()` and `dplyr::lag()`, as discussed [in another question](https://stackoverflow.com/questions/43772134/dplyr-how-to-lag-by-group). Try replacing `pred - lag(pred)` with `pred - dplyr::lag(pred)`, which explicitly uses the correct function. – Artem Sokolov Jul 15 '18 at 23:01
  • Another potentially relevant question: https://stackoverflow.com/questions/28235074/dplyr-lead-and-lag-wrong-when-used-with-group-by – Artem Sokolov Jul 15 '18 at 23:02
  • Lastly, an alternative to `x - lag(x)` is `diff(x)`. I just tried it on some mock data, and it seems to work fine: `dydx %>% group_by( ... ) %>% mutate( dydx2 = c(NA, diff(pred)) )` – Artem Sokolov Jul 15 '18 at 23:09
  • I tried replacing the code with pred - dplyr::lag(pred), but got the same result. – Heather Ba Jul 15 '18 at 23:10
  • using %>% mutate( dydx2 = c(NA, diff(pred)) ) gave me the exact same output. – Heather Ba Jul 15 '18 at 23:12
  • Hmm... can you please put together an [MCVE](https://stackoverflow.com/help/mcve)? Without exemplar data, it's hard to say where the problem is. – Artem Sokolov Jul 15 '18 at 23:13
  • Working on it. The other thing I tried was to create a new grouping variable that was the interaction of the two factor variables I am trying to group by, and group by that single new variable instead, but I still got the same result. There are numbers where there should be NAs. – Heather Ba Jul 15 '18 at 23:16
  • Maybe you had loaded `plyr` along with `dplyr`. After the group_by step try `%>% dplyr::mutate(dydx2 =` – akrun Jul 15 '18 at 23:27
  • No, it's not that. I only have dplyr loaded. – Heather Ba Jul 15 '18 at 23:34
  • I added MCVE code above. – Heather Ba Jul 15 '18 at 23:35
  • I thought I figured it out. But I didn't. I submitted an answer, but it isn't correct. I thought using arrange() was necessary but it isn't. – Heather Ba Jul 15 '18 at 23:46
  • Seems like this may be related: https://stackoverflow.com/questions/28235074/dplyr-lead-and-lag-wrong-when-used-with-group-by – Tyler Smith Jul 16 '18 at 00:37

1 Answer


When I run your example code with only dplyr loaded, I get the result you expect:

require(dplyr)
newobject4 <- ND %>% group_by(Policy_num, presidentnumber ) %>% mutate(dydx2 = c(NA, diff(pred)))

newobject4
# A tibble: 12 x 4
# Groups:   Policy_num, presidentnumber [6]
   presidentnumber Policy_num   pred dydx2
   <fct>           <fct>       <int> <int>
 1 Reagan          Agriculture     1    NA
 2 Reagan          Agriculture     2     1
 3 Reagan          Social          3    NA
 4 Reagan          Social          4     1
 5 Bush            Agriculture     5    NA
 6 Bush            Agriculture     6     1
 7 Bush            Social          7    NA
 8 Bush            Social          8     1
 9 Clinton         Agriculture     9    NA
10 Clinton         Agriculture    10     1
11 Clinton         Social         11    NA
12 Clinton         Social         12     1

But if plyr is then attached in the same session (that is, after dplyr), the grouping is ignored:

require(plyr); require(dplyr)
newobject4 <- ND %>% group_by(Policy_num, presidentnumber ) %>% mutate(dydx2 = c(NA, diff(pred)))
newobject4
# A tibble: 12 x 4
# Groups:   Policy_num, presidentnumber [6]
   presidentnumber Policy_num   pred dydx2
   <fct>           <fct>       <int> <int>
 1 Reagan          Agriculture     1    NA
 2 Reagan          Agriculture     2     1
 3 Reagan          Social          3     1
 4 Reagan          Social          4     1
 5 Bush            Agriculture     5     1
 6 Bush            Agriculture     6     1
 7 Bush            Social          7     1
 8 Bush            Social          8     1
 9 Clinton         Agriculture     9     1
10 Clinton         Agriculture    10     1
11 Clinton         Social         11     1
12 Clinton         Social         12     1
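
When the groups appear to be ignored like this, one way to check which package an unqualified function name resolves to is sketched below. `which_package` is a hypothetical helper written for this illustration, not part of dplyr or plyr, and the dplyr check only runs if dplyr is installed.

```r
# Hypothetical helper (not from dplyr or plyr): report the namespace that an
# unqualified function name currently resolves to on the search path.
which_package <- function(name) {
  environmentName(environment(get(name, mode = "function")))
}

print(which_package("lag"))  # "stats" when dplyr is not attached

# Only attempt the dplyr check if dplyr is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  # "dplyr" when nothing masks it; "plyr" if plyr was attached after dplyr
  print(which_package("mutate"))
}
```

Base R's `conflicts(detail = TRUE)` gives a broader view of every masked name in the session.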

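The suggestion in the comments that plyr is involved could be true, directly or indirectly. plyr also exports a `mutate()`, and when plyr is attached after dplyr, that version masks `dplyr::mutate()` and ignores the grouping. Even if you did not load plyr yourself, another package that depends on plyr may have attached it after dplyr. To fix this, call dplyr's version explicitly: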

newobject4 <- ND %>% group_by(Policy_num, presidentnumber ) %>% dplyr::mutate(dydx2 = c(NA, diff(pred)))
newobject4
# A tibble: 12 x 4
# Groups:   Policy_num, presidentnumber [6]
   presidentnumber Policy_num   pred dydx2
   <fct>           <fct>       <int> <int>
 1 Reagan          Agriculture     1    NA
 2 Reagan          Agriculture     2     1
 3 Reagan          Social          3    NA
 4 Reagan          Social          4     1
 5 Bush            Agriculture     5    NA
 6 Bush            Agriculture     6     1
 7 Bush            Social          7    NA
 8 Bush            Social          8     1
 9 Clinton         Agriculture     9    NA
10 Clinton         Agriculture    10     1
11 Clinton         Social         11    NA
12 Clinton         Social         12     1
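
As a cross-check that does not depend on which `mutate` is active, the same grouped first difference can be sketched in base R with `ave()`. This is a swapped-in alternative rather than the dplyr approach above; it rebuilds the question's example data and assumes the rows are already in the desired order within each group.

```r
# Rebuild the example data from the question.
ND <- data.frame(
  presidentnumber = rep(c("Reagan", "Bush", "Clinton"), each = 4),
  Policy_num      = rep(rep(c("Agriculture", "Social"), each = 2), times = 3),
  pred            = 1:12
)

# First difference within each Policy_num x presidentnumber group.
# The first row of each group has no predecessor, so it gets NA.
ND$dydx2 <- ave(
  ND$pred, ND$Policy_num, ND$presidentnumber,
  FUN = function(x) c(NA, diff(x))
)

print(ND)
```

If this output has NA in the first row of every group but the dplyr pipeline does not, that points to masking rather than to the data.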
akash87