Vectorizing or for loop or apply function?

Question

I have a dataframe with 6 columns. Column 1 holds dates, column 2 individuals and column 3 to 6 the are used for a calcuation.

Date <- c(1, 1, 2, 2, 2, 3)
Ind <- c("a","a","a","b","c","c")
C <- c(5, 6, 5, 7, 8, 8)
D <- c(8, 8, 9, 9, 9, 9)
E <- c(8, 9, 11, 10, 9, 7)
F <- c(5, 6, 8, 5, 7, 4)

df <- data.frame(Date, Ind, C, D, E, F)

I want to perform a calculation (like (C-E)+(D-F) (in real life those are coordinates and I'm calculating distances, but that's not where my problem is right now).

I want to perform the calculations, stored in a new column (G), with a 1 day difference between where I use the value of column C and E from day 1, and the values of column E and F from day+1 for the same individual.

I'm not sure if I should use a loop or an apply function. This is what I've tried so far, a vectorized operation and subsetting, based on this thread: Loop over rows of dataframe applying function with if-statement

df$G <- NA

df[!(df$Date ==(df$Date+1)), "G"] <- ((C-E)+(D-F))

This works, but it does the calculations on the coordinates from the same row (C, D, E, F all from same row). I realize why it does this, as I don't state from which row to take the coordinates. C and D need to be taken from row where Date = Date, and E and F from row where Date = (Date+1). I realize it, but I can't get my head around it how to do that.

Continue this route? Do it in a for loop? Using apply function?

Distances are usually L1: `abs(C-D)+ abs(E-F)` or L2: `(C-D)^2 + (E-F)^2`, i.e always positive. — IRTFM, Dec 02 '14 at 06:15
@egilio Your description is a bit confusing. Could you show the expected output for this example data. — akrun, Dec 02 '14 at 07:55

Tommy O'Dell · Answer 1 · 2014-12-02T08:01:58.950

0

The dplyr package provides very nice lag and lead functions.

> library(dplyr)
> df %>% mutate(G = C + D + lead(E,1) + lead(F,1))
  Date Ind C D  E F  G
1    1   a 5 8  8 5 28
2    1   a 6 8  9 6 33
3    2   a 5 9 11 8 29
4    2   b 7 9 10 5 32
5    2   c 8 9  9 7 28
6    3   c 8 9  7 4 NA

G is NA for the last row because there is no next Date value.

Edit:

As others have mentioned, it looks like your example data has two Dates for Ind==a. You may want to be careful doing lead/lag in that situation.

If it makes sense to do so, you could aggregate them first, before doing lead/lag.

df %>% group_by(Date,Ind) %>% 
  summarise(C=mean(C),D=mean(D),E=mean(E),F=mean(F)) %>%
  ungroup %>%
  mutate(G = C + D + lead(E,1) + lead(F,1))

Which produces:

  Date Ind   C D    E   F    G
1    1   a 5.5 8  8.5 5.5 32.5
2    2   a 5.0 9 11.0 8.0 29.0
3    2   b 7.0 9 10.0 5.0 32.0
4    2   c 8.0 9  9.0 7.0 28.0
5    3   c 8.0 9  7.0 4.0   NA

edited Dec 02 '14 at 08:01

answered Dec 02 '14 at 07:38

Tommy O'Dell

7,019
13
56
69

The `Date` column is same for `3:5` and `1:2` rows. Also, it seems to be grouped by `Ind`. So, I am not sure this is correct. – akrun Dec 02 '14 at 07:45
Thanks for pointing that out. I'm assuming that that wasn't an intended feature of the data as the OP didn't mention how that would be handled. – Tommy O'Dell Dec 02 '14 at 07:55
I was looking at OP's code `df$Date ==(df$Date+1)`. May be you are right, but the example and description is confusing. – akrun Dec 02 '14 at 07:56

Vectorizing or for loop or apply function?

1 Answers1

Edit: