R double for loop: outer or apply?

Question

I have the following code:

a <- c(1,2,2,3,4,5,6)
b <- c(4,5,6,7,8,8,9)
data <- data.frame(cbind(a,b))
trial <- copy(data)
for (j in 1: ncol(trial)) {
  for (i in 2: nrow(trial)) {
  if (trial[i,j] == trial[i-1,j] & !is.na(trial[i,j]) & !is.na(trial[i-1,j]))  {
     trial[i,j] <- trial[i-1,j] + (0.001*sd(trial[,j], na.rm = T))
    }
 }
}

The code perfectly works, but on a larger dataset is a bit slow. I thought to improve speed by using either the apply or the outer family. The issues are:

I know how to apply a single loop with apply, but not as for 2, especially in this case, where I need to replace single values according to case-specific conditions, with another single value (the lag) plus a multiplier of the standard deviation (which is something I need to compute over the whole column;
Except for this solved question, I have no experience at all of using outer and vectorised functions instead of loops.

Try `library(data.table);f <- function(x)ifelse(x==shift(x), x + 0.001* sd(x, na.rm = TRUE), x);setDT(data)[, lapply(.SD, f), ]` — Khashaa, May 02 '16 at 15:46
@Khashaa can u explain me a bit of you solution? i'm a newbie.. the function is clear, what about the rest? [..] and the .SD — Mino, May 02 '16 at 17:38
You can learn about data.table here https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html — Khashaa, May 03 '16 at 01:19
@Khashaa thank you very much! if you want to put it as an aswer, that is what I was looking for — Mino, May 03 '16 at 08:05

score 1 · Accepted Answer · answered May 04 '16 at 00:10

1

With data.table

library(data.table)
f <- function(x)ifelse(x==shift(x), x + 0.001* sd(x, na.rm = TRUE), x)
setDT(data)[, lapply(.SD, f), ]

With dplyr

library(dplyr)
f <- function(x)ifelse(x==lag(x), x + 0.001* sd(x, na.rm = TRUE), x)
data %>%
  mutate_each(funs(f))

answered May 04 '16 at 00:10

Khashaa

7,293
2
21
37

I can't give you the point because I'm less than 15 unfortunatly – Mino May 17 '16 at 07:12
Sorry I double checked you code and there is a problem: the first observation is turned into a NA – Mino Jul 04 '16 at 09:10
add `default` value 0(or anything else) to `lag` if you are using `dplyr` as `ifelse(x==lag(x, default=0))`. – Khashaa Jul 04 '16 at 15:20
`data.table::shift` has `fill` argument. – Khashaa Jul 04 '16 at 15:21

Dave2e · Answer 2 · 2016-05-02T17:26:11.470

Does this work for you?

a <- c(1,2,2,3,4,5,6)
b <- c(4,5,6,7,8,8,9)
data <- data.frame(cbind(a,b))
trial <- data.frame(a,b)
for (j in 1: ncol(trial)) {
# Finds matching rows and add a single row shift in the results
# (diff returns n-1 elements and we want n elements) 
  matching<-!c(TRUE, diff(trial[,j]))
  trial[matching,j]<- data[matching,j]+(0.001*sd(trial[,j], na.rm = T))
}

I vectorized the inner loop, this should have a significant improvement on performance. I didn't test what would happen to the sd calculation if there were multiple matching rows.
I will leave it to others to improve this revision. The use of data.table could have additional benefits.

R double for loop: outer or apply?

2 Answers2