1

I have the following code:

a <- c(1,2,2,3,4,5,6)
b <- c(4,5,6,7,8,8,9)
data <- data.frame(cbind(a,b))
trial <- copy(data)
for (j in 1: ncol(trial)) {
  for (i in 2: nrow(trial)) {
  if (trial[i,j] == trial[i-1,j] & !is.na(trial[i,j]) & !is.na(trial[i-1,j]))  {
     trial[i,j] <- trial[i-1,j] + (0.001*sd(trial[,j], na.rm = T))
    }
 }
}

The code perfectly works, but on a larger dataset is a bit slow. I thought to improve speed by using either the apply or the outer family. The issues are:

  1. I know how to apply a single loop with apply, but not as for 2, especially in this case, where I need to replace single values according to case-specific conditions, with another single value (the lag) plus a multiplier of the standard deviation (which is something I need to compute over the whole column;
  2. Except for this solved question, I have no experience at all of using outer and vectorised functions instead of loops.
Community
  • 1
  • 1
Mino
  • 55
  • 10
  • 1
    Try `library(data.table);f <- function(x)ifelse(x==shift(x), x + 0.001* sd(x, na.rm = TRUE), x);setDT(data)[, lapply(.SD, f), ]` – Khashaa May 02 '16 at 15:46
  • @Khashaa can u explain me a bit of you solution? i'm a newbie.. the function is clear, what about the rest? [..] and the .SD – Mino May 02 '16 at 17:38
  • You can learn about data.table here https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html – Khashaa May 03 '16 at 01:19
  • @Khashaa thank you very much! if you want to put it as an aswer, that is what I was looking for – Mino May 03 '16 at 08:05

2 Answers2

1

With data.table

library(data.table)
f <- function(x)ifelse(x==shift(x), x + 0.001* sd(x, na.rm = TRUE), x)
setDT(data)[, lapply(.SD, f), ]

With dplyr

library(dplyr)
f <- function(x)ifelse(x==lag(x), x + 0.001* sd(x, na.rm = TRUE), x)
data %>%
  mutate_each(funs(f))
Khashaa
  • 7,293
  • 2
  • 21
  • 37
0

Does this work for you?

a <- c(1,2,2,3,4,5,6)
b <- c(4,5,6,7,8,8,9)
data <- data.frame(cbind(a,b))
trial <- data.frame(a,b)
for (j in 1: ncol(trial)) {
# Finds matching rows and add a single row shift in the results
# (diff returns n-1 elements and we want n elements) 
  matching<-!c(TRUE, diff(trial[,j]))
  trial[matching,j]<- data[matching,j]+(0.001*sd(trial[,j], na.rm = T))
}

I vectorized the inner loop, this should have a significant improvement on performance. I didn't test what would happen to the sd calculation if there were multiple matching rows.
I will leave it to others to improve this revision. The use of data.table could have additional benefits.

Dave2e
  • 22,192
  • 18
  • 42
  • 50