I've just started learning R this week, so I'm pretty bad at it. I've written a function that receives three parameters, and I want to perform the following operation:

for (k in 1:nrow(df_t)){
  df_t$colv[k] = link_targets(data = df,
                              target_date = df_t$mtime[k],
                              tag = tag)
}

So basically what I'm trying to do is apply a function to each element of a certain column of df_t, and the value the function returns depends on another column of that same data frame. (The function returns a scalar value).

I was wondering if this could be vectorized to avoid using the loop, which appears to be really slowing down the code.

Let me know if you need any further information to help me out with this.

EDIT:

The function I call in the loop is the following:

link_targets = function (data, target_date, tag){
  # Delete all rows that don't have the tag as name
  data[data$NAME != as.character(unlist(tag[1])),] = NA
  data = na.omit(data)
  # Delete all rows that do not correspond to the dates of the tag
  limit_time_1 = target_date - as.numeric(60 * tag[2] - 60)
  limit_time_2 = target_date - as.numeric(60 * tag[3])
  data[(data$IP_TREND_TIME < min(limit_time_1,limit_time_2))
       | (data$IP_TREND_TIME > max(limit_time_1,limit_time_2)),] = NA
  data = na.omit(data)
  mean_data = mean(as.numeric(data$IP_TREND_VALUE))
  return(mean_data)
}

I'm working with data tables. df is like this:

             NAME       IP_TREND_TIME IP_TREND_VALUE
       1: TC241-1 2018-03-06 12:05:31      194.57875
       2: TC241-1 2018-03-05 17:54:05       196.5219
       3: TC241-1 2018-03-05 05:02:18       211.4066
       4: TC241-1 2018-03-04 03:06:57      211.92874
       5: TC241-1 2018-03-03 06:41:17      205.43651
      ---                                           
13353582: DI204-4 2017-04-06 17:43:41     0.88308918
13353583: DI204-4 2017-04-06 17:43:31     0.88305187
13353584: DI204-4 2017-04-06 17:43:21     0.88303399
13353585: DI204-4 2017-04-06 17:43:11     0.88304734
13353586: DI204-4 2017-04-06 17:43:01     0.88305187

The tag array contains the word I want to look for in the column NAME, and two numbers that represent the time range I want. So for example:

     tag  start end
1 TC204-1    75 190
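To make the time window concrete, here is a small sketch of the arithmetic that `link_targets` performs for this example tag. The target time is taken from the first row of the sample output, and the UTC time zone is an assumption made only so the example is reproducible.

```r
# Example tag from above; the target time and time zone are illustrative assumptions.
tag <- data.frame(tag = "TC204-1", start = 75, end = 190)
target_date <- as.POSIXct("2018-03-05 05:35:00", tz = "UTC")

# Same arithmetic as in link_targets(): offsets are in seconds.
limit_time_1 <- target_date - (60 * tag$start - 60)  # 4440 s (74 min) before target
limit_time_2 <- target_date - 60 * tag$end           # 11400 s (190 min) before target

# The window runs from the earlier limit to the later one.
window <- range(c(limit_time_1, limit_time_2))
```

So for `start = 75` and `end = 190`, the rows that get averaged are those recorded between 190 and 74 minutes before the target time.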

The output I'm looking for (df_t) would be something like this:

              TREND_TIME TREND_VALUE         colv 
  1: 2018-03-05 05:35:00   1.9300001     16.86248 
  2: 2018-03-05 02:21:00        1.95     18.04356 
  3: 2018-03-04 22:35:00        1.98     17.85405 
  4: 2018-03-04 17:01:00           2     17.87318 
  5: 2018-03-04 12:49:00        2.05     18.10455
 ---                                                      
940: 2017-04-07 15:01:00   2.1500001     20.14933 
941: 2017-04-07 09:27:00         1.9     20.19337    
942: 2017-04-07 04:46:00        1.95     20.20166    
943: 2017-04-07 01:34:00   2.0699999     20.20883    
944: 2017-04-06 21:46:00         1.9     20.15735 

Here, colv contains the mean of all IP_TREND_VALUE values that match the selected tag and fall within the time range determined by the numbers in tag, measured relative to the time in TREND_TIME of df_t.

Tendero
  • The `apply` family of functions (and `purrr::map`) does not vectorize the given task; the iteration (loop) is just _hidden under the hood_. See this post [Why is apply() method slower than a for loop in R?](https://stackoverflow.com/questions/5533246/why-is-apply-method-slower-than-a-for-loop-in-r) and Hadley's answer: _The point of the apply ... functions is not speed, but expressiveness_. If you want to speed up your code, I would recommend improving the `link_targets` function: find which part is taking most of the time. – pogibas Apr 12 '18 at 16:27
  • @PoGibas I see, thank you for the clarification! That was very useful. – Tendero Apr 12 '18 at 16:30
  • `sapply(df_t$mtime,link_targets,data=df,tag=tag)` – Onyambu Apr 12 '18 at 16:36
  • can you give us `link_targets` function? – minem Apr 13 '18 at 07:27
  • @minem I've added the function. – Tendero Apr 13 '18 at 13:20
  • @Tendero See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and add some example data, along with what you want your result to look like. – minem Apr 13 '18 at 13:24
  • @minem There you go. Tell me if there's something missing to clarify. – Tendero Apr 13 '18 at 13:47
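The `sapply` one-liner from the comments can be checked against the original loop on toy data. The sketch below is only an illustration: `toy_link` is a hypothetical stand-in for `link_targets`, the data frames are made up, and `vapply` is swapped in for `sapply` so that the return type is fixed to one number per element. As the comments point out, this makes the code shorter but not necessarily faster.

```r
# Hypothetical stand-in for link_targets(): returns one number per target time.
toy_link <- function(data, target_date, tag) {
  keep <- data$NAME == tag & data$IP_TREND_TIME <= target_date
  mean(data$IP_TREND_VALUE[keep])
}

# Made-up data with numeric times to keep the example small.
df <- data.frame(NAME = c("A", "A", "B", "A"),
                 IP_TREND_TIME = c(1, 2, 3, 4),
                 IP_TREND_VALUE = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)
df_t <- data.frame(mtime = c(2, 4))

# Original pattern: explicit loop over the rows of df_t.
colv_loop <- numeric(nrow(df_t))
for (k in seq_len(nrow(df_t))) {
  colv_loop[k] <- toy_link(data = df, target_date = df_t$mtime[k], tag = "A")
}

# Same computation without the explicit loop; each element of mtime
# is passed as target_date, the first argument not supplied by name.
colv_vapply <- vapply(df_t$mtime, toy_link, numeric(1), data = df, tag = "A")
```

Both versions call the function once per row, so the cost of each call still dominates the run time.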

1 Answer

It is hard to suggest a better solution because I find your logic and explanation hard to follow. Maybe you could create a smaller, clearer example that shows what you are trying to accomplish.

But you should be able to replace the link_targets function with this one:

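link_targets <- function(data, target_date, tag) {
  # Both ends of the time window, each a number of seconds before target_date
  limit_time_1 <- target_date - as.numeric(60 * tag[2] - 60)
  limit_time_2 <- target_date - as.numeric(60 * tag[3])
  x <- c(limit_time_1, limit_time_2)
  # Logical masks: rows with the right name, rows inside the time window
  i1 <- data$NAME == as.character(unlist(tag[1]))
  i2 <- (data$IP_TREND_TIME >= min(x)) & (data$IP_TREND_TIME <= max(x))
  # Average only the selected values, without modifying data
  mean_data <- mean(as.numeric(data$IP_TREND_VALUE[i1 & i2]))
  return(mean_data)
}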

and see a large speed improvement.
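As a quick sanity check on made-up data, the sketch below compares the two filtering styles: blanking rows with NA and calling `na.omit` (as in the question) versus indexing with a logical mask (as in the function above). The column name `VALUE` and the data are illustrative only.

```r
# Made-up data for illustration.
df <- data.frame(NAME = c("A", "B", "A", "A"),
                 VALUE = c(1, 2, 3, 5),
                 stringsAsFactors = FALSE)

# Style from the question: overwrite unwanted rows with NA, then drop them.
# Each step produces a modified copy of the whole data frame.
tmp <- df
tmp[tmp$NAME != "A", ] <- NA
tmp <- na.omit(tmp)
mean_copy <- mean(tmp$VALUE)

# Style from the answer: build a logical mask and index one column with it.
mask <- df$NAME == "A"
mean_mask <- mean(df$VALUE[mask])
```

Both give the same mean here; the mask version only touches the column it needs.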

Update

Maybe this function will be even faster on your particular data:

link_targets2 <- function(data, target_date, tag) {
  limit_time_1 <- target_date - as.numeric(60 * tag[[2]] - 60)
  limit_time_2 <- target_date - as.numeric(60 * tag[[3]])
  x <- c(limit_time_1, limit_time_2)
  i1 <- data$NAME == as.character(unlist(tag[1]))
  # Subset the times by name first, so the window test runs on fewer rows
  xx <- data$IP_TREND_TIME[i1]
  i2 <- (xx >= min(x)) & (xx <= max(x))
  mean_data <- mean(as.numeric(data$IP_TREND_VALUE[i1][i2]))
  return(mean_data)
}
minem
  • That script runs 6 times faster, thanks! I guess that directly indexing the rows instead of deleting the ones that are not needed does the trick. – Tendero Apr 16 '18 at 13:13
  • @Tendero yes, in your function you were copying the data frame twice in each iteration, which is quite time consuming. Of course, there could be further improvements. – minem Apr 16 '18 at 13:16
  • One last question. Why do you use `min(x)` and `max(x)` instead of just `limit_time_1` and `limit_time_2` to calculate `i2`? Isn't this slightly slower, as you are creating an auxiliary array and calling two functions unnecessarily? – Tendero Apr 16 '18 at 13:46
  • @Tendero because creating the vector `x` takes practically no time, but it increases the readability of the code (less duplication) – minem Apr 16 '18 at 13:52
  • I'll try that out, too. Why is it that you think it may run faster? – Tendero Apr 16 '18 at 14:12
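One possible further improvement, which is not part of the answer above, is to drop the per-row function call entirely. This is a different technique: sort the readings for the tag once, take a cumulative sum, and use `findInterval` to find where each window starts and ends. The sketch below is a minimal illustration under assumptions: `window_means` is a hypothetical helper, the data are made up, and the rows are assumed to be already filtered to a single tag (for example with one `data$NAME == ...` mask applied before the call).

```r
# Hypothetical helper (not from the answer): mean of `values` whose `times`
# fall in [lower[i], upper[i]], computed for all windows i at once.
window_means <- function(times, values, lower, upper) {
  ord <- order(times)
  t_sorted <- as.numeric(times[ord])
  csum <- c(0, cumsum(values[ord]))  # csum[k + 1] = sum of the k earliest values
  hi <- findInterval(as.numeric(upper), t_sorted)                    # count of times <= upper
  lo <- findInterval(as.numeric(lower), t_sorted, left.open = TRUE)  # count of times <  lower
  (csum[hi + 1] - csum[lo + 1]) / (hi - lo)  # NaN for empty windows, like mean(numeric(0))
}

# Made-up data: one reading every 10 seconds for a single tag.
t0 <- as.POSIXct("2018-03-05 00:00:00", tz = "UTC")
times <- t0 + seq(0, by = 10, length.out = 100)
values <- as.numeric(1:100)

# Two made-up windows, already derived from target times and tag offsets.
lower <- t0 + c(100, 500)
upper <- t0 + c(200, 505)

fast <- window_means(times, values, lower, upper)

# Reference: one logical mask per window, as in the answer's function.
slow <- vapply(seq_along(lower), function(i) {
  mean(values[times >= lower[i] & times <= upper[i]])
}, numeric(1))
```

Because the means come from differences of cumulative sums, results can differ from `mean()` by floating-point rounding on long series, so compare with `all.equal` rather than `==`.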