I've been reading about how to improve R code, looking at some of the answers here and also reading a bit of The R Inferno. Now I have this problem, and the loop I created seems to be taking forever (15 hours and counting).

k <- NROW(unique(df$EndStation.Id))
l <- NROW(unique(df$StartStation.Id))
m1 <- as.matrix(df[,c("Duration","StartStation.Id","EndStation.Id")])
g <- function(m){
    for (i in 1:l){
        for (j in 1:k){
            duration <- m[(m[,2]==i & m[,3]==j),1]
            if (NROW(duration)<=1) {
                m[(m[,2]==i & m[,3]==j),1] <- NA
                next
            }
            duration <- duration/median(duration)
            m[(m[,2]==i & m[,3]==j),1] <- duration
        }
    }
    return(m)
}

answer <- g(m1)

There are 750 start stations and 750 end stations, and the size of the duration vector for a pair can vary a lot, from 1 or 2 up to 80. Is this loop improvable, or should I give up and try to get access to a faster computer?

Best regards, Fernando

  • Welcome to StackOverflow! Please read up on [how to ask a good question](https://stackoverflow.com/help/how-to-ask) and make your example [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Thomas K Jun 17 '17 at 10:16
  • Your loop runs 750x750 times, but you are using vector operations not scalar. – Marichyasana Jun 17 '17 at 10:29

2 Answers

The code is a bit hard to read, but I think this is what you want to do:

library(data.table)
## generate a data table
dt <- setDT(df[,c("Duration","StartStation.Id","EndStation.Id")])
## calculate the duration
dt[, Duration := Duration / median(Duration), by = .(StartStation.Id, EndStation.Id)]
## replace the result with NA when the vector length == 1
dt[, N := .N, by = .(StartStation.Id, EndStation.Id)][
    N == 1, Duration := NA
    ][, N := NULL]
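
The two steps above can probably be folded into a single grouped assignment. The following is only a sketch on made-up data: the toy table and the name `dt_clean` are assumptions for illustration, not part of the original problem. It also shows one way to drop the single-trip rows afterwards.

```r
library(data.table)

## Hypothetical toy data: three station pairs, one with a single trip
dt <- data.table(StartStation.Id = c(1, 1, 1, 2, 2, 3),
                 EndStation.Id   = c(2, 2, 2, 3, 3, 1),
                 Duration        = c(10, 20, 40, 5, 15, 7))

## Per pair: divide by the median, or set NA when the pair has only one trip
dt[, Duration := if (.N > 1) Duration / median(Duration) else NA_real_,
   by = .(StartStation.Id, EndStation.Id)]

## Optionally remove the single-trip rows (dt_clean is an illustrative name)
dt_clean <- dt[!is.na(Duration)]
```

This assumes `Duration` is stored as a double; an integer column would need converting with `as.numeric()` first.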
amatsuo_net
  • This looks so easy, I will definitely look more into data.table. Thanks for the answer. The only small difference is that I want to replace with NA when the vector length is 1 or less, but that was to avoid errors in the calculations and to be able to easily remove them later. So ideally I would edit that second part to get rid of those. – Fer Muñoz Méndez Jun 17 '17 at 11:10

If I understand your function correctly, you want to divide each duration between two stations by the median duration for that pair, and if there is only one entry for the pair of stations, set it to NA.

Here is a base solution (it's a bit clunky, I haven't finished my first cup of coffee):

## Some sample data
df <- data.frame(StartStation.Id = sample(LETTERS[1:10], 100, replace = TRUE),
                 EndStation.Id = sample(LETTERS[11:20], 100, replace = TRUE),
                 Duration = runif(100, 0.1, 100))
## Divide each pair's durations by their median; a pair with a single trip becomes NA
## (testing the result for == 1 would also hit the median trip of odd-sized groups)
res <- tapply(df$Duration, paste0(df$StartStation.Id, df$EndStation.Id),
              function(x) if (length(x) > 1) x / median(x) else NA_real_)
## Rebuild the station columns; this relies on the IDs being single characters
pair <- rep(names(res), sapply(res, length))
res <- data.frame(StartStation.Id = substr(pair, 1, 1),
                  EndStation.Id = substr(pair, 2, 2),
                  durn = unlist(res))
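
A variant that keeps the original row order and does not depend on single-character IDs could use `ave()`. This is a minimal sketch on made-up data; the `trips` data frame and the `norm` column are names chosen for illustration only.

```r
## Hypothetical toy data: three station pairs, one with a single trip
trips <- data.frame(StartStation.Id = c(1, 1, 1, 2, 2, 3),
                    EndStation.Id   = c(2, 2, 2, 3, 3, 1),
                    Duration        = c(10, 20, 40, 5, 15, 7))

## ave() applies FUN within each Start/End pair and returns the results
## aligned with the original rows; pairs with fewer than two trips get NA
trips$norm <- ave(trips$Duration, trips$StartStation.Id, trips$EndStation.Id,
                  FUN = function(x) if (length(x) > 1) x / median(x) else NA_real_)
```

Because the result lines up with the input rows, it can be written straight back into the data frame without rebuilding the station columns.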
emilliman5