
I have to remove the loops from my function. This is not easy because I have a complicated data structure and I don't know how to use the apply family with it.

First of all, my data has this structure:

List <- List DATA 1 <- List DATA 2

Inside these lists, I have other lists with TRAIN and TEST. Finally, I have data.frames at those levels. I simulate the structure of my list with the iris dataset:

data(iris)
head(iris)

iristest<-head(iris)

train<-list(iris,iris,iris)
test<-list(iristest,iristest,iristest)

list1<-list(train,test)
names(list1)<-c("train","test")


iris2<-iris
iris2[,1:4]<-iris[,1:4]+5
iristest2<-head(iris2)

train<-list(iris2,iris2,iris2)
test<-list(iristest2,iristest2,iristest2)

list2<-list(train,test)
names(list2)<-c("train","test")

flist<-list(list1,list2)
names(flist)<-c("iris","iris2")

Now, I have created a function that I want to apply to my list.

Kmax<-5
nd<-10
ks<-seq(from=1,to=Kmax,by=1)
kn<-seq(1:nd)

library(class)   # knn() comes from the class package

findKNN<-function(listdf,seeds){
  indx<-1

  outs<-matrix(0, nrow = 5*length(listdf[[1]]), ncol = 3)

  for (i in seq_along(listdf[[1]])){
    for (K in 1:5){
      train<- as.data.frame(listdf$train[i])
      test <- as.data.frame(listdf$test[i])

      set.seed(seeds)

      kpreds <- knn(train[,-ncol(train)],test[,-ncol(test)], train[,ncol(train)],k=K)
      Ktable <-table(kpreds ,test[,ncol(test)])

      outs[indx,1] <- (Ktable[1, 2] + Ktable[2, 1]) / sum(Ktable)
      outs[indx,2] <- K
      outs[indx,3] <- i
      indx<-indx+1
    }
  }

  outs<-data.frame(outs)
  names(outs)<-c("error","K","I")
  outs<-aggregate(error ~ K,outs, mean)
}

output<-lapply(flist,seeds=12345,findKNN)

But I don't know how I can run this code efficiently.

Thanks

liguang
  • Agree with @Jet that it's more for readability than efficiency, but still I think it's a good idea. The place to start is to factor out the logic that's happening in the loops, then it will be more intuitive how to replace the for loop with more R-idiomatic code. – Aaron left Stack Overflow May 03 '19 at 13:18

3 Answers

1

This is just a stab in the dark, but it seems to me the reason for the two loops is that you have structured the data as lists inside a list? Possibly lists inside lists inside a list? To me this seems to be a bigger issue than the for loops not being efficient.

Just an idea, but maybe restructure how your data is stored into something like a map, where you can relate values to keys. For example, you have a map with the keys "list1" and "list2", and every value in the map is paired with its key. Then you only need one for loop, with an if that says: if the key matches what I want, take the data. Just a thought.
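
For illustration only, here is a rough sketch of the kind of keyed structure I mean, built from the simulated objects in your question; the keys ("list1", "list2") and the loop body are made up, just to show the idea:

## a flat "map": one entry per key, each value holding that dataset's train/test pieces
datasets <- list(
  list1 = list(train = list(iris, iris, iris),    test = list(head(iris), head(iris), head(iris))),
  list2 = list(train = list(iris2, iris2, iris2), test = list(head(iris2), head(iris2), head(iris2)))
)

## one loop over the keys; an if picks out the data whose key matches what you want
for (key in names(datasets)) {
  if (key == "list2") {
    dat <- datasets[[key]]
    ## ... work with dat$train and dat$test here ...
  }
}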

Deja
  • Could you explain it? I have multiple data sets. For each one I have 5 training sets and 5 test sets. I thought this form would be best, but if you know something better, your answer would be welcome. – liguang May 03 '19 at 13:28
1

The apply functions actually do not have an efficiency advantage over for loops anymore, according to this thread.

If your goal is only to decrease runtime, then there may be no point converting the loops to apply functions. The advantage of these functions is now mainly to produce more readable code.
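
For a rough sanity check (timings vary by machine and R version), you can compare a for loop that fills a preallocated vector with sapply doing the same work; the two usually land in the same ballpark:

x <- runif(1e5)

## for loop writing into a preallocated result vector
out1 <- numeric(length(x))
system.time(for (i in seq_along(x)) out1[i] <- sqrt(x[i]))

## the same work with sapply
system.time(out2 <- sapply(x, sqrt))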

Jet
1

The place to start is by factoring your code into chunks, where each new function works on one level of the data. Then each piece can call the one below it, and you can collect the results in a more idiomatic way.

Here I made functions for 1) the core code for each train/test pair, 2) repeating that for each desired K, and 3) repeating that across the possible pairs.

I agree with @Deja that restructuring your data to a more "tidyverse" style method could result in even more intuitive code, but if you're not used to thinking in that way, this is probably clearer.

## run core code for a particular train/test pair
run1 <- function(train, test, K, seeds) {
  set.seed(seeds)  
  train <- as.data.frame(train)
  test <- as.data.frame(test)
  kpreds <- class::knn(train[, -ncol(train)], test[, -ncol(test)], train[, ncol(train)], k = K)
  Ktable <- table(kpreds, test[, ncol(test)])
  (Ktable[1, 2] + Ktable[2, 1]) / sum(Ktable)
}
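
## quick illustrative check, using the simulated flist from the question:
## a single train/test pair at one value of K gives back one error rate
run1(flist$iris$train[[1]], flist$iris$test[[1]], K = 3, seeds = 12345)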

## run a particular train/test pair at several values of K
runK <- function(train, test, Ks, seeds) {
  errors <- sapply(Ks, function(K) run1(train, test, K, seeds))
  data.frame(K=Ks, error=errors)
}

## test several train/test pairs, at several values of K
findKNN <- function(df, Ks=1:5, seeds){
  stopifnot(length(df$train)==length(df$test))
  out <- lapply(seq_along(df$train), function(i) {
    cbind(i=i, runK(df$train[[i]], df$test[[i]], Ks, seeds))
  })
  out <- do.call(rbind, out)
  aggregate(error ~ K, out, mean)
}

## loop over several sets of data
output <- lapply(flist, seeds=12345, findKNN)
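
If you want to sanity-check the pieces on the simulated flist before looping over everything, you can call them directly; runK gives the error at each K for one train/test pair, and findKNN gives the mean error per K for one data set. Treat this only as a usage sketch; the numbers depend on your actual data.

## one train/test pair across several values of K
runK(flist$iris$train[[1]], flist$iris$test[[1]], Ks = 1:5, seeds = 12345)

## one whole data set: a data.frame of mean error per K
findKNN(flist$iris, Ks = 1:5, seeds = 12345)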

To put the data in a more "tidy" format, you'd have one row per test/train pair with additional columns for which data set and which rep it is. A little awkward to get there from what you started with, but here's what it would look like.

n <- sapply(lapply(flist, `[[`, "train"), length)
ftrain <- do.call(c, lapply(flist, `[[`, "train"))
ftest <- do.call(c, lapply(flist, `[[`, "test"))
nn <- rep(names(n), n)
ii <- unlist(lapply(n, function(i) seq_len(i)))
library(tidyverse)
alld <- tibble(data=nn, i=ii, train=ftrain, test=ftest)
alld
## # A tibble: 6 x 4
##   data      i train                  test                
##   <chr> <int> <list>                 <list>              
## 1 iris      1 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 2 iris      2 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 3 iris      3 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 4 iris2     1 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 5 iris2     2 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 6 iris2     3 <data.frame [150 x 5]> <data.frame [6 x 5]>

You'd then loop through each row. (Note that, to make this work, I had to make the result of runK a data.frame.)

out <- alld %>% mutate(error=map2(train, test, runK, Ks=1:5, seeds=12345))
out
## # A tibble: 6 x 5
##   data      i train                  test                 error               
##   <chr> <int> <list>                 <list>               <list>              
## 1 iris      1 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 2 iris      2 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 3 iris      3 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 4 iris2     1 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 5 iris2     2 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 6 iris2     3 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>

You then take out the original data, "unnest" the error data.frame, and summarize over data set and K.

out %>% select(-train, -test) %>% unnest() %>% 
  group_by(data, K) %>% summarize(error=mean(error))
## # A tibble: 10 x 3
## # Groups:   data [?]
##    data      K error
##    <chr> <int> <dbl>
##  1 iris      1     0
##  2 iris      2     0
##  3 iris      3     0
##  4 iris      4     0
##  5 iris      5     0
##  6 iris2     1     0
##  7 iris2     2     0
##  8 iris2     3     0
##  9 iris2     4     0
## 10 iris2     5     0
Aaron left Stack Overflow
  • Thanks, it works. Could you explain "restructuring your data to a more "tidyverse" style method"? Thanks – liguang May 03 '19 at 14:06
  • Yes, it works very well, thanks! I didn't know about this option. The problem is I don't know if it is faster... – liguang May 03 '19 at 21:29
  • 1) Only worry about speed if it's too slow. It's often faster to let the computer do it the slow way than for you to figure out a faster way. 2) If it is too slow, make sure you know where the slow part is. In this code, it's likely that the slow part is in the knn function, and any time difference from any of these three methods would be minimal. – Aaron left Stack Overflow May 03 '19 at 22:20
  • Yes, thanks! Could I speak with you in private about this code? Thanks – liguang May 03 '19 at 22:34