nested map functions with purrr

Question

I need to perform knn regression with bootstrapping and iterate over different values of K.

Say I have two data frames, train_data and test_data:

library(tidyverse)

train_data <- read.csv("train.csv")
test_data <- read.csv("test.csv")

And a function knn, which looks like this:

knn <- function(train_data, train_label, test_data, K){

  len_train <- nrow(train_data)

  # predict one test point: mean label of its K nearest training points
  k_means <- function(test_pt){

    # row 1 of the distance matrix holds the distances from test_pt to every training row
    distances <- as.matrix(dist(rbind(test_pt, train_data)))[1, (1+1):(1+len_train)]
    data.frame(y = train_label) %>%
      mutate(pt_dist = distances) %>%
      arrange(pt_dist) %>%
      select(y) %>%
      slice(1:K) %>% pull() %>% mean()
  }

  # one prediction per row of test_data
  predictions <- apply(test_data, 1, k_means)
  return(predictions)

}

Here train_data is a data frame of predictor columns, train_label is a vector of the training response values, and test_data is a data frame with the same predictor columns as train_data.

The function returns the predicted label for each row of test_data.
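
For a single test point, the prediction rule boils down to "average the labels of the K closest training points". A minimal base R sketch of that rule, assuming one predictor and absolute distance; the vectors `x`, `y` and `x_new` are made-up toy values, not the real data:

```r
# Toy training data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 10)
y <- c(10, 20, 30, 100)
x_new <- 2.4   # a single test point
K <- 2

# distance from the test point to every training point
d <- abs(x - x_new)

# indices of the K nearest training points
nearest <- order(d)[1:K]

# knn regression prediction: mean label of the K nearest neighbours
pred <- mean(y[nearest])
pred   # 25, the mean of the labels at x = 2 and x = 3
```

With several predictors the only change would be the distance computation, which is what the dist() call in knn takes care of.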

Next, I wrote a function to generate bootstrapped samples:

gen_boot_sample <- function(df, sample_size = 25){
  df %>% sample_n(sample_size, replace = TRUE)
}
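
Sampling rows with replacement is what makes this a bootstrap sample, and it is also why sample_size may exceed the number of rows. A rough base R counterpart, as a sketch only: `boot_base` and the `toy` data frame are hypothetical names introduced here for illustration.

```r
# Hypothetical base R counterpart of gen_boot_sample()
boot_base <- function(df, sample_size = 25) {
  # draw row indices with replacement, so the same row can appear several times
  idx <- sample(nrow(df), size = sample_size, replace = TRUE)
  df[idx, , drop = FALSE]
}

set.seed(1)  # only to make the illustration reproducible
toy <- data.frame(x1 = 1:10, y = (1:10)^2)
b <- boot_base(toy)
nrow(b)   # 25 rows, even though toy has only 10
```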

I managed to write something that applies the knn function to the generated bootstrapped samples for a fixed value of K.

However, I'm struggling to iterate over K.

The idea is to generate a data frame that contains the error value of each bootstrapped sample (say 20 samples) for each value of K.
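
One way to picture the target is a long table with one row per (sample, K) pair. The following is only a sketch of that shape under stated assumptions: `knn_simple` is a hypothetical single-predictor stand-in for the knn function above, the toy data are made up, and the error is taken to be the sum of squared differences.

```r
library(purrr)
library(tidyr)
library(dplyr)

# Hypothetical single-predictor stand-in for knn(), base R only
knn_simple <- function(train_x, train_y, test_x, K) {
  sapply(test_x, function(pt) {
    nearest <- order(abs(train_x - pt))[1:K]
    mean(train_y[nearest])
  })
}

set.seed(42)
toy_train <- data.frame(x1 = 1:20, y = 2 * (1:20))
toy_test  <- data.frame(x1 = c(2.5, 7.5, 15.5), y = c(5, 15, 31))

# one bootstrap sample per list element
boots <- map(1:4, ~ toy_train[sample(nrow(toy_train), 25, replace = TRUE), ])

# every combination of sample id and K, one row each
errors <- expand_grid(sample_id = seq_along(boots), K = 1:3) %>%
  mutate(error = map2_dbl(sample_id, K, function(i, k) {
    b <- boots[[i]]
    pred <- knn_simple(b$x1, b$y, toy_test$x1, K = k)
    sum((pred - toy_test$y)^2)   # sum of squared errors
  }))

errors
```

Here errors has 4 x 3 = 12 rows with columns sample_id, K and error, which is a convenient shape for grouping by K or plotting.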

test_label <- test_data %>%
  select_at(.vars = vars(contains("y"))) %>%
  pull()

rerun(5, gen_boot_sample(train_data)) %>%
  map(~ knn(
    train_data = .x %>%
      select_at(.vars = vars(contains("x"))),
    train_label = .x %>%
      select_at(.vars = vars(contains("y"))) %>%
      pull(),
    test_data = test_data %>%
      select_at(.vars = vars(contains("x"))),
    K = 5
  )) %>%
  map(~ sum((. - test_label)^2))

I checked the answers to "purrr map equivalent of nested for loop", but I'm still struggling because of how my knn function takes its arguments.

Edit: adding part of the data

train_data <- structure(list(x1 = c(1973.5, 1967.5, 1970.5, 1978, 1964, 1962, 
1980, 1961.5, 1976.5, 1979.5), y = c(6.57, 1.83, 3.69, 11.88, 
0.92, 0.72, 16.2, 0.92, 8.28, 14.85)), row.names = c(28L, 16L, 
22L, 37L, 9L, 5L, 41L, 4L, 34L, 40L), class = "data.frame")

test_data <- structure(list(x1 = c(1978.75, 1962.75, 1974.25, 1975.75, 1963.75, 
1972.75, 1968.25, 1980.75, 1979.25, 1970.75), y = c(8.91, 0.6, 
6.39, 6.12, 0.77, 4.41, 2.07, 11.61, 12.96, 3.6)), row.names = c(38L, 
6L, 29L, 32L, 8L, 26L, 17L, 42L, 39L, 22L), class = "data.frame")

Sorry, p2_train/test are train_data and test_data. Edited the post — rangeelo, Aug 22 '19 at 14:28
I've added sample data and the code for my implementation of knn function — rangeelo, Aug 22 '19 at 14:50
Have you removed the `train_label`? `knn(train_data, train_label, test_data, K = 5)` gives `Error in eval_tidy(xs[[i]], unique_output) : object 'train_label' not found` — akrun, Aug 22 '19 at 14:52
Sorry, I removed the part of the code that runs knn once. Please run the last part, where I map knn over the generated samples — rangeelo, Aug 22 '19 at 14:55
In addition to the `tidyverse` packages, do I need to load any other package? — akrun, Aug 22 '19 at 14:56
I can't reproduce any error with the sample data. Have you tried with the example? — akrun, Aug 22 '19 at 15:03
There are no errors. I'm just not able to iterate (with purrr) over different values of K (say 1:10). In my code the K value is fixed and I can't figure out how to iterate — rangeelo, Aug 22 '19 at 15:05
So you need another loop, right? If that is the case, do a `map` over `k` — akrun, Aug 22 '19 at 15:09
Do you need `rerun(5, gen_boot_sample(train_data)) %>% map(~ {train_data <- .x %>% select_at(vars(contains('x'))); train_label = .x %>% select_at(.vars = vars(contains("y"))) %>% pull(); test_data = test_data %>% select_at(.vars = vars(contains("x"))); map_dbl(1:10, ~ {out <- knn(train_data, train_label, test_data, K = .x); sum(out - test_label)^2})})` — akrun, Aug 22 '19 at 15:14
Not directly the point, but decluttering code helps in debugging: the whole set of code you've got for making `test_label` actually just boils down to `test_data$y`. Is there a situation where you need this to scale in some way, and therefore the `select_at` might be necessary? — camille, Aug 22 '19 at 15:30
@akrun that's very helpful, thanks! I'm still trying to make sense of the syntax and your use of curly braces inside map - never seen something like this before — rangeelo, Aug 22 '19 at 15:40
@camille well I do have multiple x-vars (x1, x2, ..) and copied the code to make the syntax consistent for `y`. You're right however, thank you — rangeelo, Aug 22 '19 at 15:42
That makes sense for the x variables, although those could also be simplified with `select(.x, contains("x"))` — camille, Aug 22 '19 at 15:46
@camille I'm still a tidyverse beginner and didn't know it could be done that way. Well, TIL :) — rangeelo, Aug 22 '19 at 15:56

score 1 · Accepted Answer · answered Aug 22 '19 at 15:42

We can nest a second map inside the outer map to iterate over the different values of K:

library(tidyverse)

rerun(5, gen_boot_sample(train_data)) %>%
  map(~ {
    # split the bootstrap sample into predictors and labels
    train_data <- .x %>%
      select_at(vars(contains("x")))
    train_label <- .x %>%
      select_at(vars(contains("y"))) %>%
      pull()
    test_data <- test_data %>%
      select_at(vars(contains("x")))

    # loop over the values of K; inside this lambda .x is the current K
    map_dbl(1:10, ~ {
      # apply the knn function
      out <- knn(train_data, train_label, test_data, K = .x)
      # sum of squared errors against the true test labels
      sum((out - test_label)^2)
    })
  })
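
In the nested call, `.x` means different things in the two lambdas: in the outer map it is one bootstrap sample, in the inner map_dbl it is one value of K. The braces only allow several statements inside a formula lambda, and the last expression is the return value. A small stand-alone sketch of the pattern with made-up numbers, followed by one possible way to reshape the resulting list of vectors into a data frame (the list `outer_vals` and the column names are hypothetical choices for illustration):

```r
library(purrr)

outer_vals <- list(a = c(1, 2, 3), b = c(4, 5, 6))   # stands in for the bootstrap samples

res <- map(outer_vals, ~ {
  vals <- .x                 # outer .x: one element of outer_vals
  map_dbl(1:2, ~ {
    k <- .x                  # inner .x: one value of the inner sequence
    sum(vals) * k            # last expression is the lambda's return value
  })
})
# res$a is c(6, 12), res$b is c(15, 30)

# one row per (outer element, inner value)
res_df <- do.call(
  rbind,
  imap(res, ~ data.frame(sample = .y, K = seq_along(.x), error = .x))
)
res_df
```

Saving the outer `.x` under its own name before entering the inner lambda, as the code above does with train_data and train_label, is what keeps the two meanings of `.x` from colliding.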
