
I'm trying to apply a custom function to a nested dataframe.

I want to apply a machine learning algorithm to predict NA values.

After a bit of reading online, it seemed that the map function would be the most applicable here.

I have a section of code that nests the dataframe and then splits the data into a test set (data3) and a train set (data2). The test set contains all the rows where the column to be predicted is null, and the train set contains all the rows where it is not null, to be used to train the ML model.

dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
                  group_by(dma) %>%
                  nest() %>%
                  mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
                         data3 = map(data, ~filter(., is.na(mean_night_flow))))

Here is the function I intend to use:

    my_function <- function(test, train) {
      et <- extraTrees(x = train, y = train[, "mean_night_flow"], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9)
      test1 <- test
      test1[, "mean_night_flow"] <- 0
      pred <- predict(et, newdata = test1[, "mean_night_flow"])
      test1[, "mean_night_flow"] <- pred
      return(test1)
    }

I have tried the following code, however it does not work:

dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
                      group_by(dma) %>%
                      nest() %>%
                      mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
                             data3 = map(data, ~filter(., is.na(mean_night_flow))),
                             data4 = map(data3, data2, ~my_function(.x,.y)))

It gives the following error:

Error: Index 1 must have length 1, not 33

This suggests that it expects a column rather than a whole dataframe. How can I get this to work?

Many thanks

MGJ-123
  • Hi MGJ, it will be much easier to help if you provide at least a sample of your data with `dput(dmaExtendedDataNA)` or `dput(dmaExtendedDataNA[1:20,])`. You can edit your question and paste the output. You can surround it with three backticks (```) for better formatting. See [How to make a reproducible example](https://stackoverflow.com/questions/5963269/) for more info. – Ian Campbell Jun 01 '20 at 14:55
  • `lapply( data, function )` is used to apply functions to nested lists. – Daniel O Jun 01 '20 at 14:57

1 Answer


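Without testing on your data, I think you're using the wrong map function. purrr::map iterates over one argument (one list, one vector, whatever) and returns a list; its second argument is the function to apply. You are passing it two inputs (data3 and data2), so data2 is taken as the function, which is probably what triggers the "Index 1 must have length 1" error. For two inputs you need map2: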

dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
                      group_by(dma) %>%
                      nest() %>%
                      mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
                             data3 = map(data, ~filter(., is.na(mean_night_flow))),
                             data4 = map2(data3, data2, ~my_function(.x,.y)))
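To see the map2 behaviour in isolation, here is a minimal sketch on toy lists. The names test_list, train_list and describe_pair are made up for illustration and stand in for data3, data2 and my_function.

```r
library(purrr)

# Toy stand-ins for the data3 (test) and data2 (train) list-columns.
test_list  <- list(data.frame(x = 1:2), data.frame(x = 3))
train_list <- list(data.frame(x = 10:12), data.frame(x = 20:21))

# Hypothetical two-argument function: reports the size of each pair.
describe_pair <- function(test, train) {
  c(test_rows = nrow(test), train_rows = nrow(train))
}

# map2() walks both lists in parallel:
# .x comes from the first list, .y from the second.
result <- map2(test_list, train_list, ~ describe_pair(.x, .y))
```

Each element of result pairs the i-th test frame with the i-th train frame, which is the pairing you want per dma.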

If you find yourself needing more than two inputs, you need pmap. pmap also works with one or two inputs, where it is effectively the same as map or map2. The two biggest differences when migrating from map to pmap are:

  • your arguments need to be enclosed within a list, so

    map2(data3, data2, ...)
    

    becomes

    pmap(list(data3, data2), ...)
    
  • you refer to them with double-dot number position, ..1, ..2, ..3, etc, so

    ~ my_function(.x, .y)
    

    becomes

    ~ my_function(..1, ..2)
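
    As a small illustration of both points, this sketch runs pmap over three toy lists (a, b and w are hypothetical names, not columns from the question):

```r
library(purrr)

# Three parallel inputs, wrapped in a single list for pmap().
a <- list(1, 2)
b <- list(10, 20)
w <- list(100, 200)

# Inside the formula, ..1, ..2, ..3 refer to the inputs by position.
sums <- pmap(list(a, b, w), ~ ..1 + ..2 + ..3)
```

    Here sums holds one value per position, each built from the matching elements of a, b and w.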
    

An alternative that simplifies your overall flow a little is to let the function do the train/test split itself:

my_function <- function(test, train = NULL, fld = "mean_night_flow") {
  if (is.null(train)) {
    train <- test[ !is.na(test[[fld]]),, drop = FALSE ]
    test <- test[ is.na(test[[fld]]),, drop = FALSE ]
  }
  et  <- extraTrees(x = train, y = train[, fld], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
  test1 <- test
  test1[ , fld] <- 0
  pred  <- predict(et, newdata = test1[, fld])
  test1[ , fld] <- pred
  return(test1)
}

which auto-populates train based on the missingness of your field. (I also parameterized it in case you ever need to train/test on a different field.) This changes your use to

dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
                      group_by(dma) %>%
                      nest() %>%
                      mutate(data4 = map(data, ~ my_function(.x, fld = "mean_night_flow")))

(It's important to name fld=, since otherwise it will be confused for train.)

If you're planning on reusing data2 and/or data3 later in the pipe or analysis, then this step is not necessarily what you need.
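
The same nest-then-map flow can be tried end to end without extraTrees. The sketch below is only an illustration under stated assumptions: the data frame toy, the columns x and y, and the helper fill_missing are made up, and lm is swapped in as a stand-in model so the example runs with base R plus dplyr, tidyr and purrr.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up data: two groups, each with one missing y.
toy <- data.frame(
  dma = rep(c("a", "b"), each = 4),
  x   = c(1, 2, 3, 4, 1, 2, 3, 4),
  y   = c(2, 4, NA, 8, 1, NA, 3, 4)
)

# Hypothetical helper: splits on missingness of `fld`, fits a model on the
# complete rows, and returns the incomplete rows with `fld` filled in.
fill_missing <- function(df, fld = "y") {
  is_missing <- is.na(df[[fld]])
  train <- df[!is_missing, , drop = FALSE]
  test  <- df[is_missing, , drop = FALSE]
  # lm() is a stand-in here, not the extraTrees model from the question.
  fit <- lm(reformulate("x", response = fld), data = train)
  test[[fld]] <- unname(predict(fit, newdata = test))
  test
}

filled <- toy %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data4 = map(data, ~ fill_missing(.x, fld = "y")))
```

Each element of filled$data4 then contains only the rows that were missing y, with the prediction filled in.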

Note: I suspect your function is under-tested or incomplete. The fact that you assign all 0 to your test1[,"mean_night_flow"] and then pass only that column of zeroes to predict seems suspect. I might be missing something, but I would expect perhaps

  test1 <- test
  pred  <- predict(et, newdata = test1)
  test1[ , fld] <- pred
  return(test1)

(though copying to test1 is mostly unnecessary with a tibble or data.frame, since R copies the frame when you modify it inside the function and the caller's original is untouched; I would be more cautious if you were using class data.table).
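
A quick way to convince yourself of the copy behaviour for a plain data.frame is the sketch below; zero_out and the toy frame are hypothetical names used only for this check.

```r
# Hypothetical helper that overwrites a column inside the function.
zero_out <- function(df, fld) {
  df[, fld] <- 0  # modifies the function's local copy only
  df
}

original <- data.frame(id = 1:3, flow = c(5, NA, 7))
changed  <- zero_out(original, "flow")

# `original` still holds 5, NA, 7; only `changed` has the zeroes.
```

data.table is different because it can modify columns by reference (for example with `:=`), which is why the extra caution applies there.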

r2evans