Omitting NAs and empty data frames with map_if

Question

I'm working with a tibble like below:

ex <- structure(list(rowid = c(4L, 5L, 6L, 9L, 10L), timestamp = structure(c(1502480694.03336, 
1502480695.44736, 1502480696.03336, 1502480703.99836, 1502480706.19936
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), cat = c(32L, 
1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
    var2 = c(0, 50, 29.7, 51, 70.8), var3 = c(NA, 26.3, 24, 20.5, 
    12), order = c(NA, 1L, 1L, 1L, 1L), bfr = list(NA, structure(list(
        rowid = integer(0), timestamp = structure(numeric(0), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = integer(0), var1 = structure(integer(0), .Label = "1", class = "factor"), 
        var2 = numeric(0), var3 = numeric(0), order = integer(0)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = integer(0)), structure(list(
        rowid = 5L, timestamp = structure(1502480695.44736, class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = 1L, var1 = structure(NA_integer_, .Label = "1", class = "factor"), 
        var2 = 50, var3 = 26.3, order = 1L), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -1L)), structure(list(
        rowid = 5:8, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = c(1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44), var3 = c(26.3, 24, 8.9, 
        12.4), order = c(1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -4L)), structure(list(
        rowid = 5:9, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336, 
        1502480703.99836), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
        cat = c(1L, 1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44, 51), var3 = c(26.3, 24, 
        8.9, 12.4, 20.5), order = c(1L, 1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -5L)))), row.names = c(4L, 
5L, 6L, 9L, 10L), class = "data.frame")

I want to summarise nested tibbles in column bfr with map. To omit unnecessary calculations, I want to go with map_if which would skip the row when bfr contains less than 2 rows with cat == 1. However due to presence of NAs and empty tibbles in bfr column, I'm struggling with writing appropriate predicate function. Here's my attempt:

more_than <- function(df){
  if (nrow(df) == 0 | is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

ex %>% 
  mutate(mean_var2 = map_if(bfr, more_than, 
                            ~.x %>% summarise(mean_var2 = mean(var2))))

which results in:

Error in if (nrow(df) == 0 | is.na(df)) return(FALSE) : argument is of length zero

How can I deal with the presence of both NAs and empty tibbles to write appropriate predicate function?

The issue is with `is.na(df)`, which does the NA check on the whole data while nrow is a single output — akrun, Apr 16 '19 at 05:49
Also, in the `more_than`, you are doing some other calculation which you are not getting as output in the `mean_var2` — akrun, Apr 16 '19 at 05:57
Sorry, I don't get your first comment - could you possibly elaborate in your answer? `more_than` is just a predicate to avoid unneccesary calculations for some elements of `bfr` column. — jakes, Apr 16 '19 at 06:00

akrun · Answer 1 · 2019-04-16T07:01:57.300

2

If the intention is to get the mean of 'var2' column, check the list elements are either data.frame or tibble (in this case it is a tibble) and then do the summarise

out <-  ex %>% 
           mutate(mean_var2 = map_if(bfr, is.tibble, ~ 
             .x %>% 
                summarise(mean_var2 = mean(var2, na.rm = TRUE))))

If we also need to check sum(cat ==1) > 2

more_than <- function(df){
i1 <- is_tibble(df)
if(i1) {
   n <- df %>% 
    summarise(v1 = sum(cat == 1))  %>%
    pull(v1) 
    }

    i1 && (n > 2)


}
ex %>%
  mutate(mean_var2 = map_if(bfr, more_than, ~
      .x %>%
         summarise(mean_var2 = mean(var2, na.rm = TRUE))))

The reason why is.na is not working is because it checks for each dataset and in some of them it is a tibble and this returns a logical matrix, while if/else expects a single TRUE/FALSE to return. For e.g.

(3 == 4) & (cbind(3:5, 1:3) == 3)

yields a different output

One option is to use &&, so that it evaluates the rhs condition only if the first condition is TRUE and thereby avoiding unncessary evaluation

(3 == 4) && (cbind(3:5, 1:3) == 3)
#[1] FALSE

In the OP's original function, if we replace the | with || it should work fine

more_than <- function(df){
  if (nrow(df) == 0 || is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

Update

If we want to return NA for those cases that are not met

ex %>%
    mutate(mean_var2 = map_dbl(bfr, ~ 
    if(is_tibble(.x) && sum(.x$cat == 1) > 2) mean(.x$var2, na.rm = TRUE) else NA))

Or another option is to use possibly (similar to tryCatch)

posmean <- possibly(function(dat) if(sum(dat$cat == 1) > 2) 
     mean(dat$var2, na.rm  = TRUE) else NA_real_, otherwise = NA_real_)
ex %>% 
     mutate(mean_var2 = map_dbl(bfr, posmean))

edited Apr 16 '19 at 07:01

answered Apr 16 '19 at 05:52

akrun

874,273
37
540
662

Then, we don't check for additional condition regarding `cat == 1`, which results in massive overload on the original data. – jakes Apr 16 '19 at 05:54
@jakes Here, we are interested in the column 'var2', so, it is more direct to check whether it is a data.frame or not – akrun Apr 16 '19 at 05:55
Mean of `var2` is used as surogate in order to avoid showing original calculations to be performed which involve number of elements in nested data frames and at the same time, making this post unnecessarily long. Checking if a tibble contains more than 2 values of 1 in `cat` column is crucial, so more complicated predicate function is needed. – jakes Apr 16 '19 at 05:58
@jakes Are you saying that only if sum(cat) > 2, then the `mean` should be calculated or else `NA – akrun Apr 16 '19 at 06:00
Yes, that is exactly what I meant here. – jakes Apr 16 '19 at 06:01
@jakes I updated the post. Note that `map_if` skips the calculation on those list elements that are not TRUE – akrun Apr 16 '19 at 06:07
1

Great! It works now. I wonder why `is.na` didn't work though. Could you possibly elaborate on your first comment under my post in some spare time? Thanks! – jakes Apr 16 '19 at 06:15
@jakes I added some explanation – akrun Apr 16 '19 at 06:19
Thanks! I noticed something else - if condition `n > 2` is not met, `mean_var2` results in original tibble and not NA as intended (see row 3). Any idea how to correct that without checking if `bfr == mean_var2` at the end? – jakes Apr 16 '19 at 06:31
@jakes I noticed that. It is the reason I said `map_if` skips the calculation – akrun Apr 16 '19 at 06:36
1

`ex %>% mutate(mean_var2 = map_if(bfr, more_than, ~.x %>%summarise(mean_var2 = mean(var2,na.rm = TRUE)),.else = NA_integer_))` define `.else` will do the trick, `.else A function applied to elements of .x for which .p returns FALSE.` – A. Suliman Apr 16 '19 at 06:39
1

@jakes You can use `ex %>% mutate(mean_var2 = map_dbl(bfr, ~ if(is_tibble(.x) && sum(.x$cat == 1) > 2) mean(.x$var2, na.rm = TRUE) else NA))` – akrun Apr 16 '19 at 06:40
1

@A.Suliman That is a nice option. I didn't know that. YOu should post it as an answer – akrun Apr 16 '19 at 06:48
@A.Suliman I tested it, but it is giving me `NULL` instead of `NA`. any thoughts? – akrun Apr 16 '19 at 06:49
1

@akrun we need a function for `.else`, I tested with `dplyr::first` and it works as expected hence I define a function like `foo <- function(x){return(NA)}`, so we endup with `ex %>% mutate(mean_var2 = map_if(bfr, more_than, ~.x %>% summarise(mean_var2 = mean(var2,na.rm = TRUE)),.else = foo)) %>% select(mean_var2)` of course you will find something better. One more option `.else = ~return(NA)` – A. Suliman Apr 16 '19 at 07:00
1

@A.Suliman Can you post as an answer, because it would be better for others as well to get more insights from this – akrun Apr 16 '19 at 07:03

score 2 · Answer 2 · answered Apr 16 '19 at 07:10

first, we need to check for NA's with || "see the difference between | and || here" before we check nrow. Then we need .else which is:

.else A function applied to elements of .x for which .p returns FALSE.

when more_than returns FLASE

more_than <- function(df){
 # browser()
  if (all(is.na(df)) || nrow(df) == 0) return(FALSE)

     n <- df %>%
       summarise(sum(cat == 1)) %>%
       as.numeric()

     return(n > 2)
}

ex %>% 
mutate(mean_var2 = map_if(bfr, more_than, 
                          ~.x %>% summarise(mean_var2 = mean(var2,na.rm = TRUE)),
                         .else = ~return(NA))) %>% 
select(mean_var2)

   mean_var2
1        NA
2        NA
3        NA
4    44.125
5      45.5

Omitting NAs and empty data frames with map_if

2 Answers2

Update