I would like an efficient function or code snippet that tries to subset a vector, and returns NA if there are no elements in the subset. For example, for

v1 = c(1, 1, NA)

The code unique(v1[!is.na(v1)]) returns one entry which is great, but for

v2 = c(NA, NA, NA)

the code unique(v2[!is.na(v2)]) returns logical(0), which is a problem when this subsetting operation is used inside a dplyr chain containing summarise_each or summarise, because each summary is expected to be a single value. I would like the second operation to return NA instead of logical(0).

The context behind this is that I am trying to solve this question using multiple spread commands. Example data taken from the previous question:

library(dplyr)
library(tidyr)
set.seed(10)
tmp_dat <- data_frame(
    Person = rep(c("greg", "sally", "sue"), each=2),
    Time = rep(c("Pre", "Post"), 3),
    Score1 = round(rnorm(6, mean = 80, sd=4), 0),
    Score2 = round(jitter(Score1, 15), 0),
    Score3 = 5 + (Score1 + Score2)/2
)

> tmp_dat
Source: local data frame [6 x 5]

  Person  Time Score1 Score2 Score3
   <chr> <chr>  <dbl>  <dbl>  <dbl>
1   greg   Pre     80     78   84.0
2   greg  Post     79     80   84.5
3  sally   Pre     75     74   79.5
4  sally  Post     78     78   83.0
5    sue   Pre     81     78   84.5
6    sue  Post     82     81   86.5

Now, using multiple spreads we can achieve the desired output (albeit with different column names):

tmp_dat %>%
    mutate(Time_2 = Time,
           Time_3 = Time) %>%
    spread(Time, Score1, sep = '.') %>%
    spread(Time_2, Score2, sep = '.') %>%
    spread(Time_3, Score3, sep = '.') %>%
    group_by(Person) %>%
    summarise_each(funs(((function(x)x[!is.na(x)])(.))))

The problem arises when there are too many NAs:

# Replace last two entries in the last row with NA's
tmp_dat$Score2[6] <- NA 
tmp_dat$Score3[6] <- NA 

Running the summarise_each snippet above now produces the error:

Error in eval(substitute(expr), envir, enclos) : expecting a single value
Alex
  • If you know your line always returns just one value, just add `[1]` at the end: `unique(v2[!is.na(v2)])[1]`. Otherwise, just define your own function: `uniqueNotNA<-function(x) {ind<-!is.na(x);if (sum(ind)==0) NA else unique(x[ind])}`. – nicola Nov 04 '16 at 05:53
  • Thanks. Is this efficient though? I like the [1] at the end – Alex Nov 04 '16 at 22:47
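The two suggestions in the comment above can be sketched in base R as follows. The helper name `unique_or_na` is hypothetical (it is not from any package), and returning a typed NA via `x[NA_integer_]` is an assumption made here so that the placeholder has the same type as the input; the comment itself simply returns `NA`.

```r
v1 <- c(1, 1, NA)
v2 <- c(NA, NA, NA)

# Option 1: index the result with [1]. Indexing past the end of a
# zero-length vector yields NA, so an empty subset becomes a single NA.
first_unique_v1 <- unique(v1[!is.na(v1)])[1]
first_unique_v2 <- unique(v2[!is.na(v2)])[1]

# Option 2: hypothetical helper that returns the unique non-NA values,
# or a single NA of the same type as x when nothing is left.
unique_or_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) {
    # x[NA_integer_] is a length-one NA that keeps the type of x
    return(x[NA_integer_])
  }
  unique(keep)
}

res_v1 <- unique_or_na(v1)
res_v2 <- unique_or_na(v2)
```

Both options do a single pass over the vector, so neither should be noticeably slower than the original `unique(v[!is.na(v)])`. Note that option 1 silently drops any additional unique values, while option 2 can still return more than one value when the input has several distinct non-NA entries.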

1 Answer

This can be done easily with dcast from data.table, which can take multiple value.var columns:

library(data.table)
dcast(setDT(tmp_dat), Person ~paste0("Time.", Time), 
                 value.var = c("Score1", "Score2", "Score3"))
#     Person Score1_Time.Post Score1_Time.Pre Score2_Time.Post Score2_Time.Pre Score3_Time.Post Score3_Time.Pre
#1:   greg               79              80               80              78             84.5            84.0
#2:  sally               78              75               78              74             83.0            79.5
#3:    sue               82              81               NA              78               NA            84.5

If we need to use dplyr/tidyr, an option is to gather the 'Score' columns into 'long' format, unite the columns into a single column ('Time1'), and then do the spread:

library(dplyr)
library(tidyr)
gather(tmp_dat, Var, Val, Score1:Score3) %>% 
           mutate(TimeN = 'Time', Var = sub("\\D+", "", Var)) %>%
           unite(Time1, TimeN, Time, Var) %>% 
           spread(Time1, Val)
# # A tibble: 3 × 7
#   Person Time_Post_1 Time_Post_2 Time_Post_3 Time_Pre_1 Time_Pre_2 Time_Pre_3
# *  <chr>       <dbl>       <dbl>       <dbl>      <dbl>      <dbl>      <dbl>
#1   greg          79          80        84.5         80         78       84.0
#2  sally          78          78        83.0         75         74       79.5
#3    sue          82          NA          NA         81         78       84.5
akrun
  • Thanks @akrun. However, if I try to do other `summarise` operations, possibly returning empty vectors, `summarise` will still fail. I would love to be able to return a placeholder in these situations. – Alex Nov 04 '16 at 05:06
  • @Alex In the `dcast`, there is `fun.aggregate` which you can use. – akrun Nov 04 '16 at 05:07
  • @Alex I updated with a dplyr solution, but if you are looking for some `summarise` solutions, then the example should be different – akrun Nov 04 '16 at 05:19
  • Sorry, the second example was only to show context in which I would like such a function to exist. The actual example is in the first part of the question. I have added a problem statement in bold, hope that is clearer. – Alex Nov 04 '16 at 05:24
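As a rough illustration of the `fun.aggregate` suggestion in the comments above, here is a minimal sketch. It assumes data.table is installed; the helper `first_or_na` and the small table `dt` are hypothetical and made up for this example, and only a single `value.var` is used to keep the output column names simple.

```r
library(data.table)

# Hypothetical aggregate: first non-NA value in the cell, or a typed NA
# placeholder when the cell is empty or contains only NAs.
first_or_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) x[NA_integer_] else keep[1L]
}

# Small made-up table in the same shape as tmp_dat
dt <- data.table(
  Person = rep(c("greg", "sue"), each = 2),
  Time   = rep(c("Pre", "Post"), 2),
  Score2 = c(78, 80, 78, NA)
)

# Every cell is reduced to exactly one value, so an all-NA cell
# comes back as NA instead of a zero-length vector.
wide <- dcast(dt, Person ~ paste0("Time.", Time),
              value.var = "Score2",
              fun.aggregate = first_or_na)
```

The same idea should carry over to `summarise`: as long as the function passed in always returns exactly one value, an empty subset produces a placeholder rather than an error.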