Gettings head() and tail() where NA is present for many variables

Question

I have dozens of variables, many of which have missing values, including at the first and last observation. I want a new dataset that contains, per person, the first and last observation for each variable, ignoring missings.

The below code does it, but I was hoping 1) there was some function that is similar to head(), but without having to remove NAs manually, 2) a way to write a function that dplyr's summarize_each() could use to automate across all variables in the dataset (other than id, of course)

set.seed(23331)
df <- data.frame(id=rep(c(1,2,3,4), each = 5),
                 a = c(NA, rnorm(4), rnorm(3), rep(NA, 2), rnorm(4), rep(NA, 5), rnorm(1)),
                 b = c(rep(NA, 2), rnorm(14), rep(NA, 3), rnorm(1)))
df %>% group_by(id) %>% summarise(a.head=head(a[!is.na(a)], n=1), 
                                  a.tail=tail(a[!is.na(a)], n=1),
                                  b.head=head(b[!is.na(b)], n=1), 
                                  b.tail=tail(b[!is.na(b)], n=1)) %>% 
  gather("type", "value", -id) %>% 
  separate(type, into = c("variable", "time"), sep = "\\.") %>% 
  spread(variable, value)

I'm hoping for a dplyr solution but would take a base or data.table solution if one of those is the optimal way of going about it.

Desired Output:

Source: local data frame [8 x 4]

     id  time          a          b
  (dbl) (chr)      (dbl)      (dbl)
1     1  head -0.5877282  0.4975612
2     1  tail -0.7904277 -0.3860010
3     2  head  0.5872134 -0.3923887
4     2  tail -0.3222003  0.3114662
5     3  head -0.2553290  0.7521095
6     3  tail  0.3095699 -0.9113326
7     4  head -0.3809334  1.4752274
8     4  tail -0.3809334  3.2767918

You can reduce a couple of steps `df %>% group_by(id) %>% summarise_each(funs(Head= head(.[!is.na(.)], n=1), Tail= tail(.[!is.na(.)], n=1))) %>% gather(Var, Val, -id) %>% separate(Var, c('Variable', 'time')) %>% spread(Variable, Val)` — akrun, Nov 05 '15 at 13:16
(1) `na.omit(x)` is slightly more readable than `x[!is.na(x)]`; (2) why not write helper functions (e.g. `hn1 <- function(x) head(na.omit(x),1)`) for further readability? — Ben Bolker, Nov 05 '15 at 13:17

akrun · Accepted Answer · 2015-11-05T13:39:01.037

2

We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'id', we loop through the Subset of Data.table (lapply(.SD,..) and head the headand tail of each column.

library(data.table)
f1 <- function(x, n) {x1 <- x[!is.na(x)]; c(head(x1,n), tail(x1,n))}
setDT(df)[,lapply(.SD, f1, n=1) ,id][, time:= c('head', 'tail')][]

Or use melt/dcast

 DT <- setDT(df)[,melt(lapply(.SD, function(x) list(head=head(x[!is.na(x)],1),
              tail=tail(x[!is.na(x)],1)))) ,id]
 dcast(DT, id+L2~L1, value.var='value')

edited Nov 05 '15 at 13:39

answered Nov 05 '15 at 13:18

akrun

874,273
37
540
662

2

I think you could just define something like `Myfunc <- function(x) { tmp <- x[!is.na(x)] ; c(tmp[1L], tmp[length(tmp)])}` and then could use it in both as `setDT(df)[, lapply(.SD, Myfunc), by = id]` and `df %>% group_by(id) %>% summarise_each(funs(Myfunc))`. You still need to add the `time` var btw. – David Arenburg Nov 05 '15 at 13:23
@DavidArenburg You are right. In the first option, I manually generated the 'time', but `melt/dcast` should automatically get the column. – akrun Nov 05 '15 at 13:25

score 2 · Answer 2 · answered Nov 05 '15 at 13:35

dplyr is not designed for transformations that lead to a number of rows other than 1 or n().

To stay in that world, you can use the (as far as I've seen) inefficient do:

library(magrittr)
ht_nona = . %>% na.omit %>% { c(first(.), dplyr::last(.)) }

df %>% group_by(id) %>% do( as.data.frame(lapply(., ht_nona)) )

Another (arguably even worse) option would be to summarise twice and bind rows:

bind_rows(
  df %>% group_by(id) %>% summarise_each(funs(. %>% na.omit %>% first)),
  df %>% group_by(id) %>% summarise_each(funs(. %>% na.omit %>% (dplyr::last)))
)

I realized this doesn't have the head/tail column after posting it; and think it is too much trouble to shoehorn in. Hopefully the message is clear enough: dplyr seems ill-suited to this. — Frank, Nov 05 '15 at 14:31

score 1 · Answer 3 · edited May 23 '17 at 12:06

A variation on @akrun's answer, again with data.table:

library(data.table)

setDT(df)[, c(
  list(time=c("head","tail")), 
  lapply(.SD, function(v) setDT(list(v))[!is.na(V1)][c(1,.N), V1] )
), by=id]

   id time          a          b
1:  1 head -0.5877282  0.4975612
2:  1 tail -0.7904277 -0.3860010
3:  2 head  0.5872134 -0.3923887
4:  2 tail -0.3222003  0.3114662
5:  3 head -0.2553290  0.7521095
6:  3 tail  0.3095699 -0.9113326
7:  4 head -0.3809334  1.4752274
8:  4 tail -0.3809334  3.2767918

setDT(list(v)) borrowed from @eddi.

score 1 · Answer 4 · answered Nov 05 '15 at 15:39

The downside is that this one requires three packages

set.seed(23331)
df <- data.frame(id=rep(c(1,2,3,4), each = 5),
                 a = c(NA, rnorm(4), rnorm(3), rep(NA, 2), rnorm(4), rep(NA, 5), rnorm(1)),
                 b = c(rep(NA, 2), rnorm(14), rep(NA, 3), rnorm(1)))

library('base')
library('utils')
library('stats')

data.frame(id = rep(1:4, each = 2), time = c('head', 'tail'), 
           sapply(df[, -1], function(x) unlist(tapply(x, df$id, FUN = function(y)
             c(head(na.omit(y), 1), tail(na.omit(y), 1))))))

#    id time          a          b
# 11  1 head -0.5877282  0.4975612
# 12  1 tail -0.7904277 -0.3860010
# 21  2 head  0.5872134 -0.3923887
# 22  2 tail -0.3222003  0.3114662
# 31  3 head -0.2553290  0.7521095
# 32  3 tail  0.3095699 -0.9113326
# 41  4 head -0.3809334  1.4752274
# 42  4 tail -0.3809334  3.2767918

You might want to give a hint as to the joke (that all those packages are included in the base installation). — Frank, Nov 05 '15 at 16:00

Gettings head() and tail() where NA is present for many variables

4 Answers4