2

I am trying to subset a data table by the numeric values so I can perform a five number summary on the numeric variables only. However, I also need to group the variables. The way that I was trying to do it does not allow me to use the subset and the id variable that is not part of the subset. I know that data table has the .SD command, but I cannot seem to get the right combination of apply functions and group in data table. The id variable is not numeric and cannot be coerced into being numeric; it is also not unique in my data table.

Here is what I have tried:

library(data.table)
library(magrittr)

dt <- data.table(num1 = rep(1, 10), 
      num2 = rep(2, 10), 
      num3 = rep(100, 10), 
      id = c("1a", "2b", "2h", "3b", "4b", "5b", "5b", "7n", "8mn", "9y"), 
      char1 = rep("a", 10), 
      char2 = rep("b", 10))

numeric_variables <- 
  lapply(dt, is.numeric) %>% 
  unlist() %>% 
  as.vector()

dt[, numeric_variables, with = FALSE]

dt_summary <- 
  apply(dt[, numeric_variables, with = FALSE][, grep("num", 
                                                 names(dt[, numeric_variables, with = FALSE]), 
                                                                                value = TRUE), 
                                          with = FALSE],  
 2,
 fivenum)  %>% 
 as.data.frame() 

rownames(dt_summary) <- 
  c("Min", "Q1", "Med", "Q3", "Max")

dt_summary

dt[, .(numeric_variables, id), with = FALSE]

The final line does not work because id is not in the numeric_variables category I created. If someone could direct me to using the correct by, tapply function with .SD I would appreciate it.

NOTE: This is part of a larger program where the user can either select one id to look at or compare two id variables at once. So, it needs to work for either one or many groups (eventually).

dc3
  • 188
  • 1
  • 12
  • 1
    Your code is not reproducible due to lack of loading required libraries, at first look I see that at least two different packages were used, it is nice to have that in *reproducible* example. – jangorecki Dec 23 '15 at 16:47
  • @jangorecki - thanks; edited. – dc3 Dec 23 '15 at 17:56
  • `data.table` doesn't support row names, [for good reasons](http://stackoverflow.com/q/24199533/817778) – eddi Dec 23 '15 at 19:37

1 Answers1

3

i guess you are looking for .SDcols

ind  <- sapply(dt, is.numeric)
(dt_summary <- dt[,lapply(.SD, fivenum), .SDcols = ind])

Which gives you

   num1 num2 num3
1:    1    2  100
2:    1    2  100
3:    1    2  100
4:    1    2  100
5:    1    2  100

As data.table does not accept rownames you can do:

setDF(dt_summary)
rownames(dt_summary) <- 
  c("Min", "Q1", "Med", "Q3", "Max")

> dt_summary
    num1 num2 num3
Min    1    2  100
Q1     1    2  100
Med    1    2  100
Q3     1    2  100
Max    1    2  100
Rentrop
  • 20,979
  • 10
  • 72
  • 100
  • 1
    @dc3 There is really no good reason to use `by` within data.table. If you need grouping, use `data.table`'s `by` parameter. – Roland Dec 23 '15 at 18:50
  • @Roland `dt[, lapply(.SD, FUN = summary), .SDcols = ind, by = id]` – dc3 Dec 23 '15 at 18:59