-1

I converted iris data set to data.table format.My aim was to take mean of all columns and group them by Species using data.table.

DT <-as.data.table(iris)

Below is my desired output

DT[,.(mean(Sepal.Length),mean(Sepal.Width),mean(Petal.Length),mean(Petal.Width)),by =.(Species)]
      Species    V1    V2    V3    V4
1:     setosa 5.006 3.428 1.462 0.246
2: versicolor 5.936 2.770 4.260 1.326
3:  virginica 6.588 2.974 5.552 2.026

But typing all the column names using the above manner when we have large number of column is time confusing. I tried this using the below command.But the output is arranged in a slightly different manner

DT[, .(vapply(DT[,!'Species',with=FALSE],mean,FUN.VALUE =1)),by = .(Species)]
               Species       V1
         1:     setosa 5.843333
         2:     setosa 3.057333
         3:     setosa 3.758000
         4:     setosa 1.199333
         5: versicolor 5.843333
         6: versicolor 3.057333
         7: versicolor 3.758000
         8: versicolor 1.199333
         9:  virginica 5.843333
        10:  virginica 3.057333
        11:  virginica 3.758000
        12:  virginica 1.199333

is there any way to use above kind of things and get rid of typing all the column names just to take out the mean and group them by species using data.table

Please Do not suggest to use 'with = FALSE' . I already knew that.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Think this might be a duplicate, but `DT[, lapply(.SD,mean), by=Species, .SDcols=names(DT)[1:4]]` or similar should sort you out. – thelatemail Jun 16 '16 at 10:54
  • Possibly a duplicate of http://stackoverflow.com/questions/16783598/r-datatable-apply-a-function-to-a-subset-of-columns until I can find an exact match. – thelatemail Jun 16 '16 at 10:56
  • that wasn't helpful for me. Can you perform it ? –  Jun 16 '16 at 11:17
  • @learner what do you actually want if `.SDcols` is not helpful? – akrun Jun 16 '16 at 11:17
  • 4
    @learner - it gives exactly the same output as your desired output doesn't it? In fact it can be slightly simplified to `DT[, lapply(.SD,mean), by=Species]` if you just want to exclude the `by=` variable. – thelatemail Jun 16 '16 at 11:19
  • 3
    Have you gone through the [Introduction to data.table](https://github.com/Rdatatable/data.table/wiki/Getting-started) vignette? Specifically section 2e. – Arun Jun 16 '16 at 11:28
  • @learner feed `.SDcols` the vector of column names using negation. Here is one method: `names(iris)[names(iris) != "Species"]`. I'm pretty sure there are cleverer ways. – lmo Jun 16 '16 at 11:57
  • To expand this subsetting, just add names, say you don't want to calculate the mean of Sepal.Length as well, then feed `.SDcols` `names(iris)[!(names(iris) %in% c("Species", "Sepal.Length"))]`. – lmo Jun 16 '16 at 12:04
  • @learner you can self answer the question to not leave it unresolved. – jangorecki Jun 17 '16 at 11:32

1 Answers1

0
iris <- as.data.table(iris)
iris <- iris[,lapply(.SD,mean), by=Species]

Desired Output:-

structure(list(Species = structure(1:3, .Label = c("setosa", 
"versicolor", "virginica"), class = "factor"), Sepal.Length = c(5.006, 
5.936, 6.588), Sepal.Width = c(3.428, 2.77, 2.974), Petal.Length = c(1.462, 
4.26, 5.552), Petal.Width = c(0.246, 1.326, 2.026)), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"), .Names = c("Species", 
"Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"), 
.internal.selfref = <pointer: 0x0000000003fd0788>)

Suppose you want mean of only "Sepal.Length" and "Sepal.Width" columns. You can do that by slight modification as follows:-

iris <- iris[,lapply(.SD,mean), by=Species, .SDcols = c("Sepal.Length", "Sepal.Width")]

In data.table, DT[i,j,by], j must return a list and lapply returns list

SriM
  • 11
  • 4