R Summarize Collapsed Data.Table

Question

I have data such as this

data <- data.table(
 "School" = c(1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 
              1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0),
 "Grade"  = c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 
              0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0),
 "CAT"    = c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 
              0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
 "FOX"    = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 
              1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0),
 "DOG"    = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 
              0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1)
)

and wish to achieve a new data table such as this:

dataWANT <- data.frame(
  "VARIABLE" = c('CAT', 'CAT', 'CAT', 'FOX', 'FOX', 'FOX', 'DOG', 'DOG', 'DOG'),
  "SCHOOL" = c(1, 1, 0, 1, 1, 0, 1, 1, 0),
  "GRADE"  = c(0, 1, 1, 0, 1, 1, 0, 1, 1),
  "MEAN"   = c(NA)
)

dataWANT takes the mean for CAT and FOX and DOG by SCHOOL, GRADE, and SCHOOL X GRADE when they are equal to 1.

I know how to do this one at a time but that is not good for doing this with a big data.

data[, CAT1 := mean(CAT), by = list(SCHOOL)]
data[, FOX1 := mean(FOX), by = list(GRADE)]
data[, DOG1 := mean(DOG), by = list(SCHOOL, GRADE)]
    
data$CAT2 = unique(data[SCHOOL == 1, CAT1])
data$FOX2 = unique(data[GRADE == 1, FOX1])
data$DOG2 = unique(data[SCHOOL == 1 & GRADE == 1, DOG1])

Please only use this:

data <-  data.table(
  "SCHOOL" = c(1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 
               1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0),
  "GRADE"  = c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 
               0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0),       
  "CAT"    = c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
               0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  "FOX"    = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
               1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0),
  "DOG"    = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
               0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1)
)
    
    
data[, CAT1 := mean(CAT), by = list(SCHOOL)]
data[, CAT2 := mean(CAT), by = list(GRADE)]
data[, CAT3 := mean(CAT), by = list(SCHOOL, GRADE)]

data[, FOX1 := mean(FOX), by = list(SCHOOL)]
data[, FOX2 := mean(FOX), by = list(GRADE)]
data[, FOX3 := mean(FOX), by = list(SCHOOL, GRADE)]

data[, DOG1 := mean(DOG), by = list(SCHOOL)]
data[, DOG2 := mean(DOG), by = list(GRADE)]
data[, DOG3 := mean(DOG), by = list(SCHOOL, GRADE)]


dataWANT <- data.frame(
  "VARIABLE" = c('CAT', 'CAT', 'CAT', 'FOX', 'FOX', 'FOX', 'DOG', 'DOG', 'DOG'),
  "TYPE"     = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  "MEAN"     = c(0.48, 0.44, 0.428, 0.6, 0.611, 0.6428, 0.52, 0.61, 0.6428)
)

where:
TYPE equals to 1 when MEAN in estimated by SCHOOL,
TYPE equals to 2 when MEAN is estimated by GRADE,
TYPE equals to 3 when MEAN is estimated by SCHOOL and GRADE

do you need `melt(data, id.var = c('School', 'Grade'))[, .(MEAN = mean(value == 1)) , .(School, Grade, variable)]` — akrun, May 05 '20 at 19:56
@akrun thank you please use this data under 'Please only use this:' thank you it includes full reproducible desired output — bvowe, May 05 '20 at 20:06
@akrun thank you yes--i wish to create this code automatically. i showed that i learned how to do this but it is not very efficient because i have 100+ variables so an automated way to do this is ideal, where I can feed a vector of colnames. the end goal is to have dataWANT from data — bvowe, May 05 '20 at 20:15
@akrun i apologize i did not clarify that. i updated the post. — bvowe, May 05 '20 at 20:34

Ronak Shah · Answer 1 · 2020-05-05T13:02:45.830

1

Do you mean to get something like this?

library(data.table)

melt(data, measure.vars = c('CAT', 'FOX', 'DOG'))[, 
        .(MEAN = mean(value, na.rm = TRUE)), .(School, Grade, variable)]

To group it by different columns, we can do :

cols <- c('CAT', 'FOX', 'DOG')
data1 <- melt(data, measure.vars = cols)
list_cols <- list('School', 'Grade', c('School', 'Grade'))

lapply(list_cols, function(x)  
         data1[, .(MEAN = mean(value, na.rm = TRUE)), c(x, 'variable')])

edited May 05 '20 at 13:02

answered May 05 '20 at 10:44

Ronak Shah

377,200
20
156
213

yes I think that is it. My question is what if I want to do it by School, and then by Grade, and then by Both? Also how to input a vector of VAR names instead of listing in melt function? – bvowe May 05 '20 at 12:02
1

@bvowe Updated the answer, is that what you meant? – Ronak Shah May 05 '20 at 13:03
yes that is perfect, then to extract the different lists together and put into a dataframe I tried this x=data1[[1]][School == 1] y=data1[[2]][Grade == 1] z=data1[[3]][School == 1 & Grade == 1] Z = cbind(x,y,z) but how I can get it to look like dataWANT in example? – bvowe May 05 '20 at 13:58
1

I am not sure if I understand correctly. Do you just mean to bind the rows together from `lapply`? `dplyr::bind_rows(lapply(list_cols, function(x) data1[, .(MEAN = mean(value, na.rm = TRUE)), c(x, 'variable')])) `? – Ronak Shah May 05 '20 at 14:43
thank you so much that is close, I updated the question with the exact data output desired please have a look if you can and let me know if that is helpful to clarify – bvowe May 05 '20 at 18:41

score 1 · Accepted Answer · answered May 05 '20 at 20:34

We could use rbindlist after creating a list by taking the MEAN after melting the dataset (as in the other post)

library(data.table)
cols <- c('CAT', 'FOX', 'DOG')
data1 <- melt(data, measure.vars = cols)
list_cols <- list('SCHOOL', 'GRADE', c('SCHOOL', 'GRADE'))
lst1 <- lapply(list_cols, function(x)  
       data1[, .(MEAN = mean(value, na.rm = TRUE)), c(x, 'variable')])
rbindlist(lapply(lst1, function(x)  {
     nm1 <- setdiff(names(x), c('variable', 'MEAN'))
     x[Reduce(`&`, lapply(mget(nm1), as.logical)),
     .(VARIABLE = variable, MEAN)]}), idcol = 'TYPE')[order(VARIABLE)]
#   TYPE VARIABLE      MEAN
#1:    1      CAT 0.4800000
#2:    2      CAT 0.4444444
#3:    3      CAT 0.4285714
#4:    1      FOX 0.6000000
#5:    2      FOX 0.5555556
#6:    3      FOX 0.6428571
#7:    1      DOG 0.5200000
#8:    2      DOG 0.6111111
#9:    3      DOG 0.6428571

another perfect solution thank you so much. i just one more question for today if you are interested, https://stackoverflow.com/questions/61623465/r-data-table-compare-groups-simulteanous — bvowe, May 05 '20 at 21:33
@bvowe thanks, in that question you have proportion values, chisq test compares the count — akrun, May 05 '20 at 21:36

rg255 · Answer 3 · 2020-05-05T21:16:54.520

You could subset and calculate your means first using lapply(.SD,...) then melt that into your output:

melt(data[School != 0 | Grade != 0, lapply(.SD, mean), by = .(School, Grade)], id.vars = c("School", "Grade"))

Adding this after also adds the TYPE variable

...][, TYPE := School + (2*Grade)]

Putting it all together and tidying it up too, it matches your desired output

dataWANT <- melt(data[School != 0 | Grade != 0, lapply(.SD, mean), by = .(School, Grade)], id.vars = c("School", "Grade"))[, TYPE := School + (2*Grade)][order(variable, TYPE), .("VARIABLE" = variable, TYPE, "MEAN" = value)]

R Summarize Collapsed Data.Table

3 Answers3

Linked