Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info

Question

I have a large dataset with observational survey data which I would like to aggregate to country-year level (also for factors), in order to use the data as country-level data in another dataset. One df that I would like to aggregate has the following classes:

character  labelled   numeric 
       24       272        50

Where I am pretty sure the labelled class is the result of the Hmisc library.

I started out as follows, which worked quite well.

dfsum <- df %>%
 group_by(countryyear) %>%
 summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first (.)))

Surprisingly this leaves me with 244/346 variables (I have no clue why it would be that number, any explanation would be great).

I would like to include as many columns as possible in the dfsum. I realise that for un-ordered factors that would not provide any useful info, but it will for the ordered factors. For binary variables the value between 0 and 1 would for example give me the size of each category and the ordinal variables are often scales. I tried to do:

dfsum <- df%>%
 group_by(countryyear) %>%
 summarise_all(funs(if(is.numeric(.)|is.factor(.)) mean(., na.rm = TRUE) else first (.)))

But that did not really do anything (not add any extra variables).

More importantly I would like in the summarization process like to retain the factor information. Is it possible to somehow reattach that information in a different way? For example that it was a binary value (perhaps if more than 50% of the original variable was either 0 or 1), or add the scale (by taking the min and the max of the original variable)?

May be some class/attributes are not working well with the summarise_all. You could remove those attributes and try again `df %>% as.data.frame %>% group_by(countryyear)` — akrun, Aug 29 '18 at 15:00
Thanks.. Would there be an easy way to store class/attributes and add it back after doing the operation? — Tom, Aug 30 '18 at 05:50
If class attributes are that important, may be you can do these outside tidyverse with `base R` so that it won't have any clashes — akrun, Aug 30 '18 at 13:42

Tom · Accepted Answer · 2018-09-13T11:28:03.307

0

By combining a lot of other answers, please see the appropriate links, I managed to deal with my problem as follows:

#1
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
#2
df[] = lapply(df, as.numeric.factor)
#3
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
#4
dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]

1, 2, 3, 4

edited Sep 13 '18 at 11:28

answered Aug 31 '18 at 13:30

Tom

2,173
1
17
44

Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info

1 Answers1

1, 2, 3, 4