I'm using the following function, grp, to aggregate with data.table
and am running into a problem:
the levels of the factor variable fc_x
are not kept in the same order after aggregation.
Is there a problem with my function, or is this expected behavior with a known explanation?
library(data.table)

# Per group: share of observations in each level of the factor x
grp <- function(x) {
  percentage <- as.numeric(table(x) / length(x))
  list(x = factor(levels(x)),
       percentage = percentage,
       label = paste0(round(percentage * 100), "%")
  )
}
set.seed(123)
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = TRUE,
               labels = c("0-50", "51-100", "+100"))
str(DT)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 3 variables:
# $ x : num 90.7 59.4 18 125.4 187.7 ...
# $ fac : Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ fc_x: Factor w/ 3 levels "0-50","51-100",..: 2 2 1 3 3 3 3 3 1 1 ...
levels(DT$fc_x)
# [1] "0-50" "51-100" "+100"
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "+100" "0-50" "51-100"
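The reordering can be reproduced without data.table, which suggests it comes from the factor(levels(x)) call inside grp rather than from the aggregation itself. The snippet below is only a minimal sketch under that assumption; the names f, resorted and kept are made up for illustration. levels(x) returns a plain character vector, and factor() sorts the unique values of a character vector when no levels argument is given.

```r
# A factor with the same custom level order as fc_x in the question
f <- factor(c("51-100", "0-50", "+100"),
            levels = c("0-50", "51-100", "+100"))

# levels(f) is a plain character vector, so factor() sorts it again
resorted <- factor(levels(f))

# Passing the levels explicitly keeps the original order
kept <- factor(levels(f), levels = levels(f))

levels(f)         # the custom order set above
levels(resorted)  # same as sort(levels(f)); the exact order depends on the locale
levels(kept)      # the custom order again
```

If that is indeed the cause, the sorted order in AGG would be expected rather than a data.table quirk.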
EDIT
Changing the label "+100" to "1000" gives a similar result:
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = TRUE,
               labels = c("0-50", "51-100", "1000"))
levels(DT$fc_x)
# [1] "0-50" "51-100" "1000"
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50" "1000" "51-100"
Using ordered_result = TRUE in the cut() call gives the same result:
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = TRUE, ordered_result = TRUE,
               labels = c("0-50", "51-100", "1000"))
levels(DT$fc_x)
# [1] "0-50" "51-100" "1000"
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50" "1000" "51-100"
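All three results are consistent with grp rebuilding the factor from its level names, which drops both the custom level order and the ordered class. As a hedged sketch of one possible workaround (grp_keep is a hypothetical name, not something from data.table), the levels and the ordered flag can be passed through explicitly:

```r
# Hypothetical variant of grp that carries the level order of x through
grp_keep <- function(x) {
  percentage <- as.numeric(table(x)) / length(x)
  list(x = factor(levels(x), levels = levels(x), ordered = is.ordered(x)),
       percentage = percentage,
       label = paste0(round(percentage * 100), "%"))
}

# Small base R check with the same breaks and labels as in the question
f <- cut(c(10, 60, 150, 70), breaks = c(0, 50, 100, 10e5),
         labels = c("0-50", "51-100", "+100"))
res <- grp_keep(f)
levels(res$x)  # "0-50" "51-100" "+100"
```

Used as DT[, grp_keep(fc_x), by = fac], this should keep the levels of the aggregated column in their original order, since every group then returns a factor with identical levels; I have only checked the function itself in base R here.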