I'll divide this question into two parts: the first is a general question, and the second is a specific one.
First - I would like to know if there is a way to label a numeric factor but still keep its original numeric levels. This is especially confusing since I realised that when we pass a labels argument to factor(), the labels become the factor's levels. For example:
x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))
levels(x)
#[1] "a" "b" "c"
labels(x)
#[1] "1" "2" "3" "4" "5" "6" "7"
I would like to know if there is a way, as there is in Stata, to label the categories of a factor. I want to be able to sum x while its elements show as "a", "b" or "c" but keep the values 1, 2 or 3.
Second - I'm asking this because I have a very large data set which has columns with numeric categories. This data set comes with a dictionary in xlsx, which I read and process in R, so each column has its numeric categories and their respective labels. I'm attempting to read the dictionary, build a list of categories and labels for each column, then read the data set, loop through the columns and label the variables. These labels are important so that I don't have to look at the dictionary every time I have to interpret something in the data set. The numeric levels are important because I have a lot of dummy variables (yes or no variables) and I want to be able to sum them.
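To make this concrete, here is a minimal sketch of what happens with the x defined above (the names res and original are only introduced here for illustration): sum() refuses to work on the factor, whereas the numbers I would like to sum are the values 1, 2 and 3 that I started with.

```r
x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))

# sum() is not defined for factors, so calling it raises an error;
# tryCatch() captures the error message instead of stopping the script
res <- tryCatch(sum(x), error = function(e) conditionMessage(e))
print(res)

# These are the original values that I would like to be able to sum
original <- c(1, 2, 3, 2, 3, 1, 2)
sum(original)
#[1] 14
```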
Here's my code (I use the data.table package):
library(data.table)
# Reading the dictionary (already processed and saved as RDS) #
dic <- readRDS(dictionary_filename)
# Reading data set #
data <- fread(dataset_filename, header = TRUE, sep = "|", encoding = "UTF-8", na.strings = c("NA", ""))
# Treating the data.set #
# Identifying which lines of the dictionary have categorized variables. This is very specific to my dictionary structure #
index<- which(!is.na(dic$num.categoria))
# storing the names of columns that have categorized variables #
names_var<- dic$`Var name`[index]
names_var<- names_var[!is.na(names_var)]
# Creating a data frame with categorized variables which will be later split into lists #
df<- as.data.frame(dic[index,])
# Transforming the index column to factor so it is possible to split the data frame into a list with sublists for each categorized column #
df$N<- as.factor(df$N)
# Splitting the data frame to list
lst<- split(df, df$N)
# Creating a labels list and a levels list #
lbs<- list()
lvs<- list()
for (i in seq_along(lst)) {
  lbs[[i]] <- as.vector(lst[[i]]$category)
  lvs[[i]] <- as.vector(lst[[i]]$category.number)
}
# Changing the data set columns into factors with their respective levels and labels #
k <- 1
for (var in names_var) {
  set(data, j = var, value = factor(data[[var]], levels = lvs[[k]], labels = lbs[[k]]))
  k <- k + 1
}
I realize the code is a bit abstract since I don't provide the data set or the dictionary, but it is just so you can get an idea. My code works: it runs with no errors and does what I hoped it would do (all the categorized columns now show their labels, for example "yes" or "no" where before it was 1 or 0). The exception is that I can no longer access the original numbers in the levels, which I need in the next part of my project.
It would be preferable to have a general way of doing this, since I run this code inside a function, on many columns, with different data sets and different dictionaries. Is there a way to accomplish this?
PS.: I have read the R documentation and the answers to these questions:
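As a small illustration of the problem, here is a sketch with a made-up 0/1 dummy column (the vector d and its labels are hypothetical, not taken from my data): once the column is labelled, the only integers I can get back are the internal codes 1 and 2, not the original 0 and 1, so the sums no longer match.

```r
# Hypothetical dummy variable: 1 means "yes", 0 means "no"
d <- c(1, 0, 1, 1, 0)
f <- factor(d, levels = c(0, 1), labels = c("no", "yes"))

levels(f)
#[1] "no"  "yes"

# as.integer() returns the internal codes (position of each level), not 0/1
as.integer(f)
#[1] 2 1 2 2 1

sum(d)              # what I want
#[1] 3
sum(as.integer(f))  # what I can get from the factor
#[1] 8
```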
Factor, levels, and original values
Having issues using order function in R
Unfortunately, I wasn't able to figure it out by myself. It only became clear that using the "labels" argument in "factor" is not the way to get this done.
Thank you so much!