
I'll divide this question into two parts: the first is a general question, and the second is a specific one.

First: I would like to know if there is a way to label numeric factors while still keeping their original numeric levels. This is especially confusing since I realised that when we pass a labels argument to factor, the labels become the factor's levels. For example:

x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))
levels(x)
#[1] "a" "b" "c"
labels(x)
#[1] "1" "2" "3" "4" "5" "6" "7"

I would like to know if there is a way, as there is in Stata, to label the categories of a factor. I want to be able to sum x while its elements display as "a", "b" or "c" but keep the values 1, 2 or 3.

Second: I'm asking this because I have a very large data set whose columns contain numeric categories. The data set comes with a dictionary in xlsx, which I read and process in R, so each column has its numeric categories and their respective labels. I'm attempting to read the dictionary, build a list of categories and labels for each column, then read the data set, loop through the columns, and label the variables. The labels are important so that I don't have to look at the dictionary every time I need to interpret something in the data set. The numeric levels are important because I have a lot of dummy variables (yes or no variables) and I want to be able to sum them.

Here's my code (I use the data.table package):

library(data.table)

dic <- readRDS(dictionary_filename)

# Read the data set
data <- fread(dataset_filename, header = TRUE, sep = "|", encoding = "UTF-8", na.strings = c("NA", ""))

# Rows of the dictionary that describe categorized variables
# (this is very specific to my dictionary structure)
index <- which(!is.na(dic$num.categoria))

# Names of the columns that hold categorized variables
names_var <- dic$`Var name`[index]
names_var <- names_var[!is.na(names_var)]

# Data frame of the categorized variables, to be split into a list below
df <- as.data.frame(dic[index, ])
# Convert the index column to a factor so the data frame can be split
# into one sub-list per categorized column
df$N <- as.factor(df$N)
lst <- split(df, df$N)

# Build a list of labels and a list of levels, one element per column
lbs <- list()
lvs <- list()
for (i in seq_along(lst)) {
  lbs[[i]] <- as.vector(lst[[i]]$category)
  lvs[[i]] <- as.vector(lst[[i]]$category.number)
}

# Convert the data set columns into factors with their respective levels and labels
k <- 1
for (var in names_var) {
  set(data, j = var, value = factor(data[[var]], levels = lvs[[k]], labels = lbs[[k]]))
  k <- k + 1
}

I realize the code is a bit abstract since I don't provide the data set or the dictionary, but it is just to give you an idea. My code works: it runs with no errors and does what I hoped it would do (all the categorized columns now show their labels, for example "yes" or "no" where before it was 1 or 0). The exception is that I can no longer access the original numbers in the levels, which I need in the next part of my project.
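To make the problem concrete, here is a minimal sketch with a made-up dummy column. The vector `raw` and the names `lv` and `lb` are placeholders for illustration, not my real data. After labelling, the stored integer codes are 1 and 2 rather than the original 0 and 1, so summing them does not count the "yes" answers. Indexing the saved levels vector by those codes seems to give the original values back, but that relies on keeping `lv` around separately.

```r
# Hypothetical dummy variable coded 0 = "no", 1 = "yes"
raw <- c(1, 0, 0, 1, 1)
lv  <- c(0, 1)
lb  <- c("no", "yes")

f <- factor(raw, levels = lv, labels = lb)

levels(f)               # the labels have replaced the numeric levels
codes <- as.integer(f)  # internal codes: positions in `lv`, starting at 1
sum(codes)              # sums the codes, not the original 0/1 values

# Indexing the saved levels by the codes gives back the original numbers
recovered <- lv[codes]
sum(recovered)          # number of "yes" answers
```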

It would be preferable if there were a general way of doing so, since I run this code in a function across many columns, with different data sets and different dictionaries. Is there a way to accomplish this?

PS: I have read the R documentation and the answers to these questions:

Factor, levels, and original values

Having issues using order function in R

But unfortunately I wasn't able to figure it out by myself, it just became obvious that using the "labels" argument in "factor" was not the way to get it done.

Thank you so much!

bprallon
  • For your first part, I suspect that actually all the information you want is already there in the factor, but that you're confused about the function `labels()` which I don't think does or represents what you think. i.e. if you run `as.integer(x)` aren't those the original numeric values you want preserved? – joran Nov 09 '16 at 21:38
  • Hey joran, if I try to use the labels argument then the original numbers are not preserved, therefore as.integer(x) returns an error, because the previous levels (numeric) are substituted by the labels argument (character), so R tries to convert "a", "b" and "c" to integer, which isn't possible. – bprallon Nov 10 '16 at 14:17
  • `as.integer(x)` will never directly return an error when `x` is a factor; any error you're seeing must be downstream in your own code that expects something that isn't happening. Factors in R are stored internally with a consecutive sequence of integer codes; you have _zero_ flexibility on that. The underlying integers in a factor will always be 1,2,3,etc with no gaps. Period. Then you can control how that sequence of consecutive integers are "labelled"... – joran Nov 10 '16 at 15:30
  • The reason there are separate `levels` and `labels` arguments is because the vector you pass may not have every "level" in it. e.g. `factor(1:3,levels = 1:4,labels = letters[2:5])`. `levels` allows you to notify R that there are some categories you want "space" for that aren't in the original set, (also for ordering purposes). `labels` is just for choosing what to call those levels, rather than the default, which would be character representations of the levels. – joran Nov 10 '16 at 15:33
  • Beyond walking you through how factors work, though, I'm not sure I can help more without a better explanation of what you're trying to do, because to me at least, it isn't very clear. – joran Nov 10 '16 at 15:34
  • Thank you for clearing that up... it was helpful. I'll edit this post later to try to explain it better. – bprallon Nov 10 '16 at 15:40
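
The behaviour described in the comments above can be checked with a short sketch. The object names `x` and `y` are just for illustration; the calls themselves are the ones quoted in the question and in the comments.

```r
x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))

as.integer(x)   # no error: returns the internal codes 1 2 3 2 3 1 2
levels(x)       # "a" "b" "c"

# `levels` can reserve a category that is absent from the data,
# and `labels` only chooses what each level is called
y <- factor(1:3, levels = 1:4, labels = letters[2:5])
levels(y)       # "b" "c" "d" "e"
table(y)        # "e" is present as a level with a count of 0
as.integer(y)   # 1 2 3
```

Here the codes happen to match the original values only because the original levels were already 1, 2, 3; with levels such as 0 and 1 the codes would still be 1 and 2.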

0 Answers