Here's some working code to illustrate my question:

# Categorical variable recorded as numeric (integer)
df1 <- data.frame(group = c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2))

I have a categorical variable (group) recorded as integer values. For plots, and to include this variable in models, it would be useful to have it encoded as a factor, mapping each number to a label describing the category. So I create a factor:

# Make it a factor
df1$group_f <- factor(x = df1$group, 
                      levels = c(1, 2, 3, 9), 
                      labels = c("G1", "G2", "G3", "Unknown"))

df1
   group group_f
1      1      G1
2      2      G2
3      3      G3
4      9 Unknown
5      3      G3
6      2      G2
7      9 Unknown
8      1      G1
9      9 Unknown
10     3      G3
11     2      G2

Now, the problem is that eventually I need the original values again, because I have to join tables on this variable, and the other table stores the original numbers for each category (1, 2, 3, 9), not the labels.

Converting to numeric does not work (the "Unknown" category gets mapped to 4 instead of 9):

# And back to numeric
df1$group_num <- as.numeric(df1$group_f)

df1

   group group_f group_num
1      1      G1         1
2      2      G2         2
3      3      G3         3
4      9 Unknown         4
5      3      G3         3
6      2      G2         2
7      9 Unknown         4
8      1      G1         1
9      9 Unknown         4
10     3      G3         3
11     2      G2         2

?factor says:

as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

But as.numeric over the levels does not work either, because the levels are now the character labels and cannot be coerced to numeric:

> as.numeric(levels(df1$group_f))
[1] NA NA NA NA
Warning message:
NAs introduced by coercion 
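
For contrast, the idiom quoted from ?factor does recover the values when the factor is built without custom labels, because the levels are then just the numbers stored as strings. A minimal sketch (the name `f_plain` is made up for illustration):

```r
# Factor built without custom labels: levels are "1", "2", "3", "9"
group <- c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2)
f_plain <- factor(group, levels = c(1, 2, 3, 9))

# The levels can be parsed back to numbers, then indexed by the factor codes
recovered <- as.numeric(levels(f_plain))[f_plain]

identical(recovered, group)
```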

Is there a way to create a factor variable so that it preserves the original values (1, 2, 3, 9 in this example)?

Note: the idea is to have a single factor variable that carries the labels describing the categories, with the original numbers underneath. Although in this example I keep the variable group alongside the newly created factor variable, in my real use case I cannot do that (it is a huge dataset).

Peter Mortensen
elikesprogramming
  • You basically wiped out the information by assigning different labels. You can see this by looking at dput's return values: `dput(df1) structure(list(group = c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2), group_f = structure(c(1L, 2L, 3L, 4L, 3L, 2L, 4L, 1L, 4L, 3L, 2L), .Label = c("G1", "G2", "G3", "Unknown"), class = "factor")), .Names = c("group", "group_f" ), row.names = c(NA, -11L), class = "data.frame")` – IRTFM Sep 29 '16 at 20:50
  • Factors are stored as integers starting with 1 in R, so there's no way to go back if you specify other labels, aside from making a separate variable. – alistaire Sep 29 '16 at 20:51
  • If you had not assigned new labels, then you could have recovered the 9's with `as.numeric(as.character(df1$group_f))`, and this is discussed in the R-FAQ. – IRTFM Sep 29 '16 at 20:54
  • so what is the point in having levels and labels? – elikesprogramming Sep 29 '16 at 21:05
  • @elikesprogramming Rather awkward workaround: you can make a factor from your variable with `factor(x = df1$group, levels = 1:9, labels = c("G1", "G2", "G3", 4, 5, 6, 7, 8, "Unknown"))`. In this case you can get your original numeric values back with `as.numeric`. There are also a lot of packages with labels support for R. – Gregory Demin Sep 29 '16 at 22:00
  • @elikesprogramming what about using a lookup table? – bouncyball Sep 30 '16 at 01:21
  • thx, I like @GregoryDemin 's "awkward workaround". I was trying to find a solution using base functions only. Using other packages, `lfactors` does exactly what I want, but it says "an lfactor both uses more memory than a factor and is, in some ways, more limited than a factor". I haven't looked into how much more memory it uses or why exactly it is more limited than a factor (an lfactor object has both classes, `factor` and `lfactor`, so perhaps the only limitation is "levels must be numeric and the labels must be either not castable as numeric or equal to the levels when cast as numeric"). – elikesprogramming Sep 30 '16 at 06:27
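
The points made in the comments can be checked directly. A minimal sketch, assuming only base R: the first part shows that a factor holds integer codes 1..k plus a character vector of labels, and the second part tries the workaround of declaring every integer from 1 to 9 as a level so that the internal code and the original value coincide. The names `f` and `f_wide` are made up for illustration.

```r
group <- c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2)

# A factor is an integer vector of codes 1..k plus a "levels" attribute
f <- factor(group, levels = c(1, 2, 3, 9),
            labels = c("G1", "G2", "G3", "Unknown"))
unclass(f)   # the codes run 1 to 4; the value 9 is no longer stored anywhere

# Workaround from the comments: one level per integer 1..9,
# so the internal code equals the original value
f_wide <- factor(group, levels = 1:9,
                 labels = c("G1", "G2", "G3", 4, 5, 6, 7, 8, "Unknown"))
as.numeric(f_wide)
```

The cost is five unused levels (4 to 8), which show up in `table()` output unless dropped, and calling `droplevels()` would renumber the codes and undo the trick.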

1 Answer

If you keep the levels and labels vectors used to create the factor, you can use those to work backwards from the factor label to get back to the value.

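# Keep the vectors that were used to build the factor
group_levels <- c(1, 2, 3, 9)
group_labels <- c("G1", "G2", "G3", "Unknown")
# The factor's integer codes index straight into the original levels
df1$reconstituted_group_num <- group_levels[as.integer(df1$group_f)]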

This works because the index value from the labels vector lines up with the index value in the levels vector: Unknown has index 4, and so does its level 9.
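
A variant of the same idea, sketched here under the assumption that the label-to-value mapping is known up front: key the lookup on the label text instead of the position, so it still works if the levels are later reordered. The name `group_codes` is made up for illustration.

```r
group <- c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2)
group_f <- factor(group, levels = c(1, 2, 3, 9),
                  labels = c("G1", "G2", "G3", "Unknown"))

# Named vector: names are the labels, values are the original codes
group_codes <- c(G1 = 1, G2 = 2, G3 = 3, Unknown = 9)

# Look up by label text; unname() drops the labels from the result
group_num <- unname(group_codes[as.character(group_f)])

# Reordering the levels changes the integer codes but not this lookup
shuffled <- factor(group_f, levels = c("Unknown", "G3", "G2", "G1"))
group_num_shuffled <- unname(group_codes[as.character(shuffled)])
```

The trade-off is an `as.character()` conversion of the whole column, which may matter on a very large dataset.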

Chris
  • duplicate of https://stackoverflow.com/questions/27680093/converting-factor-nominal-variables-into-numeric-in-r?rq=1 – Chris Sep 13 '21 at 20:10
  • I'll add that through some convoluted experience I learned cut() is OK, but CSV files will hurt. Besides the other limitations of CSV files, the levels don't get saved with factors when you do this. When you read the CSV back in, all seems good, but the levels are re-created from the unique strings and indexed from a list in alphabetical order. This will cause problems if you originally created the factors with a list that has a different order. – Chris Sep 15 '21 at 20:34
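
The CSV caveat in the last comment can be reproduced with a small round trip. This is a minimal sketch in base R, using a temporary file and a deliberately non-alphabetical level order; the names `level_order`, `back` and `restored` are made up for illustration.

```r
group <- c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2)

# Deliberately non-alphabetical level order: "Unknown" comes first
level_order <- c("Unknown", "G1", "G2", "G3")
f <- factor(group, levels = c(9, 1, 2, 3), labels = level_order)

# Round trip through a temporary CSV file; only the label text is written
path <- tempfile(fileext = ".csv")
write.csv(data.frame(group_f = f), path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = TRUE)

levels(f)             # original order
levels(back$group_f)  # rebuilt from the sorted unique strings

# Re-applying the saved level order restores the original coding
restored <- factor(back$group_f, levels = level_order)
identical(as.integer(restored), as.integer(f))
```

Saving the level order next to the data, or using a format such as `saveRDS()` that keeps the factor intact, avoids having to rebuild it by hand.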