
I have some results that I put in a data frame. I have some factor columns and many numeric columns. I can easily convert the numeric columns to numeric with indexing, as per the answer to this question.

#create example data
df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]
df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

#find columns that are factors
factornames = c("X1", "X2", "X3")
factorfilt = names(df) %in% factornames

#convert non-factor columns to numeric
df[, !factorfilt] = as.numeric(as.character(unlist(df[, !factorfilt])))

But when I want to do the same for my factor columns, I can't get the same indexing to work:

#convert factor columns to factor
df[, factorfilt] = as.factor(as.character(unlist(df[, factorfilt])))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(as.character(df[, factorfilt]))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(unlist(df[, factorfilt]))
class(df$X1)

[1] "character"

df[, factorfilt] = as.factor(df[, factorfilt]) 

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

All of these return "character" if I call class(df$X1), while if I run df$X1= as.factor(df$X1) it returns "factor".

Why does indexing this way not work when I call as.factor, but does if I call as.numeric?

Leo
  • `as.factor`, `as.character`, etc. work on a `vector` and not on a `data.frame`. You need to loop through the columns and then do `factor` – akrun Aug 16 '17 at 12:28
  • Isn't that why `unlist` is in there? – Leo Aug 16 '17 at 12:29
  • Following akrun's comment, use `lapply` to run through the selected columns and perform the coercion: `df[, factorfilt] <- lapply(df[, factorfilt], as.factor)`. – lmo Aug 16 '17 at 12:30
  • I'm pretty sure there is a way to do this with indexing, since it also works for `as.numeric` – Leo Aug 16 '17 at 12:30
  • `numeric` is an atomic data type, while a factor is a special data type that maps labels to integers and has its own class. A `data.frame` also has its own class, so an attempt to assign the values in `as.factor(unlist(df[, factorfilt]))` (which is a factor) into multiple columns of the data.frame causes the function to convert the unlisted vector to character before reassignment. The function involved is pretty complicated; type `\`[<-.data.frame\`` to take a look. The line that strips the factor class and converts to character may be `value <- matrix(value, n, p)`, since a matrix has its own class. – lmo Aug 16 '17 at 12:55
  • Thanks for explaining it so clearly. So this is just not possible? Would be a good answer. – Leo Aug 16 '17 at 12:57
  • In general, `mutate_if` is a great way to convert all columns of one type to another. – Andrew Brēza Aug 16 '17 at 12:58
  • Yes, that looks very convenient. I also found that in the bottom answer to the question below. It just gets so kungfoosing to use different functions for the same operation; that's why I wanted to do it the same way as advised for the numeric columns. I will make a comment on the answer that I linked to at the top of my post and say that the advised solution is not generally applicable. https://stackoverflow.com/questions/20637360/convert-all-data-frame-character-columns-to-factors – Leo Aug 16 '17 at 13:02
  • A pretty detailed response on trying to have factor matrices: https://stackoverflow.com/questions/28723059/can-we-get-factor-matrices-in-r. From that answer, when you use `matrix(...)` it uses `as.vector()` on the data before building the matrix. This is what converts factors to characters (try `class(as.vector(factor(c(1,2,3))))`) – Mike H. Aug 16 '17 at 13:07

1 Answer

It helps to look at what each of your expressions actually returns before you assign it back into the data frame. Defining your data as you did:

df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]
df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

factornames = c("X1", "X2", "X3")
factorfilt = names(df) %in% factornames
df[, !factorfilt] = as.numeric(as.character(unlist(df[, !factorfilt])))

Now let's take a look at the result of converting X1, X2, and X3 to factors as you did in the attempt without unlist, but let's not assign it back yet.

test <- as.factor(as.character(df[, factorfilt]))
class(test) # "factor"
length(test) # 3

The important thing to notice here is that test is not a data frame. It's a vector of length 3 (one element per column, not one per cell) that you are attempting to save over three columns of a data frame. I think we should question the wisdom of converting a data frame to a vector to store in a data frame.
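To see where the length of 3 comes from, here is a minimal sketch on a hypothetical two-column data frame called `small` (not part of your data). It suggests that `as.character()` on a data frame yields one string per column, while `unlist()` yields one element per cell:

```r
# Hypothetical toy data frame, used only for illustration
small <- data.frame(a = c("A", "B"), b = c("C", "D"), stringsAsFactors = FALSE)

# as.character() treats the data frame as a list of columns
# and returns one (deparsed) string per column
per_column <- as.character(small)
length(per_column)   # 2, one element per column

# unlist() flattens the columns into a single vector of cell values
flat <- unlist(small)
length(flat)         # 4, one element per cell
```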

Then consider the assignment that uses unlist:

test2 <- as.factor(as.character(unlist(df[, factorfilt])))
class(test2) # "factor"
length(test2) # 3000

Again, it's a factor, but it has a completely different length than test. R is being kind by letting you assign this back into df at all, and is only doing so because the 3000 values fit the dimensions of three columns with 1000 rows each. But when you try to push the factors into X1, X2, and X3, there's a big question about what to do with the factor levels. Should all three variables have the same levels? Should each variable only have the levels present within itself? Instead of attempting to declare what the "appropriate" choice is, R drops the factor class and hands the values back as character for you to deal with on your own.
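The comments on the question suggest that the class is lost when the vector is reshaped with `matrix()` inside the data frame assignment. I have not traced the assignment method line by line, so treat the following only as a sketch of that one step, using a hypothetical factor `f`:

```r
# Hypothetical factor, used only for illustration
f <- factor(c("A", "B", "A", "C"))
class(f)                      # "factor"

# matrix() runs as.vector() on classed input, which turns a factor
# into its character labels and discards the levels attribute
m <- matrix(f, nrow = 2, ncol = 2)
class(m[, 1])                 # "character"
```

Once the values sit in a character matrix, there is no factor information left to carry over into the individual columns.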

The fact that manipulating columns this way has the potential to change classes unexpectedly is a good reason not to do it. This is evident in your assignment of the NAs. Let's revisit:

df = data.frame(replicate(1000,sample(1:10,1000,rep=TRUE)))
df$X1 = LETTERS[df$X1]
df$X2 = LETTERS[df$X2]
df$X3 = LETTERS[df$X3]

At this point, X4 through X1000 are all integer class columns. When you run

df[-1] <- sapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

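X2 through X1000 are all character columns now, because sapply simplifies its result to a single matrix, and a matrix that contains any character data is character throughout. You then proceed to convert them to numeric, so they aren't even their original integer class anymore.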
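A minimal sketch of that coercion, on a hypothetical toy data frame `toy` with one character and one integer column (the identity function stands in for your `ifelse` call):

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(id = c("A", "B", "C"), n = 1:3, stringsAsFactors = FALSE)
class(toy$n)                  # "integer"

# sapply() simplifies the per-column results into one matrix,
# and a matrix can hold only a single type
as_matrix <- sapply(toy, function(x) x)
class(as_matrix[, "n"])       # "character": the integers were coerced
```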

If, instead, we use lapply

df[-1] <- lapply(df[-1], function(x) ifelse(runif(length(x)) < 0.1, NA, x))

the original classes are preserved and there's no need to convert them back to a numeric class. Similarly, we can readily convert X1 through X3 to factors with

df[, factorfilt] <- lapply(df[, factorfilt], as.factor)
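As a quick check of the same pattern, here is a sketch on a hypothetical toy data frame (`toy`, `g1`, `g2`, and `is_group` are made-up names). Each converted column ends up with its own levels, and the untouched column keeps its class:

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(g1 = c("A", "B", "A"), g2 = c("C", "C", "D"), n = 1:3,
                  stringsAsFactors = FALSE)
is_group <- names(toy) %in% c("g1", "g2")

# lapply() returns a list with one factor per column, so the
# assignment replaces each column with its own factor
toy[, is_group] <- lapply(toy[, is_group], as.factor)

sapply(toy, class)            # g1 and g2 are "factor", n stays "integer"
levels(toy$g1)                # "A" "B"
levels(toy$g2)                # "C" "D"
```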

As a general rule, it is better to manipulate the data in columns as distinct columns. Once you begin assigning a single vector over multiple columns, you enter a dark world of mischief.

Benjamin
  • Hmm I was not aware of those things at all, thanks. Reading about `sapply` and `lapply` it seems they are the same though? – Leo Aug 16 '17 at 13:08
  • There's a difference in what they return. `sapply` returns either a vector or a matrix (in this particular case, a character matrix). `lapply` returns a list, which will keep the columns of your data frame properly partitioned. – Benjamin Aug 16 '17 at 13:12