I am trying to write a function that converts selected columns of a data frame to a categorical data type (factor) before running a regression analysis.

The question is: how do I select a particular column of a data frame when its name is held in a string (character)?

Example:

  strColumnNames <- "Admit,Rank"
  strDelimiter <- ","
  strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
  for( strColName in strSplittedColumnNames[[1]] ){
    dfData$as.name(strColName) <- factor(dfData$get(strColName))
  }

Tried:

dfData$as.name()
dfData$get(as.name())
dfData$get()

Error Msg: Error: attempt to apply non-function

Any help would be greatly appreciated! Thank you!!!

AiRiFiEd
  • I didn't know about the tick, and thanks for your guidance. It seems the tick is very important to users - pretty scary here. – AiRiFiEd Oct 09 '16 at 03:59

2 Answers

You need to change

dfData$as.name(strColName) <- factor(dfData$get(strColName))

to

dfData[[strColName]] <- factor(dfData[[strColName]])

See ?"[[" for details.
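Putting this together, the function described in the question could be sketched roughly as follows. The function name convert_to_factor, its arguments, and the sample data are hypothetical and only serve as illustration; the column check is an added safeguard, not part of the original code.

```r
# Hypothetical helper: convert the columns named in a delimited string to factors.
convert_to_factor <- function(df, col_names, delimiter = ",") {
  # Split the string into individual names and drop surrounding whitespace.
  cols <- trimws(strsplit(col_names, delimiter, fixed = TRUE)[[1]])

  # Fail early if a requested column does not exist.
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }

  # `[[` evaluates `col`, so it works with names stored in variables.
  for (col in cols) {
    df[[col]] <- factor(df[[col]])
  }
  df
}

# Made-up sample data, for illustration only.
dfData <- data.frame(Admit = c(0, 1, 1, 0),
                     Rank  = c(1, 2, 3, 2),
                     Gre   = c(380, 660, 800, 640))
dfData <- convert_to_factor(dfData, "Admit,Rank")
str(dfData)
```

Returning the modified data frame, rather than changing dfData in place, follows R's usual copy-on-modify behaviour, so the result has to be assigned back.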

In your case the column names are generated programmatically, so you need [[ ]] rather than $: the $ operator takes the name after it literally and does not evaluate a variable. Maybe this example will make the problem with $ clear:

dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"

dat$z
# NULL

dat[[z]]
# [1] 1 2 3 4 5

Regarding the other answer

apply does not work here, because the function being applied is as.factor or factor. apply always operates on a matrix (if you feed it a data frame, it converts it to a matrix first), and in this case it returns a character matrix, because a matrix cannot hold factor columns. Consider this example:

x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)

str(x)
#'data.frame':  4 obs. of  3 variables:
# $ x1: chr  "a" "b" "c" "d"
# $ x2: chr  "A" "B" "C" "D"
# $ x3: int  1 2 3 4

Note that x1 and x2 are still character variables rather than factors. We have to use lapply instead:

x[1:2] <- lapply(x[1:2], as.factor)

str(x)
#'data.frame':  4 obs. of  3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int  1 2 3 4

Now we see the factor class in x1 and x2.
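The same lapply pattern also works when the columns are selected by name rather than by position, which is closer to the situation in the question. A minimal sketch, reusing the example data above; the variable cols is an assumption for illustration:

```r
# Made-up data frame mirroring the example above.
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4,
                stringsAsFactors = FALSE)

# Column names held in a character vector, e.g. produced by strsplit().
cols <- c("x1", "x2")

# Single-bracket indexing with names keeps a data frame (a list of columns),
# so lapply() visits each column and the factor class survives the assignment.
x[cols] <- lapply(x[cols], as.factor)

sapply(x, class)
```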

Using apply on a data frame is rarely a good idea. If you read the source code of apply:

    dl <- length(dim(X))
    if (is.object(X)) 
    X <- if (dl == 2L) 
        as.matrix(X)
    else as.array(X)

You can see that a data frame (which has 2 dimensions) is coerced to a matrix first. This can be slow for large data. Also, if the data frame columns have different classes, the resulting matrix has only one type: a mix of numeric and character columns, for example, becomes entirely character.
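The coercion can be observed directly. The small data frame below is made up for illustration:

```r
# Made-up data frame with one integer and one character column.
df <- data.frame(n = 1:3, s = c("a", "b", "c"), stringsAsFactors = FALSE)

# Column by column, the classes differ.
sapply(df, class)

# After conversion to a matrix, everything shares one type.
m <- as.matrix(df)
typeof(m)
# [1] "character"

# apply() sees the coerced matrix, so every column looks like character.
col_classes <- apply(df, 2, class)
col_classes
```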

Moreover, apply is written in R, not C, and uses an ordinary for loop:

    for (i in 1L:d2) {
        tmp <- forceAndCall(1, FUN, newX[, i], ...)
        if (!is.null(tmp)) 
            ans[[i]] <- tmp
    }

so it is no better than an explicit for loop you write yourself.

Zheyuan Li

I would use a different method.

Create a vector of column names you want to change to factors:

factorCols <- c("Admit", "Rank")

Then extract these columns by index:

myCols <- which(names(dfData) %in% factorCols)

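Finally, use lapply to change these columns to factors (single-bracket indexing without the comma keeps a data frame even when only one column is selected):

dfData[myCols] <- lapply(dfData[myCols], as.factor)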
greghk
  • hey greghk, is there any reason behind the choice of this method vs that proposed by Zheyuan? thank you! – AiRiFiEd Sep 17 '16 at 15:31
  • @ZheyuanLi you're right, lapply would be better for reproducibility in the future. My point is that the code could be made more concise and easy to understand by using an *apply function rather than a for loop – greghk Sep 17 '16 at 17:30
  • @ZheyuanLi I've edited to reflect your points about lapply – greghk Sep 17 '16 at 17:54
  • hey greghk! Thanks for your inputs as well! Let me test both codes out over the coming weekend! really appreciate both your help! – AiRiFiEd Sep 19 '16 at 15:45