I have a clean dataframe (1500 rows x 297 columns, named 'Data', very inspiring) with both numeric and factor columns. However, as is often the case, my factors were encoded as numbers (each number representing a level), hence a dataframe full of numeric vectors. To overcome this, I also have a second dataframe (VarLabels) containing information about the columns of the first dataframe (so it has... 297 rows, as you would imagine). In it, one specific column (VarLabels$TypeVar) defines what the data class should be in the main dataframe.

I wrote the following piece of code, which might not be optimal but has worked so far:

(NB: as you can see, for columns labelled 'Mix' I wish to create a copy, so as to have one numeric and one factor version)

nbcol <- ncol(Data)
indexcol <- which(colnames(VarLabels) == "TypeVar")
for(i in 1:nbcol){
  if (colnames(Data)[[i]] %in% VarLabels$VarName){

    if (VarLabels[i,indexcol] == "Quant"){ 
      Data[[i]] <- as.numeric(Data[[i]])

    } else if (VarLabels[i,indexcol] == "Qual") { 
      Data[[i]] <- as.character(Data[[i]])
      Data[[i]] <- as.factor(Data[[i]])

    } else if (VarLabels[i,indexcol] == "Mix") { 
      Data <- cbind(Data, Data[[i]])
      Data[[i]] <- as.character(Data[[i]])
      Data[[i]] <- as.factor(Data[[i]])
      Data[[ncol(Data)]] <- as.numeric(Data[[ncol(Data)]])
      colnames(Data)[[ncol(Data)]] <- paste(colnames(Data)[[i]], "Num", sep = "_")

    } else {
      Data[[i]] <- as.numeric(Data[[i]])
    }
  }
}

Do you have a neater solution, possibly using a function to reduce the number of lines of code, and/or using names instead of column indices (which may be risky if the order changes in one of the two dataframes)? I recently got into R and am still struggling with user-defined functions.

I read other related topics like:

Change all columns from factor to numeric in R

Function to change class of columns in R to match the class of an other dataset

Convert type of multiple columns of a dataframe at once

How do I get the classes of all columns in a data frame?

but could not apply the answers to my own problem. Any idea how to make this simpler (if possible)?

zx8754
Maxence Dum.
  • How did you get to the "clean data"? We can set the column classes when importing. See `?read.table`, *colClasses* argument. – zx8754 Mar 19 '20 at 08:40
  • Thank you for your comment! I used the readr package and its read_csv2 function. It does the job well (at least as far as I know), but since the factors were encoded as numbers, there was no way for the software to find out they actually were factors! Would your suggestion imply defining which columns are factors from the very start? – Maxence Dum. Mar 19 '20 at 09:46
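
The colClasses suggestion in the comments can be sketched in base R roughly as follows. This is a minimal illustration under assumptions: the file contents, the column names (age, sex, score) and the classMap lookup are hypothetical stand-ins, and only the VarName/TypeVar layout comes from the question. Columns labelled 'Mix' would still need their copy made after import.

```r
# Hypothetical miniature of VarLabels; only the VarName/TypeVar layout is from the question
VarLabels <- data.frame(VarName = c("age", "sex", "score"),
                        TypeVar = c("Quant", "Qual", "Quant"),
                        stringsAsFactors = FALSE)

# A small semicolon-separated file standing in for the real data
path <- tempfile(fileext = ".csv")
writeLines(c("age;sex;score", "31;1;12", "45;2;9"), path)

# Translate each label into a class name understood by colClasses;
# labels missing from the map fall back to "numeric" (an assumption)
classMap <- c(Quant = "numeric", Qual = "factor")
colClasses <- unname(classMap[VarLabels$TypeVar])
colClasses[is.na(colClasses)] <- "numeric"

# Naming the vector lets read.csv2 match classes to columns by name
names(colClasses) <- VarLabels$VarName

Data <- read.csv2(path, colClasses = colClasses)
str(Data)
```

Because the classes are matched by name, the row order of VarLabels does not have to follow the column order of the file.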

1 Answer

The following function does what the question asks for.
It matches the column names of the input data set X with the new column types using which/match, without needing explicit for loops; the coercion itself is performed with lapply.
The test data set is the built-in data set mtcars.

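coerceCols <- function(X, VarLabels){
  # Positions in X of the columns labelled `type`; names absent from X are dropped
  colsOfType <- function(type){
    i <- which(VarLabels$TypeVar == type)
    j <- match(VarLabels$VarName[i], names(X))
    j[!is.na(j)]
  }

  # "Qual": convert to factor in place
  j <- colsOfType("Qual")
  if (length(j) > 0) X[j] <- lapply(X[j], factor)

  # "Mix": keep a copy of the original values named <col>_Num, then convert to factor
  j <- colsOfType("Mix")
  if (length(j) == 0) return(X)  # otherwise paste() would yield the single name "_Num"
  tmp <- X[j]
  names(tmp) <- paste(names(tmp), "Num", sep = "_")
  X[j] <- lapply(X[j], factor)

  cbind(X, tmp)
}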

Data <- mtcars
VarLabels <- data.frame(VarName = names(mtcars),
                        TypeVar = c("Quant", "Mix", "Quant",
                                    "Quant", "Quant", "Quant",
                                    "Quant", "Qual", "Qual", 
                                    "Mix", "Mix"),
                        stringsAsFactors = FALSE)

coerceCols(Data, VarLabels)
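
As a hedged variant of the same idea, the sketch below also coerces the "Quant" columns to numeric, as the loop in the question does, and ignores labels that do not name a column of X. The name coerceByLabel and the helper ofType are hypothetical, and unknown labels are left untouched here, which differs from the question's fallback to numeric.

```r
# Hypothetical variant: converts by column name and also handles "Quant"
coerceByLabel <- function(X, VarLabels) {
  # Keep only the labels whose VarName is an actual column of X
  labs <- VarLabels[VarLabels$VarName %in% names(X), ]
  # Column names carrying one of the given labels; %in% never returns NA
  ofType <- function(type) as.character(labs$VarName[labs$TypeVar %in% type])

  # "Mix": store the numeric copy first, under <name>_Num
  for (nm in ofType("Mix")) {
    X[[paste(nm, "Num", sep = "_")]] <- as.numeric(X[[nm]])
  }
  # "Qual" and "Mix": convert the original column to a factor
  for (nm in ofType(c("Qual", "Mix"))) X[[nm]] <- factor(X[[nm]])
  # "Quant": make sure the column is numeric
  for (nm in ofType("Quant")) X[[nm]] <- as.numeric(X[[nm]])
  X
}

# Same test labels as above
VarLabels <- data.frame(VarName = names(mtcars),
                        TypeVar = c("Quant", "Mix", "Quant",
                                    "Quant", "Quant", "Quant",
                                    "Quant", "Qual", "Qual",
                                    "Mix", "Mix"),
                        stringsAsFactors = FALSE)

out <- coerceByLabel(mtcars, VarLabels)
sapply(out, class)  # one class per column, including the new *_Num copies
```

Inspecting the result with sapply(out, class) is a quick way to confirm that every column ended up with the intended class.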
Rui Barradas
  • Thanks a lot, it works exactly as intended! I tried it with mtcars (your example) and iris to be sure, and all is good... but it does not work on my own dataframe :( I get the error 'names' attribute [1] must be the same length as the vector [0]. Both are of length 297 though, I do not get it... – Maxence Dum. Mar 19 '20 at 09:42
  • Managed to overcome the issue, everything operates smoothly, thank you so much for your help! – Maxence Dum. Mar 19 '20 at 19:40
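
The thread does not say what caused the error or how it was resolved. One plausible cause, offered only as an assumption, is that no row of VarLabels carried exactly the label "Mix" (a different spelling such as "MIX", for instance), so the selection of Mix columns was empty. The sketch below reproduces that situation on mtcars and adds two quick checks; the small VarLabels table in it is hypothetical.

```r
# A zero-column selection, as happens when no label equals "Mix"
tmp <- mtcars[integer(0)]

# paste() treats the zero-length input as "", so one name comes back instead of none
newNames <- paste(names(tmp), "Num", sep = "_")
newNames  # "_Num"

# Assigning one name to zero columns fails with the 'names' attribute error
res <- try(names(tmp) <- newNames, silent = TRUE)
inherits(res, "try-error")

# Hypothetical labels showing two common mismatches
VarLabels <- data.frame(VarName = c("mpg", "cyl", "typo"),
                        TypeVar = c("Quant", "MIX", "Qual"),
                        stringsAsFactors = FALSE)

# Check 1: list the spellings actually present in TypeVar
table(VarLabels$TypeVar)

# Check 2: labels whose VarName has no matching column in the data
missingCols <- setdiff(VarLabels$VarName, names(mtcars))
missingCols
```

Running both checks against the real Data and VarLabels before calling the function should show whether the labels and column names line up.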