
I've been trying to automate part of my workflow with R. Periodically I have to apply transformations to the datasets I am working with.

I have already created a small function that takes an optional argument, so that one can transform all or only some of the columns of the dataframe passed to it.

The function looks like this now:

# Function:
#   transformDivideThousand(dataframe, optional = vectorListOfVariables)
#
# Definition: This function applies a transformation, dividing variables by
# 1000. If the vector is not passed, it applies the transformation to all
# numeric variables in the dataframe.
#
# Example: df <- transformDivideThousand(cases, c("label1", "label2"))
#
# Source: http://stackoverflow.com/a/36912017/4417072

transformDivideThousand <- function(data_frame, listofvars){
    if (missing(listofvars)) {
        data_frame[, sapply(data_frame, is.numeric)] =
            data_frame[, sapply(data_frame, is.numeric)]/1000
    } else {
        for (i in names(data_frame)) {
            if (i %in% listofvars) {
                data_frame[,i] = data_frame[,i]/1000
            }
        }
    }
    return(data_frame)
}

OK, now I face a problem where I have to apply a fairly complex transformation. This time it should:

  1. Reflect the scores stored in the variables (i.e., find the largest value in each variable and subtract it from all of that variable's values);
  2. Add one to the resulting score;
  3. Take the square root of the resulting score;
  4. De-reflect the scores (add back the same value that was subtracted in the first step).

All this should happen while keeping the ability to run the function on all or only some of the columns of the given dataset.

On SO I found a small function that returns the largest value of each column of a dataframe:

colMax <- function(data) sapply(data, max, na.rm = TRUE)

But I am running into all sorts of problems when adapting transformDivideThousand to use it.

Problem

I am really struggling with the code. So far, while trying to model the problem, I have reached the following point:

transformPlusOneSqrt <- function(data_frame, listofvars){
    if (missing(listofvars)) {

        # Find the largest value
        data_frame_max <- data_frame
        colMax <- function(data) sapply(data, max)
        data_frame_max <- colMax(data_frame_max)

        # Subtract the previous value
        data_frame[, sapply(data_frame, is.numeric)] =
            data_frame[, sapply(data_frame, is.numeric)] -
            data_frame_max[,sapply(data_frame_max, is.numeric)]

        # Plus one
        data_frame[, sapply(data_frame, is.numeric)] =
            data_frame[, sapply(data_frame, is.numeric)] + 1

        # Sqrt
        data_frame[, sapply(data_frame, is.numeric)] =
            sqrt(data_frame[, sapply(data_frame, is.numeric)])

        # Now, dereflect
        data_frame[, sapply(data_frame, is.numeric)] =
            data_frame[, sapply(data_frame, is.numeric)] +
            data_frame_max[,sapply(data_frame_max, is.numeric)]

    } else {  ### This part is untouched
        for (i in names(data_frame)) {
            if (i %in% listofvars) {
                data_frame[,i] = data_frame[,i]/1000
            }
        }
    }
    return(data_frame)
}

But that does not work, as I am getting:

    > teste <- transformPlusOneSqrt(semDti)
    Error in Summary.factor(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,  :
      ‘max’ not meaningful for factors

Question

I would appreciate pointers on how to achieve this rather complex, multi-step transformation in one function. I am not looking for code, only pointers and suggestions.

Thanks.

lf_araujo

1 Answer


The problem is that max(), and therefore colMax(), is not defined for data of class factor, and your dataframe contains at least one factor column.
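A minimal way to reproduce this, assuming a made-up dataframe (`toy` and its columns are hypothetical and stand in for the asker's `semDti`):

```r
# Hypothetical toy data: one numeric column and one factor column.
toy <- data.frame(score = c(3, 9, 4),
                  group = factor(c("a", "b", "a")))

# max() works on the numeric column.
numeric_max <- max(toy$score)

# On the factor column max() signals an error, which is what colMax() runs into.
# tryCatch() captures the error message instead of stopping the script.
factor_result <- tryCatch(max(toy$group),
                          error = function(e) conditionMessage(e))

# is.numeric() tells the two kinds of column apart.
numeric_cols <- sapply(toy, is.numeric)
```

Here `numeric_cols` can be used to restrict colMax() to the columns where max() is meaningful.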

You have 2 choices:

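  1. Test for factor columns (if (is.factor(data_frame[, i]))) and convert them to numeric where appropriate. Note that as.numeric() on a factor returns the internal level codes, so go through as.character() first if the levels themselves are numbers.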

  2. Use this function, which returns the most frequent level of a factor (or all tied levels when mult = TRUE) rather than a numeric maximum:

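    MaxTable <- function(InVec, mult = FALSE) {
        # Coerce non-factor input so that levels() applies
        if (!is.factor(InVec)) InVec <- factor(InVec)
        # Count how often each level occurs
        A <- tabulate(InVec)
        if (isTRUE(mult)) {
            # Return every level tied for the highest count
            levels(InVec)[A == max(A)]
        } else {
            # Return only the first most frequent level
            levels(InVec)[which.max(A)]
        }
    }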
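Although the question asks only for pointers, a rough sketch may make the first choice more concrete. It rests on several assumptions that are not in the question: the names `reflectSqrt` and `transformReflectSqrt` are hypothetical, factor columns are simply skipped, and "reflect" is read as (max - x) rather than (x - max). The literal reading gives values at or below zero, so after adding one most of them would still be negative and sqrt() would return NaN. "De-reflect" is likewise assumed to mean (max - result), which restores the original ordering.

```r
# Hypothetical helper: applies the four steps to a single numeric vector.
# ASSUMPTION: reflect = max - x, de-reflect = max - result (see lead-in).
reflectSqrt <- function(x) {
    x_max <- max(x, na.rm = TRUE)
    reflected <- x_max - x      # step 1: reflect
    shifted <- reflected + 1    # step 2: add one
    rooted <- sqrt(shifted)     # step 3: square root
    x_max - rooted              # step 4: de-reflect
}

# Hypothetical wrapper with the same interface as transformDivideThousand.
transformReflectSqrt <- function(data_frame, listofvars) {
    numeric_cols <- names(data_frame)[sapply(data_frame, is.numeric)]
    if (missing(listofvars)) {
        target_cols <- numeric_cols
    } else {
        # Keep only requested columns that exist and are numeric.
        target_cols <- intersect(listofvars, numeric_cols)
    }
    if (length(target_cols) > 0) {
        data_frame[target_cols] <- lapply(data_frame[target_cols], reflectSqrt)
    }
    data_frame
}

# Made-up data: two numeric columns and one factor column.
toy <- data.frame(a = c(1, 5, 10),
                  b = c(2, 4, 6),
                  g = factor(c("x", "y", "x")))
transformed_all <- transformReflectSqrt(toy)       # all numeric columns
transformed_a <- transformReflectSqrt(toy, "a")    # only column "a"
```

Working column by column with lapply() keeps each column's maximum next to the values it belongs to, so no separate vector of maxima has to be lined up against the dataframe.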
    
Hack-R