I've been trying to automate part of my workflow with R. Periodically I have to apply transformations to the datasets I am working with.
I have already created a small function that uses optional arguments, so that one can transform all or part of the columns of the passed dataframe.
The function looks like this now:
# Function:
# transformDivideThousand(dataframe, optional = vectorListOfVariables)
#
# Definition: This function applies a transformation, dividing variables by
# 1000. If the vector is not passed, it applies the transformation to all
# numeric variables in the dataframe.
#
# Example: df <- transformDivideThousand (cases, c("label1","label2"))
#
# Source: http://stackoverflow.com/a/36912017/4417072
transformDivideThousand <- function(data_frame, listofvars){
  if (missing(listofvars)) {
    data_frame[, sapply(data_frame, is.numeric)] =
      data_frame[, sapply(data_frame, is.numeric)]/1000
  } else {
    for (i in names(data_frame)) {
      if (i %in% listofvars) {
        data_frame[,i] = data_frame[,i]/1000
      }
    }
  }
  return(data_frame)
}
Now I need to apply a more complex transformation. This time the function should:
- Reflect the scores stored in the variables (i.e., find the largest value and subtract it from all the other values);
- Add one to the resulting score;
- Take the square root of the resulting score;
- De-reflect the scores (add back the same value that was subtracted in the first step).
All of this should preserve the ability to run the function on all or only some of the columns of the given dataset.
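To make the steps concrete, here is a minimal sketch of how I understand them on a single numeric vector. The name reflectSqrt is hypothetical, and I am assuming the reflection is max(x) - x rather than x - max(x), because the latter produces negative values and sqrt() returns NaN for those:

```r
# Hypothetical helper (the name is my own): reflected square-root
# transform of one numeric vector.
# Assumption: "reflect" means max(x) - x, so every value is
# non-negative before the square root is taken.
reflectSqrt <- function(x) {
  x_max <- max(x, na.rm = TRUE)        # largest observed value
  reflected <- x_max - x               # reflect: the largest value becomes 0
  transformed <- sqrt(reflected + 1)   # add one, then take the square root
  x_max - transformed                  # de-reflect: restore the original ordering
}

result <- reflectSqrt(c(1, 4, 9))
```

This only covers one column; it does not yet handle the "all or some columns" part.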
On SO I found a small function that returns the largest value of each column of a data frame:
colMax <- function(data) sapply(data, max, na.rm = TRUE)
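For reference, colMax returns a named numeric vector (one maximum per column), not a data frame. A minimal sketch on a made-up data frame; df, col_max and reflected are names chosen for illustration, and using sweep() to pair each column with its own maximum is only one possibility, not something I have working in my function:

```r
colMax <- function(data) sapply(data, max, na.rm = TRUE)

# Toy numeric data frame, made up for illustration
df <- data.frame(a = c(1, 5, 3), b = c(10, NA, 30))

# Named numeric vector: one maximum per column
col_max <- colMax(df)

# sweep() with MARGIN = 2 works column-wise, so each column is
# paired with its own maximum (here: subtract it, as in the first step)
reflected <- sweep(df, 2, col_max, FUN = "-")
```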
But I am running into all sorts of problems when applying it in a function based on transformDivideThousand.
Problem
I am really struggling with the code. So far, trying to model the problem, I have reached the following point:
transformPlusOneSqrt <- function(data_frame, listofvars){
  if (missing(listofvars)) {
    # Find the largest value
    data_frame_max <- data_frame
    colMax <- function(data) sapply(data, max)
    data_frame_max <- colMax(data_frame_max)
    # Subtract the previous value
    data_frame[, sapply(data_frame, is.numeric)] =
      data_frame[, sapply(data_frame, is.numeric)] -
      data_frame_max[,sapply(data_frame_max, is.numeric)]
    # Plus one
    data_frame[, sapply(data_frame, is.numeric)] =
      data_frame[, sapply(data_frame, is.numeric)] + 1
    # Sqrt
    data_frame[, sapply(data_frame, is.numeric)] =
      sqrt(data_frame[, sapply(data_frame, is.numeric)])
    # Now, dereflect
    data_frame[, sapply(data_frame, is.numeric)] =
      data_frame[, sapply(data_frame, is.numeric)] +
      data_frame_max[,sapply(data_frame_max, is.numeric)]
  } else { ### This part is untouched
    for (i in names(data_frame)) {
      if (i %in% listofvars) {
        data_frame[,i] = data_frame[,i]/1000
      }
    }
  }
  return(data_frame)
}
But that does not work. I get the following error:
> teste<- transformPlusOneSqrt(semDti)
Error in Summary.factor(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, :
‘max’ not meaningful for factors
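As far as I can tell, the error comes from max being called on a factor column, since colMax runs over every column before any is.numeric filter is applied. A minimal reproduction on a made-up data frame (df, res, is_num and col_max are names chosen for illustration, not my real data):

```r
# Toy data frame, made up for illustration; `g` is a factor, like the
# column that seems to trigger the error in my data
df <- data.frame(a = c(1, 5, 3),
                 b = c(10, NA, 30),
                 g = factor(c("x", "y", "x")))

colMax <- function(data) sapply(data, max, na.rm = TRUE)

# Calling colMax on every column fails once it reaches the factor;
# tryCatch captures the error message instead of stopping
res <- tryCatch(colMax(df), error = function(e) conditionMessage(e))

# Restricting to the numeric columns first avoids calling max on the factor
is_num <- sapply(df, is.numeric)
col_max <- colMax(df[, is_num, drop = FALSE])
```

Here res holds the error message, while col_max holds the maxima of the numeric columns only.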
Question
I would appreciate pointers on how to achieve this rather complex, multi-step transformation in one function. I am not looking for code, only pointers and suggestions.
Thanks.