I have a two-part problem. I've searched Stack Overflow and found answers to related problems, but none of the variations I've tried have worked so far. Thanks in advance for any help!
I have a large data frame that contains many variables.
First, I want to (1) standardize a variable within groups defined by another variable (in my case, speaker), and (2) filter out the rows whose standardized value is more than 2 standard deviations away from the group mean. Both (1) and (2) can be handled by a single function using dplyr.
Second, I have many variables that need the same treatment, so I'm looking for a way to automate this, such as a for loop.
Problem 1: Writing a function containing dplyr functions
Here is a sample of what my data frame looks like:
df = data.frame(speaker=c("eng1","eng1","eng1","eng1","eng1","eng1","eng2","eng2","eng2","eng2","eng2"),
                ratio_means001=c(0.56,0.202,0.695,0.436,0.342,10.1,0.257,0.123,0.432,0.496,0.832),
                ratio_means002=c(0.66,0.203,0.943,0.432,0.345,0.439,0.154,0.234,NA,0.932,0.854))
Output:
   speaker ratio_means001 ratio_means002
1     eng1          0.560          0.660
2     eng1          0.202          0.203
3     eng1          0.695          0.943
4     eng1          0.436          0.432
5     eng1          0.342          0.345
6     eng1         10.100          0.439
7     eng2          0.257          0.154
8     eng2          0.123          0.234
9     eng2          0.432             NA
10    eng2          0.496          0.932
11    eng2          0.832          0.854
Below is the basic code I want to turn into a function:
library(dplyr)
standardized_data = group_by(df, speaker) %>%
  mutate(zRatio1 = as.numeric(scale(ratio_means001))) %>%
  filter(!abs(zRatio1) > 2)
So that the data frame will now look like this (for example):
   speaker ratio_means001 ratio_means002    zRatio1
    (fctr)          (dbl)          (dbl)      (dbl)
1     eng1          0.560          0.660 -0.3792191
2     eng1          0.202          0.203 -0.4699781
3     eng1          0.695          0.943 -0.3449943
4     eng1          0.436          0.432 -0.4106552
5     eng1          0.342          0.345 -0.4344858
6     eng2          0.257          0.154 -0.6349445
7     eng2          0.123          0.234 -1.1325034
8     eng2          0.432             NA  0.0148525
9     eng2          0.496          0.932  0.2524926
10    eng2          0.832          0.854  1.5001028
Here is what I have in terms of a function so far. The mutate part works, but I've been struggling with adding the filter part:
library(lazyeval)
standardize_variable = function(col1, new_col_name) {
  mutate_call = lazyeval::interp(b = interp(~ scale(a)), a = as.name(col1))
  group_by(data,speaker) %>%
    mutate_(.dots = setNames(list(mutate_call), new_col_name)) %>%
    filter_(interp(~ !abs(b) > 2, b = as.name(new_col_name))) # this part does not work
}
I receive the following error when I try to run the function:
data = standardize_variable("ratio_means001","zRatio1")
Error in substitute_(`_obj`[[2]], values) :
argument "_obj" is missing, with no default
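For reference, the error seems to come from the outer interp() call on the mutate_call line, which receives only named arguments (b and a) and no expression as its first argument. A minimal sketch of the same function without lazyeval is below. It assumes a newer dplyr than the one used above (one that supports the .data pronoun and :=). The data and cutoff arguments are illustrative additions and are not part of the attempt above.

```r
library(dplyr)

# Same sample data as above
df <- data.frame(
  speaker = c(rep("eng1", 6), rep("eng2", 5)),
  ratio_means001 = c(0.56, 0.202, 0.695, 0.436, 0.342, 10.1,
                     0.257, 0.123, 0.432, 0.496, 0.832),
  ratio_means002 = c(0.66, 0.203, 0.943, 0.432, 0.345, 0.439,
                     0.154, 0.234, NA, 0.932, 0.854)
)

# Hypothetical rewrite: `data` and `cutoff` are assumed arguments
standardize_variable <- function(data, col, new_col, cutoff = 2) {
  data %>%
    group_by(speaker) %>%
    # z-score the column named by `col` within each speaker
    mutate(!!new_col := as.numeric(scale(.data[[col]]))) %>%
    # same condition as the basic code; rows with an NA z-score are dropped too
    filter(!abs(.data[[new_col]]) > cutoff) %>%
    ungroup()
}

standardized_data <- standardize_variable(df, "ratio_means001", "zRatio1")
```

On the sample data this drops only the eng1 row with ratio_means001 = 10.1, whose z-score is just above 2.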
Problem 2: Looping over the function
There are many variables that I'd like to apply the above function to, so I would like to automate the process with a loop or some other helper. The variable names differ only in the number at the end, so I have come up with something like this:
d <- data.frame()
for (i in 1:2) {
  col1 <- paste("ratio_means00", i, sep = "")
  new_col <- paste("zRatio", i, sep = "")
  d <- rbind(d, standardize_variable(col1, new_col))
}
However, I get the following error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
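This error is consistent with each call returning a differently named z-score column (zRatio1, then zRatio2), so rbind cannot line the columns up. One possible alternative to stacking the results is to add all the z-score columns in one pass and then filter once. The sketch below assumes dplyr 1.0.4 or later (for across() and if_all()), and the z_ column prefix is an illustrative choice rather than the zRatio naming used above.

```r
library(dplyr)

# Same sample data as above
df <- data.frame(
  speaker = c(rep("eng1", 6), rep("eng2", 5)),
  ratio_means001 = c(0.56, 0.202, 0.695, 0.436, 0.342, 10.1,
                     0.257, 0.123, 0.432, 0.496, 0.832),
  ratio_means002 = c(0.66, 0.203, 0.943, 0.432, 0.345, 0.439,
                     0.154, 0.234, NA, 0.932, 0.854)
)

standardized <- df %>%
  group_by(speaker) %>%
  # add one z-score column per ratio_means* column, computed within speaker
  mutate(across(starts_with("ratio_means"),
                ~ as.numeric(scale(.x)),
                .names = "z_{.col}")) %>%
  # keep rows where every z-score is within 2 SD;
  # a row with an NA z-score in any column is dropped as well
  filter(if_all(starts_with("z_"), ~ !abs(.x) > 2)) %>%
  ungroup()
```

Note that every z-score here is computed on the full data before any row is removed, which is not the same as filtering on one variable and then standardizing the next.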
Thanks again for any help on these issues!