
I have a two-part problem. I've searched all over Stack Overflow and found answers related to my problems, but no variation I've tried has worked yet. Thanks in advance for any help!

I have a large data frame that contains many variables.

First, I want to (1) standardize a variable by another variable (in my case, speaker), and (2) filter out values after the variable has been standardized (greater than 2 standard deviations away from the mean). (1) and (2) can be taken care of by a function using dplyr.

Second, I have many variables I want to do this for, so I'm trying to find an automated way to do this, such as with a for loop.

Problem 1: Writing a function containing dplyr functions

Here is a sample of what my data frame looks like:

df = data.frame(speaker=c("eng1","eng1","eng1","eng1","eng1","eng1","eng2","eng2","eng2","eng2","eng2"),
            ratio_means001=c(0.56,0.202,0.695,0.436,0.342,10.1,0.257,0.123,0.432,0.496,0.832),
            ratio_means002=c(0.66,0.203,0.943,0.432,0.345,0.439,0.154,0.234,NA,0.932,0.854))

Output:

     speaker ratio_means001 ratio_means002
1     eng1          0.560          0.660
2     eng1          0.202          0.203
3     eng1          0.695          0.943
4     eng1          0.436          0.432
5     eng1          0.342          0.345
6     eng1         10.100          0.439
7     eng2          0.257          0.154
8     eng2          0.123          0.234
9     eng2          0.432             NA
10    eng2          0.496          0.932
11    eng2          0.832          0.854

Below is the basic code I want to turn into a function:

standardized_data = group_by(df, speaker) %>%
    mutate(zRatio1 = as.numeric(scale(ratio_means001))) %>%
    filter(!abs(zRatio1) > 2)

So that the data frame will now look like this (for example):

     speaker ratio_means001 ratio_means002   zRatio1
     (fctr)          (dbl)          (dbl)     (dbl)
 1     eng1          0.560          0.660 -0.3792191
 2     eng1          0.202          0.203 -0.4699781
 3     eng1          0.695          0.943 -0.3449943
 4     eng1          0.436          0.432 -0.4106552
 5     eng1          0.342          0.345 -0.4344858
 6     eng2          0.257          0.154 -0.6349445
 7     eng2          0.123          0.234 -1.1325034
 8     eng2          0.432             NA  0.0148525
 9     eng2          0.496          0.932  0.2524926
 10    eng2          0.832          0.854  1.5001028

Here is what I have in terms of a function so far. The mutate part works, but I've been struggling with adding the filter part:

library(lazyeval)
standardize_variable = function(col1, new_col_name) {
     mutate_call = lazyeval::interp(b = interp(~ scale(a)), a = as.name(col1))
     group_by(data,speaker) %>% 
     mutate_(.dots = setNames(list(mutate_call), new_col_name)) %>%
     filter_(interp(~ !abs(b) > 2, b = as.name(new_col_name))) # this part does not work
}

I receive the following error when I try to run the function:

data = standardize_variable("ratio_means001","zRatio1")

Error in substitute_(`_obj`[[2]], values) : 
argument "_obj" is missing, with no default

Problem 2: Looping over the function

There are many variables that I'd like to apply the above function to, so I would like to find a way to either use a loop or another helpful function to help automate this process. The variable names differ only in a number at the end, so I have come up with something like this:

d <- data.frame()
for(i in 1:2) 
{ 
 col1 <- paste("ratio_means00", i, sep = "")
 new_col <- paste("zRatio", i, sep = "")
 d <- rbind(d, standardize_variable(col1, new_col))
}

However, I get the following error:

 Error in match.names(clabs, names(xi)) : 
 names do not match previous names 

Thanks again for any help on these issues!

Amanda
  • please add a small example of your data – rawr Apr 11 '16 at 22:57
    One general suggestion is to start with pieces of what you want to do, then build it up. For instance, start with a smaller test data set that you might share here for purposes of reconstructing your problem. Along these lines, if you have two separate problems, then ask the first, get your function working, then move on to the second problem. – lmo Apr 11 '16 at 22:59
  • Sample data added. Thanks for your feedback Imo. I debated removing the second part from the problem, but the two pieces are interrelated for my particular problem, so I decided to leave the second part. There may be a simpler way to solve the problem altogether where the two steps could be combined into one (but I haven't figured that out yet). Thanks again for your replies! – Amanda Apr 12 '16 at 02:06
  • @Amanda Can you give an example of what the final dataset would look like? The use of `rbind` in your loop makes it look like you are hoping for a long dataset instead of a wide dataset. – aosmith Apr 12 '16 at 14:32
  • @aosmith The final dataset looks like your example output, but with 12 variables instead of the 2 example ones – Amanda Apr 13 '16 at 01:51

1 Answer


Alternative 1

I believe the main problem you were having with your function was that you were calling interp twice. Fixing that exposed an additional problem with filter, which I think was due to scale adding attributes (I'm using a development version of dplyr, dplyr_0.4.3.9001). Wrapping as.numeric around scale gets rid of that.

So with the fixes your function looks like:

standardize_variable = function(col1, new_col_name) {
    mutate_call = lazyeval::interp(~as.numeric(scale(a)), a = as.name(col1))
    group_by(df, speaker) %>% 
        mutate_(.dots = setNames(list(mutate_call), new_col_name)) %>%
        filter_(interp(~ !abs(b) > 2, b = as.name(new_col_name)))
}

I found the loop through the variables to be a bit more complicated than what you had, as I believe you want to merge your datasets back together once you make one for each variable. One option is to save them to a list and then use do.call with merge to get the final dataset.

d = list()
for(i in 1:2) { 
    col1 <- paste("ratio_means00", i, sep = "")
    new_col <- paste("zRatio", i, sep = "")
    d[[i]] = standardize_variable(col1, new_col)
}

do.call(merge, d)

  speaker ratio_means001 ratio_means002    zRatio1    zRatio2
1    eng1          0.202          0.203 -0.4699781 -1.1490444
2    eng1          0.342          0.345 -0.4344858 -0.6063693
3    eng1          0.436          0.432 -0.4106552 -0.2738853
4    eng1          0.560          0.660 -0.3792191  0.5974521
5    eng1          0.695          0.943 -0.3449943  1.6789806
6    eng2          0.123          0.234 -1.1325034 -0.7620572
7    eng2          0.257          0.154 -0.6349445 -0.9590348
8    eng2          0.496          0.932  0.2524926  0.9565726
9    eng2          0.832          0.854  1.5001028  0.7645194
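Note that do.call(merge, d) only merges cleanly when the list holds exactly two data frames; with three or more, the extra elements are passed to merge's `...` argument rather than merged. For the many-variable case (as came up in the comments), Reduce applies merge pairwise across the whole list. A minimal self-contained sketch, where d1/d2/d3 are made-up stand-ins for per-variable results from standardize_variable():

```r
# Hypothetical stand-ins for three per-variable results
d1 <- data.frame(speaker = c("eng1", "eng2"), zRatio1 = c(-0.38, 0.25))
d2 <- data.frame(speaker = c("eng1", "eng2"), zRatio2 = c(0.60, 0.96))
d3 <- data.frame(speaker = c("eng1", "eng2"), zRatio3 = c(1.10, -0.20))

# Reduce() merges the list two at a time: merge(merge(d1, d2), d3)
final <- Reduce(function(x, y) merge(x, y, all = TRUE), list(d1, d2, d3))
```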

Alternative 2

An alternative to all of this would be to use mutate_each and rename_ for the first part of the problem and then use interp with an lapply loop for the final filtering of all the scaled variables simultaneously.

In the code below I take advantage of the fact that mutate_each allows naming for single functions starting in dplyr_0.4.3.9001. Things look a bit complicated in rename_ because I was making the names you wanted for the new columns. To simplify things you could leave them ending in _z from mutate_each and save yourself the complicated step of rename_ with gsub and grepl.

df2 = df %>%
    group_by(speaker) %>%
    mutate_each(funs(z = as.numeric(scale(.))), starts_with("ratio_means00")) %>%
    rename_(.dots = setNames(names(.)[grepl("z", names(.))],
                             paste0("zR", gsub("r|_z|_means00", "", names(.)[grepl("z", names(.))]))))

Once that's done, you just need to filter on multiple columns. I think it's easiest to build the list of filter conditions using interp and lapply, and then pass that list to the .dots argument of filter_.

dots = lapply(names(df2)[starts_with("z", vars = names(df2))],
              function(y) interp(~abs(x) < 2, x = as.name(y)))

filter_(df2, .dots = dots)

Source: local data frame [9 x 5]
Groups: speaker [2]

  speaker ratio_means001 ratio_means002    zRatio1    zRatio2
   (fctr)          (dbl)          (dbl)      (dbl)      (dbl)
1    eng1          0.560          0.660 -0.3792191  0.5974521
2    eng1          0.202          0.203 -0.4699781 -1.1490444
3    eng1          0.695          0.943 -0.3449943  1.6789806
4    eng1          0.436          0.432 -0.4106552 -0.2738853
5    eng1          0.342          0.345 -0.4344858 -0.6063693
6    eng2          0.257          0.154 -0.6349445 -0.9590348
7    eng2          0.123          0.234 -1.1325034 -0.7620572
8    eng2          0.496          0.932  0.2524926  0.9565726
9    eng2          0.832          0.854  1.5001028  0.7645194

Alternative 3

I often find these problems most straightforward if I reshape the dataset instead of working across columns. For example, still using the newest version of mutate_each but skipping the renaming step for simplicity, you could gather all the standardized columns together using the gather function from tidyr and then filter on the new column.

library(tidyr)

df %>%
    group_by(speaker) %>%
    mutate_each(funs(z = as.numeric(scale(.))), starts_with("ratio_means00")) %>%
    gather(group, zval, ends_with("_z")) %>%
    filter(abs(zval) < 2)

# First 12 lines of output

Source: local data frame [20 x 5]
Groups: speaker [2]

   speaker ratio_means001 ratio_means002            group       zval
    <fctr>          <dbl>          <dbl>            <chr>      <dbl>
1     eng1          0.560          0.660 ratio_means001_z -0.3792191
2     eng1          0.202          0.203 ratio_means001_z -0.4699781
3     eng1          0.695          0.943 ratio_means001_z -0.3449943
4     eng1          0.436          0.432 ratio_means001_z -0.4106552
5     eng1          0.342          0.345 ratio_means001_z -0.4344858
6     eng2          0.257          0.154 ratio_means001_z -0.6349445
7     eng2          0.123          0.234 ratio_means001_z -1.1325034
8     eng2          0.432             NA ratio_means001_z  0.0148525
9     eng2          0.496          0.932 ratio_means001_z  0.2524926
10    eng2          0.832          0.854 ratio_means001_z  1.5001028
11    eng1          0.560          0.660 ratio_means002_z  0.5974521
12    eng1          0.202          0.203 ratio_means002_z -1.1490444
...

If the desired final form is the wide format, you can use spread (also from tidyr) for that. One advantage (to me) is that you can keep all values of one variable even when another variable failed the filtering step.

df %>%
    group_by(speaker) %>%
    mutate_each(funs(z = as.numeric(scale(.))), starts_with("ratio_means00")) %>%
    gather(group, zval, ends_with("_z")) %>%
    filter(abs(zval) < 2) %>%
    spread(group, zval)

Source: local data frame [11 x 5]
Groups: speaker [2]

   speaker ratio_means001 ratio_means002 ratio_means001_z ratio_means002_z
    <fctr>          <dbl>          <dbl>            <dbl>            <dbl>
1     eng1          0.202          0.203       -0.4699781       -1.1490444
2     eng1          0.342          0.345       -0.4344858       -0.6063693
3     eng1          0.436          0.432       -0.4106552       -0.2738853
4     eng1          0.560          0.660       -0.3792191        0.5974521
5     eng1          0.695          0.943       -0.3449943        1.6789806
6     eng1         10.100          0.439               NA       -0.2471337
7     eng2          0.123          0.234       -1.1325034       -0.7620572
8     eng2          0.257          0.154       -0.6349445       -0.9590348
9     eng2          0.432             NA        0.0148525               NA
10    eng2          0.496          0.932        0.2524926        0.9565726
11    eng2          0.832          0.854        1.5001028        0.7645194

If you don't want to keep the NA, you can always na.omit them at a later time.
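For instance, na.omit drops every row that contains an NA in any column; a toy base-R illustration (the data here is made up, not the example above):

```r
# na.omit() removes any row with an NA in any column
toy <- data.frame(a = c(1, NA, 3), b = c(2, 2, NA))
clean <- na.omit(toy)
# only the first row, which has no NAs, survives
```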

aosmith
  • Thanks so much for your helpful answer, aosmith! For Alternative 1, everything worked fine with the sample data, but the merge function didn't work with my real data when I had more than 2 variables to run through, but I found a way to get around this [link](http://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list). For Alternative two, I couldn't reproduce your results. I got stuck on df2, with this error: Error in gsub("r|_z|_means00", "", names(.)[ends_with("z", vars = names(.))]) : could not find function "ends_with" – Amanda Apr 12 '16 at 04:58
  • @Amanda I'm using the development version of *dplyr*, see [here](https://github.com/hadley/dplyr). This allows for naming in `mutate_each` and the use of the `select` helper functions. – aosmith Apr 12 '16 at 14:29
  • Ah, I understand now. Thanks for both of your elegant solutions! – Amanda Apr 12 '16 at 20:19