0

When I create transformed variables in a dataframe (e.g., standardized versions of existing variables), it would often be convenient if they were placed alongside their "parent" variables in this manner (e.g., varname1, varname1_z, varname2, varname2_z, etc.). But the default placement of new variables is at the end of the dataframe.

Is there a way to efficiently place newly created variables alongside their "parent" variable, so as to keep the data better organized?

I can currently change variable positions in the dataframe using select() and a custom function called moveme(), but I am trying to make this variable placement process more automated, so that variables are placed as they are created.

In the code example below you'll see all of my newly created "_z" variables at the end of the dataframe. Is there a way I could place them alongside their unstandardized parent in an automated manner?

Thank you for any support.

library(tidyverse)

mpg %>% 
  mutate_if(is.numeric, funs(z = scale(.) %>% as.double()))
Joe
  • 3,217
  • 3
  • 21
  • 37
  • This doesn't really answer your question, but you should think about whether you really want to use wide-formatted data in the tidyverse. One way to get z-scores for everything right next to the source is to have only two columns: one for the raw variables and one for the z-scores. Like this: `numeric_cols <- mpg %>% select_if(is.numeric) %>% names; mpg %>% group_by_if(not_numeric) %>% gather(quantity, value, !! numeric_cols) %>% group_by(quantity) %>% mutate(z_score = scale(value))` – Curt F. Nov 06 '17 at 20:57
  • Thank you for the thoughtful suggestion @Curt F. Maybe I'm hung up, but I think it is often useful (even in the tidyverse) to have variable order reflect the data collection order. Z-scored variables were really just an example. In cleaning a survey dataset I often need to reverse-score variables in order to get to a dataset that is both tidy and ready for analysis. – Joe Nov 06 '17 at 21:26
  • I'll post an example of how you can get the arrangement you want by going through multiple rounds of spread and gather. – Curt F. Nov 06 '17 at 21:35

1 Answers1

1

One approach is to:

  1. First gather() your data into a "long" format, grouping as required
  2. Do data transformations (e.g. create z-scores) to add a new column
  3. gather() again to put new and old data columns in a single column
  4. spread() the data into a format that you like.

Here's an example:

numeric_cols <- mpg %>% select_if(is.numeric) %>% names

mpg %>%
    rownames_to_column %>%
    mutate(rowname = rowname %>% as.numeric) %>%
    gather(param, quantity, !! numeric_cols) %>%
    group_by(param) %>%
    mutate(z_score = scale(quantity)) %>%
    gather(number_type, value, quantity:z_score)  %>%
    unite(new_col_name, param, number_type) %>%
    spread(new_col_name, value)

This results in a data frame that has a head() of

  rowname manufacturer model trans      drv fl class   cty_quantity cty_z_score
1 1       audi         a4    auto(l5)   f   p  compact 18            0.2681016 
2 2       audi         a4    manual(m5) f   p  compact 21            0.9729978 
3 3       audi         a4    manual(m6) f   p  compact 20            0.7380324 
4 4       audi         a4    auto(av)   f   p  compact 21            0.9729978 
5 5       audi         a4    auto(l5)   f   p  compact 16           -0.2018293 
6 6       audi         a4    manual(m5) f   p  compact 18            0.2681016 
  cyl_quantity cyl_z_score displ_quantity displ_z_score hwy_quantity
1 4            -1.1721058  1.8            -1.2939999    29          
2 4            -1.1721058  1.8            -1.2939999    29          
3 4            -1.1721058  2.0            -1.1391962    31          
4 4            -1.1721058  2.0            -1.1391962    30          
5 6             0.0689474  2.8            -0.5199816    26          
6 6             0.0689474  2.8            -0.5199816    26          
  hwy_z_score year_quantity year_z_score
1 0.9336964   1999          -0.997861   
2 0.9336964   1999          -0.997861   
3 1.2695687   2008           0.997861   
4 1.1016326   2008           0.997861   
5 0.4298879   1999          -0.997861   
6 0.4298879   1999          -0.997861   

I think that has the form you want. A couple of notes:

  • The group_by is necessary so that z-scores are created for each parameter, instead of calculating a nonsense z-score over all numbers in the dataset
  • The columns are arranged this way because we did unite(..., param, number_type) instead of unite(..., number_type, param). The latter would have grouped all untransformed columns together, followed by all z-score columns.
  • The rownames_to_column line is required because otherwise the rows of the starting dataframe are not unique, which gives spread() problems when trying to figure out how to re-form the wide data.
Curt F.
  • 4,690
  • 2
  • 22
  • 39
  • That is some impressive data wrangling @ Curt F. Do you have any suggestions about performing these tactics when you only to transform some of the variables? I often need to reverse code about 1/3 of my variables because of their valence. – Joe Nov 06 '17 at 23:24