
Suppose I have a dataframe with many columns which can be matched into pairs.

E.g.

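library(tidyverse)

df <- tibble(x = rnorm(1000), y = rnorm(1000))

# Add a scaled pair of columns, x.n and y.n, for a given n
create_many_columns <- function(df, n) {
  varname1 <- paste("x", n, sep = ".")
  varname2 <- paste("y", n, sep = ".")
  df %>%
    mutate(!!varname1 := x * n) %>%
    mutate(!!varname2 := y * n)
}

# Build x.2, y.2, ..., x.11, y.11
for (n in 2:11) {
  df <- create_many_columns(df, n)
}

df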

The columns can clearly be matched into pairs (x.n and y.n):

# A tibble: 1,000 x 22
        x       y    x.2     y.2   x.3     y.3   x.4    y.4   x.5    y.5   x.6     y.6    x.7     y.7    x.8     y.8
    <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
 1 -1.57   0.597  -3.14   1.19   -4.71  1.79   -6.28  2.39  -7.85  2.99  -9.42   3.58  -11.0    4.18  -12.6    4.78 
 2 -1.20   1.02   -2.40   2.03   -3.60  3.05   -4.80  4.06  -6.00  5.08  -7.20   6.10   -8.40   7.11   -9.60   8.13 
 3  1.16  -0.304   2.32  -0.609   3.47 -0.913   4.63 -1.22   5.79 -1.52   6.95  -1.83    8.10  -2.13    9.26  -2.44 
 4  0.870 -1.73    1.74  -3.45    2.61 -5.18    3.48 -6.90   4.35 -8.63   5.22 -10.4     6.09 -12.1     6.96 -13.8  
 5  0.621  1.89    1.24   3.78    1.86  5.68    2.48  7.57   3.11  9.46   3.73  11.4     4.35  13.2     4.97  15.1  
 6 -0.970  0.347  -1.94   0.694  -2.91  1.04   -3.88  1.39  -4.85  1.74  -5.82   2.08   -6.79   2.43   -7.76   2.78 
 7  0.453  0.0866  0.906  0.173   1.36  0.260   1.81  0.346  2.26  0.433  2.72   0.520   3.17   0.606   3.62   0.693
 8 -0.840 -0.956  -1.68  -1.91   -2.52 -2.87   -3.36 -3.82  -4.20 -4.78  -5.04  -5.73   -5.88  -6.69   -6.72  -7.64 
 9 -0.938 -0.967  -1.88  -1.93   -2.81 -2.90   -3.75 -3.87  -4.69 -4.83  -5.63  -5.80   -6.57  -6.77   -7.51  -7.73 
10 -0.551  0.0267 -1.10   0.0535 -1.65  0.0802 -2.21  0.107 -2.76  0.134 -3.31   0.160  -3.86   0.187  -4.41   0.214
# … with 990 more rows, and 6 more variables: x.9 <dbl>, y.9 <dbl>, x.10 <dbl>, y.10 <dbl>, x.11 <dbl>, y.11 <dbl>

I want to add a sequence of columns, each of which is the product of one matched pair. E.g.

for(i in 2:11){
  df[[paste0("z.", i)]] = df[[paste0("x.", i)]] * df[[paste0("y.", i)]]
}

df %>% select(contains("z"))
# A tibble: 1,000 x 10
       z.2     z.3    z.4    z.5     z.6     z.7     z.8     z.9    z.10    z.11
     <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 -1.44   -3.25   -5.78  -9.02  -13.0   -17.7   -23.1   -29.2   -36.1   -43.7  
 2  0.865   1.95    3.46   5.41    7.79   10.6    13.8    17.5    21.6    26.2  
 3  0.972   2.19    3.89   6.07    8.75   11.9    15.6    19.7    24.3    29.4  
 4  3.54    7.96   14.2   22.1    31.9    43.4    56.6    71.7    88.5   107.   
 5 -0.298  -0.671  -1.19  -1.86   -2.68   -3.65   -4.77   -6.04   -7.45   -9.02 
 6  4.10    9.22   16.4   25.6    36.9    50.2    65.5    82.9   102.    124.   
 7  3.61    8.12   14.4   22.6    32.5    44.2    57.8    73.1    90.2   109.   
 8 -1.17   -2.64   -4.69  -7.33  -10.5   -14.4   -18.8   -23.7   -29.3   -35.5  
 9  1.52    3.42    6.08   9.50   13.7    18.6    24.3    30.8    38.0    46.0  
10 -0.0328 -0.0738 -0.131 -0.205  -0.295  -0.402  -0.525  -0.665  -0.820  -0.993
# … with 990 more rows

This solution is fine if I don't mind cluttering my code with loops. But I do, since I have to apply this type of transformation regularly. Are there any options for writing it in a more parsimonious way?

For instance, if I wanted to take the exponential of every element of the "x" columns, I could do

df %>%
  mutate_at(vars(contains("x")), exp)

rather than write a loop like

for(i in 2:11){
  df[[paste0("x.", i)]] = exp(df[[paste0("x.", i)]] )
}

For the initial example, I would expect there to be an option to write something like

df %>% mutate(z.n = x.n * y.n, n = 2:11)
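
One loop-free possibility, sketched here in base R and assuming the columns follow the x.n / y.n naming used above, is to select the two groups of columns by name and multiply the resulting data frames elementwise. The names `idx`, `x_cols`, `y_cols` and `z_cols` are illustrative only, and a small plain data.frame stands in for the tibble from the question.

```r
# Small stand-in for the data in the question (plain data.frame,
# with pairs x.n / y.n for n = 2..4 instead of 2..11).
set.seed(42)
df <- data.frame(x = rnorm(5), y = rnorm(5))
idx <- 2:4

x_cols <- paste0("x.", idx)
y_cols <- paste0("y.", idx)
z_cols <- paste0("z.", idx)

# Build the scaled columns x.n = x * n and y.n = y * n.
df[x_cols] <- lapply(idx, function(n) df$x * n)
df[y_cols] <- lapply(idx, function(n) df$y * n)

# Multiplying two data frames of equal shape is elementwise, so each
# z.n column becomes x.n * y.n without an explicit loop over n.
df[z_cols] <- df[x_cols] * df[y_cols]

print(names(df))
```

This relies on `x_cols` and `y_cols` listing the pairs in the same order, so the name vectors should be built from the same index vector rather than from pattern matching on column names.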

Y.Olshanskiy
  • The `dplyr` way would be to tidy your data into a long format, so that you're working on groups of rows, not groups of columns. – Gregor Thomas Oct 23 '19 at 01:39
  • Thanks, @Gregor. Would that be efficient if I have big data and there are also many other columns? Let's say in the original example, `T = 10000000; df = tibble(x = rnorm(T), y = rnorm(T), l1 = rnorm(T), l2 = rnorm(T), ..., l20 = rnorm(T))` and I don't want to affect `l1,...,l20`. If it is, could you please give a short code answer showing how to do this? – Y.Olshanskiy Oct 23 '19 at 01:58
  • (a) **please** never use `T` as a variable. `T` is `TRUE`. (b) no, it wouldn't be efficient in that case. But it would be the idiomatic way to do it with `dplyr`. See the [Tidy Data vignette](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) for more info. (c) For code examples, see the [FAQ on converting data from wide to long](https://stackoverflow.com/q/2185252/903061) – Gregor Thomas Oct 23 '19 at 02:05
  • And (d), if your data is that big and you care about efficiency, you should be using `data.table` anyway, not `dplyr`, and especially not base. – Gregor Thomas Oct 23 '19 at 02:07
  • @Gregor, thank you for your answer. That's a fair comment about `T`. If you mean using the `melt` or `gather` functions, I'm not sure that is an efficient way. I probably have to use simple loops. But tidy data helps in the general case, so thank you for pointing it out. – Y.Olshanskiy Oct 23 '19 at 02:19
  • And thank you for mentioning `data.table`; maybe I have to switch. I will read more about its speed compared with the `-plyr` libraries. – Y.Olshanskiy Oct 23 '19 at 02:24
  • `data.table` isn't just fast compared to `dplyr`. It's fast [compared to everything](https://h2oai.github.io/db-benchmark/) – Gregor Thomas Oct 23 '19 at 02:53
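
The long-format approach suggested in the comments above might look roughly like the sketch below. It assumes tidyr 1.0.0 or later (for `pivot_longer`) and uses a small hypothetical tibble `wide` with an `id` column. The unscaled `x` and `y` columns are left out because the pivoted columns reuse those names.

```r
library(dplyr)
library(tidyr)

# Hypothetical small example: one id column plus the pairs x.2/y.2 and x.3/y.3.
set.seed(1)
wide <- tibble(
  id  = 1:5,
  x.2 = rnorm(5), y.2 = rnorm(5),
  x.3 = rnorm(5), y.3 = rnorm(5)
)

# Split each name at the dot: the part before it becomes a value column
# (x or y) via ".value", the part after it goes into a key column n.
long <- wide %>%
  pivot_longer(
    cols = matches("^[xy]\\.\\d+$"),
    names_to = c(".value", "n"),
    names_sep = "\\."
  ) %>%
  mutate(z = x * y)

print(long)
```

Each row of `long` then holds one (id, n) combination, so the product is a single `mutate` call. As noted in the comments, this reshaping copies data and may not be efficient for very large tables.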
