How to separate each column name of a matrix by the +

Question

I have built a matrix whose column names are those of a subset of regressors that I want to insert into a regression model formula in R. For example:

data$age is the response variable

X is the design matrix whose column names are, for example, data$education and data$wage.

The problem is that the column names of X are not fixed (i.e. I don't know in advance which ones they are), so I tried this:

best_model <- lm(data$age ~ paste(colnames(x[, GA@solution == 1]), sep = "+"))

But it doesn't work.

Here is a similar question: https://stackoverflow.com/questions/9238038/passing-a-vector-of-variables-into-lm-formula. The question and the solution there seem to answer your question. — mt1022, Jan 16 '19 at 12:30
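
For reference, the approach from the linked question can be sketched in base R as follows. The call in the question fails for two reasons: paste(..., sep = "+") applied to a single character vector does not join its elements (collapse does that), and lm() needs a formula object, not a character string on the right-hand side of ~. The data frame dat, the matrix X, and the 0/1 vector sel (standing in for GA@solution) are made up for illustration.

```r
# Hypothetical data: `dat` holds the response and the candidate regressors.
set.seed(1)
dat <- data.frame(
  age       = rnorm(50, mean = 40, sd = 10),
  education = rnorm(50),
  wage      = rnorm(50),
  tenure    = rnorm(50)
)
X <- as.matrix(dat[, c("education", "wage", "tenure")])

# Stand-in for GA@solution: 1 marks a selected column.
sel  <- c(1, 1, 0)
vars <- colnames(X)[sel == 1]

# collapse (not sep) joins the elements of one vector into a single string.
rhs <- paste(vars, collapse = " + ")

# Two equivalent ways to get a real formula object.
f1 <- as.formula(paste("age ~", rhs))
f2 <- reformulate(vars, response = "age")  # does the pasting internally

best_model <- lm(f2, data = dat)
names(coef(best_model))
```

Passing data = dat and using bare column names also avoids the data$age style inside the formula, which makes later calls such as predict() awkward.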

younggeun · Accepted Answer · 2019-01-16T14:15:09.317

Rather than writing the formula yourself, you can use the pipe (%>%) together with dplyr::select(). (For this, convert your matrix to a data frame first.)

library(tidyverse)
mpg
#> # A tibble: 234 x 11
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
#>  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
#>  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
#>  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
#>  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
#>  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
#>  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
#>  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
#>  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
#> 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
#> # ... with 224 more rows

Select

dplyr::select() subsets columns.

mpg %>% 
  select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
  lm(hwy ~ ., data = .)
#> 
#> Call:
#> lm(formula = hwy ~ ., data = .)
#> 
#> Coefficients:
#>            (Intercept)   manufacturerchevrolet       manufacturerdodge  
#>                2.65526                -1.08632                -2.55442  
#>       manufacturerford       manufacturerhonda     manufacturerhyundai  
#>               -2.29897                -2.98863                -0.94980  
#>       manufacturerjeep  manufacturerland rover     manufacturerlincoln  
#>               -3.36654                -1.87179                -1.10739  
#>    manufacturermercury      manufacturernissan     manufacturerpontiac  
#>               -2.64828                -2.44447                 0.75427  
#>     manufacturersubaru      manufacturertoyota  manufacturervolkswagen  
#>               -3.04204                -2.73963                -1.62987  
#>                  displ                     cyl                     cty  
#>               -0.03763                 0.06134                 1.33805

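Note that -col.name excludes that column. In lm(hwy ~ ., data = .), the . in the formula stands for every column of the data other than the response, and data = . is the placeholder for the data frame passed along by %>%.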
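
To illustrate the exclusion syntax, here is a minimal sketch that drops columns instead of listing the ones to keep. It uses the built-in mtcars data rather than mpg so that only dplyr is needed; the choice of columns to drop is arbitrary.

```r
library(dplyr)

# mtcars ships with base R; the dropped columns are an arbitrary choice.
fit <- mtcars %>%
  select(-qsec, -vs, -am, -gear, -carb) %>%  # drop these, keep the rest
  lm(mpg ~ ., data = .)                      # mpg against all remaining columns

names(coef(fit))
```

Excluding is convenient when the set of unwanted columns is shorter than the set of regressors.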

Tidyselect

Many data sets group related columns with a common prefix or suffix, separated by an underscore.

nycflights13::flights
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      517            515         2      830
#>  2  2013     1     1      533            529         4      850
#>  3  2013     1     1      542            540         2      923
#>  4  2013     1     1      544            545        -1     1004
#>  5  2013     1     1      554            600        -6      812
#>  6  2013     1     1      554            558        -4      740
#>  7  2013     1     1      555            600        -5      913
#>  8  2013     1     1      557            600        -3      709
#>  9  2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

For instance, both dep_delay and arr_delay are delay times. Select helpers such as starts_with(), ends_with(), and contains() can pick out such groups of columns.

nycflights13::flights %>% 
  select(starts_with("sched"),
         ends_with("delay"),
         distance)
#> # A tibble: 336,776 x 5
#>    sched_dep_time sched_arr_time dep_delay arr_delay distance
#>             <int>          <int>     <dbl>     <dbl>    <dbl>
#>  1            515            819         2        11     1400
#>  2            529            830         4        20     1416
#>  3            540            850         2        33     1089
#>  4            545           1022        -1       -18     1576
#>  5            600            837        -6       -25      762
#>  6            558            728        -4        12      719
#>  7            600            854        -5        19     1065
#>  8            600            723        -3       -14      229
#>  9            600            846        -3        -8      944
#> 10            600            745        -2         8      733
#> # ... with 336,766 more rows

After that, just pipe the result into lm().

nycflights13::flights %>% 
  select(starts_with("sched"),
         ends_with("delay"),
         distance) %>% 
  lm(dep_delay ~ ., data = .)
#> 
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#> 
#> Coefficients:
#>    (Intercept)  sched_dep_time  sched_arr_time       arr_delay  
#>      -0.151408        0.002737        0.000951        0.816684  
#>       distance  
#>       0.001859
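
Applied to the setting in the question, the same idea might look like the sketch below, written in base R. The matrix X, the response age, and the 0/1 vector sel (playing the role of GA@solution) are simulated stand-ins, not the asker's actual objects.

```r
# Hypothetical stand-ins for the question's objects.
set.seed(42)
n   <- 100
X   <- cbind(education = rnorm(n), wage = rnorm(n), tenure = rnorm(n))
age <- 40 + 2 * X[, "education"] + rnorm(n)
sel <- c(1, 0, 1)   # plays the role of GA@solution

# Keep the selected columns (drop = FALSE keeps a matrix even if only one
# column is selected), convert to a data frame, and attach the response.
df     <- as.data.frame(X[, sel == 1, drop = FALSE])
df$age <- age

# The dot expands to every column of df except the response.
best_model <- lm(age ~ ., data = df)
names(coef(best_model))
```

Because the formula never names the regressors, nothing needs to change when the selection vector changes.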
