3

I have a tibble. I need to add a new column in which each value is a function of the corresponding values in several other columns. Here is an example:

library(tibble)
tmp <- tribble(
  ~ID,     ~x1,     ~x2,
    1,   "200",     NA,
    2,   "300",   "400")

I want to add a new column, new, that is TRUE if and only if any of the corresponding values in x1 and x2 start with "3". That is, I want

# A tibble: 2 x 4
     ID x1    x2    new  
  <dbl> <chr> <chr> <lgl>
1     1 200   <NA>  NA   
2     2 300   400   TRUE 

In this example, new is a function of only x1 and x2. But there may be many of these "x" columns, and I won't always be able to write out their names. They will always start with "x", though, so this is one solution:

tmp %>%
  mutate(
    new = select(., starts_with("x")) %>%
      apply(., 1, function (x) any(substr(x, 1, 1)=="3"))
  )

But this solution is pretty clunky. Is there a more elegant way?

There are many related questions on Stack Overflow, but they generally speak to cases in which (a) the names of all columns in the original dataset are known and can be written out, or (b) the new variable is a function of all other columns in the data frame. (Here is one example.)

user697473
  • 2,165
  • 1
  • 20
  • 47
  • (1) When you want to work on multiple columns at a time, my first go-to function is `mapply` (or `Map`), and a recent addition to the tidy-landscape is `dplyr::across`, though I'm not proficient enough to make a suggestion there. (2) It really helps us when your verbiage and intended output are clear and consistent; unfortunately, you say *"starts with '3'"* and then mark true when one starts with 3 and one with 4. Please clarify your sample data and intended output. – r2evans Apr 21 '20 at 01:46
  • 1
    Thank you, @r2evans. But where your point (2) is concerned, I don't see the problem. I wrote that the value in the new column should be `TRUE` "if and only if *any* of the corresponding values in `x1` and `x2` start with '3'." With that in mind, the second value in `new` is `TRUE` because the corresponding value of `x1` starts with "3". Am I missing something? – user697473 Apr 21 '20 at 01:56
  • 1
    My bad, I see "any" but previously thought "all". Thanks for straightening me out (I read that several times ... gotta get my eyes checked). – r2evans Apr 21 '20 at 02:49

2 Answers2

1

If you want to stay in tidyverse, we can use pmap for a row-wise operation :

library(dplyr)
library(purrr)

tmp %>% 
   mutate(new = pmap_lgl(select(., starts_with('x')), 
                ~any(startsWith(c(...), '3'), na.rm = TRUE)))

#     ID x1    x2    new  
#  <dbl> <chr> <chr> <lgl>
#1     1 200   NA    FALSE
#2     2 300   400   TRUE 

In base R, we can use row-wise apply

tmp$new <- apply(tmp[grep('x', names(tmp))], 1, function(x) 
                 any(startsWith(x, '3'), na.rm = TRUE))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you - this is helpful. That said, I don't understand why this command works on the "x" columns row-wise instead of column-wise. Can you help me understand? / I think that the following is true: (1) `select(., starts_with('x'))` returns a data frame of length 2 (because 2 columns). (2) `pmap_lgl()` treats this data frame as a 2-element list, with each element being a column. (3) `c(...)` isn't tidyverse; it's just the usual way of capturing arguments to a function in a list. / But if the arguments to the anon. function are the list columns, why is `any()` treating the arguments as rows? – user697473 Apr 21 '20 at 12:22
  • 1
    @user697473 `pmap` as the name suggests is parallel map. They treat arguments in a parallel fashion hence the list returned from `select(., starts_with('x'))` is treated in a parallel way meaning first value in `x1` and first value in `x2` are processed together which makes this row-wise. `any` has no role to play here, it is behaving in the usual way it does. Since there could be multiple arguments passed to amp function `c(...)` is a way in which we capture all the arguments together. – Ronak Shah Apr 21 '20 at 12:59
  • Thank you -- makes perfect sense. – user697473 Apr 21 '20 at 13:03
1

Here is an option with pivot_longer where we reshape into 'long' format with pivot_longer, do a group by 'ID' to check if there are any value that have 3 as the first digit and do a join with the original dataset

library(dplyr)
library(tidyr)
library(stringr)
tmp %>% 
   pivot_longer(cols = -ID, values_drop_na = TRUE) %>% 
   group_by(ID) %>%
   summarise(new = any(str_detect(value, '^3'))) %>% 
   right_join(tmp)
# A tibble: 2 x 4
#     ID new   x1    x2   
#* <dbl> <lgl> <chr> <chr>
#1     1 FALSE 200   <NA> 
#2     2 TRUE  300   400  

Or using base R, we can concatenate by row with paste and use grepl. Should be more efficient

grepl("(^|,)3", do.call(paste, c(tmp[-1], sep=",")))
#[1] FALSE  TRUE
akrun
  • 874,273
  • 37
  • 540
  • 662