0

I am trying to check the values of various binary columns and then mutate a new column populated with a value conditional to those columns.

The new column should indicate which column included 1 when all others were 0 and record the number of that column. If multiple 1s are present, a zero should be recorded. Y1 is included as an indicator that some columns need to be ignored.

For example,
if X1=0, X2=0, X3=1; I'd like Xfoo to be 3,
if X1=1, X2=0, X3=1; I'd like Xfoo to be 0, etc

So far I have been achieving this using dplyr mutate and case_when as follows:

# create toy data:
set.seed(123)
N<-10
mydata <- data.frame(Y1 = sample(0 : 1, N, replace = TRUE),
                     X1 = sample(0 : 1, N, replace = TRUE),
                     X2 = sample(0 : 1, N, replace = TRUE),
                     X3 = sample(0 : 1, N, replace = TRUE))

# perform mutate:
mydata <- mydata %>%
  mutate(
    Xfoo = case_when(
      X1 == 1 & X2 == 0 & X3 == 0 ~ 1,
      X1 == 0 & X2 == 1 & X3 == 0 ~ 2,
      X1 == 0 & X2 == 0 & X3 == 1 ~ 3,
      TRUE ~ 0)
    )

I am happy with this approach given small numbers of variables, however, I realise that this quickly becomes less effective with large numbers of variables.

To combat this, I'd have thought that grouping conditions in some way would keep things concise. Something like:

case_when(
      X1 == 1 & c(X2, X3) ==0 ~ 1
...

I realise that my logic is wrong here, I've been experimenting with all() and any() but can't figure it out and would appreciate some pointers.

harre
  • 7,081
  • 2
  • 16
  • 28
r0bt
  • 383
  • 3
  • 12
  • 2
    You don't provide an explicit description of the pattern you are looking for. But from your example it seems like you want to find rows which have one 1 and the rest zeros, and then grab the column which is 1. If there is no negative numbers, the first step may be solved by `rowSums` over relevant columns (possibly by using `startsWith`). Then use `max.col` to find the non-zero column (see [For each row return the column name of the largest value](https://stackoverflow.com/questions/17735859/for-each-row-return-the-column-name-of-the-largest-value/17735894#17735894)) – Henrik Jul 15 '22 at 14:36
  • How should `Y1` be treated? – harre Jul 15 '22 at 14:37
  • Edited for clarity :) – r0bt Jul 15 '22 at 15:46
  • 1
    If the question is really about just reversing one-hot encoding, then it's probably a duplicate of this: https://stackoverflow.com/questions/40105262/how-to-revert-one-hot-encoded-variable-back-into-single-column – MrFlick Jul 15 '22 at 15:51
  • Apologies if it is. I'm not sure the linked approach could handle multiple 1s in a row could it? – r0bt Jul 15 '22 at 16:03

3 Answers3

1

There's not really a built in function function for multiple comparisons but you could create one

all_same_as <- function(val, ...) {
   Reduce(function(a,b) a & (b==val), list(...), init = TRUE)
}

So this function allows you to pass in any number of columns, and it checks that all the values in those columns are equal to the first value you pass in. So the first parameter should be the number you want the rest to match And then you could use

mydata %>%
  mutate(
    Xfoo = case_when(
      X1 == 1 & all_same_as(0, X2, X3) ~ 1,
      X2 == 1 & all_same_as(0, X1, X3) ~ 2,
      X3 == 1 & all_same_as(0, X1, X2) ~ 3,
      TRUE ~ 0)
  )
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • @MrFlick It seems like I misunderstood OP - I thought the issue was how to avoid hard-coding of the expressions in `case_when` when there is a "_large numbers of variables_". See also my comment on the now deleted answer. Cheers – Henrik Jul 15 '22 at 15:52
  • @Henrik I think you were right. I just chose to focus on alternatives to the `c(X2, X3) ==0` syntax that was attempted but didn't work. Using `all_same_as()` I think would still be easier than adding `==0` for every column you wanted to test. But a recent edit seems there were only ever be one non-zero value so this is just a duplicate of the reverse one-hot encoding questions of which there are plenty on here already. – MrFlick Jul 15 '22 at 15:55
  • As mentioned there are multiple ones present and so I'm not sure that this is reversing one-hot. I accepted this as a solution given large numbers of variables as assumed it would be possible to pass the list of the variables instead of hardcoding them. I misunderstood – r0bt Jul 15 '22 at 16:10
1

You could check what column has the maximum value and extract the column names and the number from that (as indicated by @Henrik in the comments):

library(tidyverse)

mydata |>
  mutate(Xfoo = pmap_dbl(across(starts_with("X")), 
                                ~ if_else(sum(c(...)) == 1,
                                          names(c(...))[which.max(c(...))] |> parse_number(),
                                          0)))

Output:

# A tibble: 10 × 5
      Y1    X1    X2    X3  Xfoo
   <int> <int> <int> <int> <dbl>
 1     0     1     0     0     1
 2     0     1     1     1     0
 3     0     1     0     0     1
 4     1     0     0     1     3
 5     0     1     0     1     0
 6     1     0     0     0     0
 7     1     1     1     0     0
 8     1     0     1     0     2
 9     0     0     0     0     0
10     0     0     1     1     0

See also: For each row return the column name of the largest value

harre
  • 7,081
  • 2
  • 16
  • 28
1

You could try this, but it will be less computationally efficient. I'd use {data.table} if you are working with a large dataset.

mydata %>%
  rowwise() %>%
  mutate(
    Xfoo = case_when(
      all(c(X2,X3) == 0) & X1 == 1 ~ 1,
      all(c(X1,X3) == 0) & X2 == 1 ~ 2,
      all(c(X1,X2) == 0) & X3 == 1 ~ 3,
      TRUE ~ 0)
  ) %>%
  ungroup()
parmsam
  • 19
  • 3
  • This is where my thought process was leading me. Thank you for clarifying and pointing out the lack of efficiency in this approach. – r0bt Jul 15 '22 at 15:47