Convert various dummy/logical variables into a single categorical variable/factor from their name in R

Question

My question has strong similarities with this one and this other one, but my dataset is a little bit different and I can't seem to make those solutions work. Please excuse me if I misunderstood something and this question is redundant.

I have a dataset such as this one:

df <- data.frame(
  id = c(1:5),
  conditionA = c(1, NA, NA, NA, 1),
  conditionB = c(NA, 1, NA, NA, NA),
  conditionC = c(NA, NA, 1, NA, NA),
  conditionD = c(NA, NA, NA, 1, NA)
  )
# id conditionA conditionB conditionC conditionD
# 1  1          1         NA         NA         NA
# 2  2         NA          1         NA         NA
# 3  3         NA         NA          1         NA
# 4  4         NA         NA         NA          1
# 5  5          1         NA         NA         NA

(Note that apart from these columns, I have a lot of other columns that shouldn't be affected by the current manipulation.)

So, I observe that conditionA, conditionB, conditionC and conditionD are mutually exclusives and should be better presented as a single categorical variable, i.e. factor, that should look like this :

#   id       type
# 1  1 conditionA
# 2  2 conditionB
# 3  3 conditionC
# 4  4 conditionD
# 5  5 conditionA

I have investigated using gather or unite from tidyr, but it doesn't correspond to this case (with unite, we lose the information from the variable name).

I tried using kimisc::coalescence.na, as suggested in the first referred answer, but 1. I need first to set a factor value based on the name for each column, 2. it doesn't work as expected, only including the first column :

library(kimisc)
# first, factor each condition with a specific label
df$conditionA <- df$conditionA %>%
  factor(levels = 1, labels = "conditionA")
df$conditionB <- df$conditionB %>%
  factor(levels = 1, labels = "conditionB")
df$conditionC <- df$conditionC %>%
  factor(levels = 1, labels = "conditionC")
df$conditionD <- df$conditionD %>%
  factor(levels = 1, labels = "conditionD")

# now coalesce.na to merge into a single variable
df$type <- coalesce.na(df$conditionA, df$conditionB, df$conditionC, df$conditionD)

df
#   id conditionA conditionB conditionC conditionD       type
# 1  1 conditionA       <NA>       <NA>       <NA> conditionA 
# 2  2       <NA> conditionB       <NA>       <NA>       <NA> 
# 3  3       <NA>       <NA> conditionC       <NA>       <NA> 
# 4  4       <NA>       <NA>       <NA> conditionD       <NA> 
# 5  5 conditionA       <NA>       <NA>       <NA> conditionA

I tried the other suggestions from the second question, but haven't found one that would bring me the expected result...

Encoding a dummy as `NA`/`1` instead of `0`/`1` has no upside that I know of. I've been seeing this a lot on SO lately. — Frank, May 19 '15 at 19:34
@Frank: Actually, it was not thought as a dummy variable (e.g. for regression): I manually entered all those data in Excel, putting a `1` each time a condition was satisfied (and didn't bother to fill the rest with `0`). I'm not sure if I should call that a dummy variable (but that's the term I've been encountering)... — iNyar, May 19 '15 at 20:28

Steven Beaupré · Accepted Answer · 2015-05-20T23:51:37.473

7

Try:

library(dplyr)
library(tidyr)

df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)

Which gives:

#  id       type
#1  1 conditionA
#2  2 conditionB
#3  3 conditionC
#4  4 conditionD
#5  5 conditionA

Update

To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:

df %>% 
  select(starts_with("condition"), id) %>% 
  gather(type, value, -id) %>% 
  na.omit() %>% 
  select(-value) %>% 
  left_join(., df %>% select(-starts_with("condition"))) %>%
  arrange(id)

edited May 20 '15 at 23:51

answered May 19 '15 at 19:00

Steven Beaupré

21,343
7
57
77

Great! I hadn't thought about combining `gather` with `na.omit` to obtain this result! Now, I still have a big issue : the actual dataset has a LOT more columns, and a lot of them are partially empty. `na.omit` would then delete these observations, even though they DO have a condition set... (Also, in `gather(type, value, -id)`, if I have to exclude (`-x`) something like 70 columns, is there a quick way to do it?) – iNyar May 19 '15 at 20:24
2

@SergeBibauw Would something like this work for you: `df %>% select(starts_with("condition"), id) %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% left_join(., df %>% select(-starts_with("condition"))) %>% arrange(id)` ? – Steven Beaupré May 19 '15 at 22:27
Thanks! I indeed figured I needed to `select` only wanted columns, apply the `gather` etc. process, and then "reinject" the result into the original data frame. – iNyar May 20 '15 at 17:18

nicola · Answer 2 · 2015-05-20T05:19:34.133

5

You can also try:

colnames(df)[2:5][max.col(!is.na(df[,2:5]))]
#[1] "conditionA" "conditionB" "conditionC" "conditionD" "conditionA"

The above works if one and only one column has a value other than NA for each row. If the values of a row can be all NAs, then you can try:

mat<-!is.na(df[,2:5])
colnames(df)[2:5][max.col(mat)*(NA^!rowSums(mat))]

edited May 20 '15 at 05:19

answered May 19 '15 at 18:56

nicola

24,005
3
35
56

1

Pretty sure this is the most efficient. 4x as fast as Stephen's on `big_df <- rbind(df,df); for (i in 1:18) big_df <- rbind(big_df,big_df)` – Frank May 19 '15 at 20:29
1

It does work quite well, except that, if a row has `` in all four conditions, then this code gives it one of the four conditions (not always the same, I don't even understand well why), when it should rather leave the resulting value to ``. Is there a way to avoid that? – iNyar May 19 '15 at 22:27

Pierre L · Answer 3 · 2015-05-19T19:02:15.713

1

library(tidyr)
library(dplyr)

df <- df %>%
  gather(type, count, -id)
df <- df[complete.cases(df),][,-3]
df[order(df$id),]
   id       type
1   1 conditionA
7   2 conditionB
13  3 conditionC
19  4 conditionD
5   5 conditionA

edited May 19 '15 at 19:02

answered May 19 '15 at 18:51

Pierre L

28,203
6
47
69

1

A more `dplyr`-esque approach would be `df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)` – Steven Beaupré May 19 '15 at 18:57

Convert various dummy/logical variables into a single categorical variable/factor from their name in R

3 Answers3

Linked

Related