3

A selected answer to a question here:

creating a factor variable with dplyr?

Did not impress Hadley and the follow-up answer does not generalise well for some of the problems I've come across. I'm wondering if the community can do something better with a simpler example:

  ### DATA ###
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])

dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")

names(dummies) <- header

### ONE WAY ###
dummies$Religion <- factor(ifelse(dummies$Christian==1, "Christian",
                            ifelse(dummies$Muslim==1, "Muslim",
                                   ifelse(dummies$Athiest==1, "Athiest", NA))))

Solution mimics the result provided to the OP in the link above. Is there a simpler function to collapse the dummy variables to one factor variable, like say the egen group function in STATA?? Simple one liner would be great.

Using Akrun's solution and system time (thank you):

set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])

dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")

names(dummies) <- header
attach(dummies)



#Alistaire
system.time({
  dummies %>% rowwise() %>% 
    transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user  system elapsed 
# 56.08    0.00   56.08 

system.time({
  dummies %>% transmute(religion = case_when(
    as.logical(Christian) ~ 'Christian', 
    as.logical(Muslim) ~ 'Muslim', 
    as.logical(Athiest) ~ 'Atheist'))
})
# user  system elapsed 
# 0.22    0.04    0.27 



#Curt F.
system.time({
  dummies %>% 
    gather(religion, is_valid) %>% 
    filter(is_valid == T) %>%
    select(-is_valid)
})
# user  system elapsed 
# 0.33    0.03    0.36 




#Akrun
system.time({
  names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user  system elapsed 
# 0.13    0.06    0.21 

system.time({
  names(dummies)[max.col(dummies, "first")]
})
# user  system elapsed 
# 0.04    0.07    0.11 

I find that Akrun's solution works out to be the fastest method and provided 2 one-liners. However, many thanks to the others for their unique approaches to the problem and generous supply of coding methods that I would like to learn more about, especially the use of %%, names(.), is_valid and the qdapTools package.

Community
  • 1
  • 1
mkrasmus
  • 133
  • 11

4 Answers4

5

A quick way with dplyr would be

dummies %>% rowwise() %>% 
    transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])

What Hadley's really complaining about in that answer is nested ifelse structure, though. He's built case_when to replace it:

dummies %>% transmute(religion = case_when(
    as.logical(Christian) ~ 'Christian', 
    as.logical(Muslim) ~ 'Muslim', 
    as.logical(Athiest) ~ 'Atheist'))
alistaire
  • 42,459
  • 4
  • 77
  • 117
2

We can use

dummies$Religion <- names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]

Or with max.col

dummies$Religion <- names(dummies)[max.col(dummies, "first")]

If there are rows that have only 0 elements, then

dummies$Religion <- names(dummies)[max.col(dummies, "first")*NA^(!rowSums(dummies))]

NOTE: In all of the above solution, it can be wrapped with factor. But, it is better to keep it as character

NOTE2: Both the solutions are base R only one-line solutions and are very fast compared to any package solution (proof is showed in the benchmarks below)

Benchmarks

set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
 

C = c(rep(0,1e6), 1 - A[1000001:2000000])

dummies <- data.frame(A,B,C)
colnames(dummies) <- c("Christian", "Muslim", "Athiest")

system.time({
dummies %>% rowwise() %>% 
    transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
#  user  system elapsed 
#  49.13    0.06   49.55 

system.time({
dummies %>% transmute(religion = case_when(
    as.logical(Christian) ~ 'Christian', 
    as.logical(Muslim) ~ 'Muslim', 
    as.logical(Athiest) ~ 'Atheist'))
  })
#Error in mutate_impl(.data, dots) : object 'Christian' not found
#Timing stopped at: 0 0 0 

system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
#  user  system elapsed 
#   0.11    0.01    0.13 

system.time({
names(dummies)[max.col(dummies, "first")]
})
# user  system elapsed 
#   0.07    0.02    0.08 
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
2

One way to do this is to combine tidyr and dplyr. This may not give the fastest performance (I haven't checked), but to me at least it gives the easiest-to-understand code.

Start with the dummies data frame from the OP:

A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])

dummies <- as.data.frame(cbind(A, B, C))
header <- c("Christian", "Muslim", "Atheist")
names(dummies) <- header

Then the gather() function from tidyr does the heavy lifting, and filter() and select() from dplyr do the cleanup.

require(tidyr)
require(dplyr)
dummies %>% 
    gather(religion, is_valid) %>% 
    filter(is_valid == T) %>%
    select(-is_valid)

The nice thing about this version is that it doesn't make any assumptions about the one-hotness of the initial dataframe. If some row in the initial frame is both an atheist and a Christian, your output will have two rows.

Curt F.
  • 4,690
  • 2
  • 22
  • 39
1

If the main intent of the OP is to create the Religion column this can be done directly in one call:

Religion <- sample(c("Christian", "Muslim", "Atheist"), 200, replace = TRUE, 
                   prob = c(60, 20, 20))

The parameter prob can be used to specify the probability weights. Just to check:

table(Religion)
#Religion
#  Atheist Christian    Muslim 
#       37       115        48 

However, if the dummies data.frame would be required for some reason, it could be created from the Religion vector with the following code:

mat <- sapply(unique(Religion), function(x) as.integer(Religion == x))
dummies <- cbind(as.data.frame(mat), Religion)

This will result in:

head(dummies)
#  Muslim Christian Atheist  Religion
#1      1         0       0    Muslim
#2      1         0       0    Muslim
#3      0         1       0 Christian
#4      1         0       0    Muslim
#5      0         1       0 Christian
#6      0         0       1   Atheist

Note that the result may look different for different runs of sample() as we haven't used set.seed() before calling sample().


From this answer I learned about the mtabulate() function from package qdapTools which can replace the sapply() construct by a one-liner:

dummies <- cbind(qdapTools::mtabulate(Religion), Religion)
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134