
First of all, I am aware of the related questions and answers on the following pages:

Convert multiple binary columns to single categorical column

For each row return the column name of the largest value

However, my question is slightly different, and the solutions above will not work for me.

Given a dataset with binary variables which may overlap, what is the most efficient way to combine them into a single categorical column?

As a simple example, consider the following dataset:

set.seed(12345)
d1<-data.frame(score=rnorm(10),
               Male=sample(c(rep(1,5), rep(0,5))), 
               White=sample(c(rep(1,5),rep(0,5))), 
               college_ed = rep(c(1,0),5))

head(d1,5)

       score Male White college_ed
1  0.5855288    1     0          1
2  0.7094660    1     1          0
3 -0.1093033    0     1          1
4 -0.4534972    0     1          0
5  0.6058875    1     1          1

The objective here is to create a new column that, for each row, lists the names of all binary columns equal to 1.

So far, this is the best solution I have come up with, but I worry that it is a little crude and may not be efficient when applied to a much larger data set.

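# Label one row: names of the columns equal to 1, or "None" if there are none
grp_name <- function(x) {
  if (sum(x) == 0) {
    z <- "None"
  } else {
    z <- paste(names(x)[x == 1], collapse = "-")
  }
  return(z)
}

# Pass only the binary columns; including score would keep sum(x) == 0
# from ever being TRUE, so "None" would never be assigned
d1$demo <- apply(d1[, c("Male", "White", "college_ed")], 1, grp_name)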

       score Male White college_ed                  demo
1  0.5855288    1     0          1       Male-college_ed
2  0.7094660    1     1          0            Male-White
3 -0.1093033    0     1          1      White-college_ed
4 -0.4534972    0     1          0                 White
5  0.6058875    1     1          1 Male-White-college_ed

Does anyone know of packages for this problem, or have suggestions for speeding up the code?

M.Bergen

1 Answer


Not a perfect solution, but it should get you on your way to something faster. The if statement does not vectorize, but ifelse() does, so there is no need to use the apply function. See below.

set.seed(12345)
d1<-data.frame(score=rnorm(10),
               Male=sample(c(rep(1,5), rep(0,5))), 
               White=sample(c(rep(1,5),rep(0,5))), 
               college_ed = rep(c(1,0),5))

head(d1,5)

makeKey <- function(x,y,z){
  s1 <- ifelse(x == 1,"Male", "")
  s2 <- ifelse(y == 1, "White", "")
  s3 <- ifelse(z == 1, "college_ed", "")
  s4 <- paste(s1,s2,s3, sep = "-" )
  return(s4)
}

d1$key <- makeKey(x=d1$Male, y=d1$White, z=d1$college_ed)
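Note that makeKey() as written leaves stray separators (for example "Male--college_ed") and returns "--" for rows where every flag is 0. A minimal sketch of one way to tidy that up while staying vectorized over rows is shown here. The helper name make_key_clean and its arguments are hypothetical, not part of the original answer, and the small data frame is made up for illustration.

```r
# Hypothetical helper: loops over the (few) binary columns, but each step
# is vectorized over all rows, so apply() is still avoided.
make_key_clean <- function(df, cols, sep = "-", none = "None") {
  key <- character(nrow(df))           # start with "" for every row
  for (nm in cols) {
    idx <- which(df[[nm]] == 1)        # rows where this flag is set (NAs skipped)
    # Only add the separator where the key already has some text
    key[idx] <- ifelse(key[idx] == "", nm, paste(key[idx], nm, sep = sep))
  }
  key[key == ""] <- none               # rows with no flag set
  key
}

# Small made-up example data
d <- data.frame(score = c(0.5, -1.2, 0.3, 2.0),
                Male = c(1, 1, 0, 0),
                White = c(0, 1, 1, 0),
                college_ed = c(1, 0, 1, 0))

d$key <- make_key_clean(d, c("Male", "White", "college_ed"))
print(d$key)
```

Because the column names are passed in as a vector, the same function works for any number of binary columns without writing one ifelse() per column.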
Kgrey
Great suggestion! I realize it might have been better for comparing processing times if I had made the example data set bigger. When I increased the size to 1 million rows, your solution was around 6x faster. – M.Bergen Dec 02 '18 at 00:09
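For anyone who wants to repeat that kind of timing comparison, a rough sketch using base R's system.time() might look like the following. The data here is simulated with rbinom() rather than taken from the question, the row count n is an assumption (raise it toward 1e6 to mirror the comment), and the timings will vary by machine.

```r
set.seed(1)
n <- 1e5  # assumed size; increase toward 1e6 for a larger test
cols <- c("Male", "White", "college_ed")

# Simulated data with the same column layout as the question
big <- data.frame(score = rnorm(n),
                  Male = rbinom(n, 1, 0.5),
                  White = rbinom(n, 1, 0.5),
                  college_ed = rbinom(n, 1, 0.5))

# Row-wise approach from the question
grp_name <- function(x) {
  if (sum(x) == 0) "None" else paste(names(x)[x == 1], collapse = "-")
}

# Vectorized approach from the answer
make_key <- function(x, y, z) {
  paste(ifelse(x == 1, "Male", ""),
        ifelse(y == 1, "White", ""),
        ifelse(z == 1, "college_ed", ""),
        sep = "-")
}

t_apply <- system.time(k1 <- apply(big[, cols], 1, grp_name))
t_vec <- system.time(k2 <- make_key(big$Male, big$White, big$college_ed))

# User, system, and elapsed seconds for each approach
print(rbind(apply = t_apply[1:3], vectorized = t_vec[1:3]))
```

The two functions format their labels differently, so this compares speed only, not output.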