First of all, I am aware of the related questions and answers on the following pages:
Convert multiple binary columns to single categorical column
For each row return the column name of the largest value
However, my question is slightly different, and the solutions above will not work for me.
Given a dataset with binary variables which may overlap, what is the most efficient way to combine them into a single categorical column?
As a simple example, consider the following dataset:
set.seed(12345)
d1 <- data.frame(score = rnorm(10),
                 Male = sample(c(rep(1, 5), rep(0, 5))),
                 White = sample(c(rep(1, 5), rep(0, 5))),
                 college_ed = rep(c(1, 0), 5))
head(d1,5)
       score Male White college_ed
1  0.5855288    1     0          1
2  0.7094660    1     1          0
3 -0.1093033    0     1          1
4 -0.4534972    0     1          0
5  0.6058875    1     1          1
The objective here is to create a new column that lists the names of all columns equal to one in each row.
So far this is the best solution I have come up with, but I worry that it is a little crude and may not be efficient when applied to a much larger dataset.
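# Return the names of all flags equal to 1 in a row, or "None" if no flag is set
grp_name <- function(x) {
  if (sum(x) == 0) {
    z <- "None"
  } else {
    z <- paste(names(x[x == 1]), collapse = "-")
  }
  return(z)
}
# Apply to the binary columns only; including score would keep sum(x) from being 0
d1$demo <- apply(d1[, c("Male", "White", "college_ed")], 1, grp_name)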
       score Male White college_ed                  demo
1  0.5855288    1     0          1       Male-college_ed
2  0.7094660    1     1          0            Male-White
3 -0.1093033    0     1          1      White-college_ed
4 -0.4534972    0     1          0                 White
5  0.6058875    1     1          1 Male-White-college_ed
Does anyone know of any packages for this problem, or have any suggestions for speeding up the code?
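For reference, one direction that might scale better is to loop over the flag columns rather than over the rows, so that each step is a vectorized operation on a whole column. The following is a minimal base R sketch under that assumption, not a benchmarked solution. The helper name `combine_flags` and its `cols`, `sep`, and `none` arguments are hypothetical, and the small data frame is made up for illustration.

```r
# Hypothetical helper: build the label one flag column at a time.
# cols: names of the binary columns; sep: separator; none: label when no flag is set.
combine_flags <- function(df, cols, sep = "-", none = "None") {
  out <- character(nrow(df))             # start with empty labels
  for (nm in cols) {
    hit <- df[[nm]] %in% 1               # TRUE where the flag is 1 (NA counts as FALSE)
    # Append the column name, adding the separator only if a label already exists
    out <- ifelse(hit,
                  ifelse(out == "", nm, paste(out, nm, sep = sep)),
                  out)
  }
  out[out == ""] <- none                 # rows with no flag set
  out
}

# Illustrative data (made up), including one row with no flags set
d2 <- data.frame(score      = c(0.5, 0.7, -0.1, -0.4, 0.6, 1.0),
                 Male       = c(1, 1, 0, 0, 1, 0),
                 White      = c(0, 1, 1, 1, 1, 0),
                 college_ed = c(1, 0, 1, 0, 1, 0))
d2$demo <- combine_flags(d2, c("Male", "White", "college_ed"))
```

Passing the flag columns explicitly keeps `score` out of the calculation, and the number of loop iterations depends on the number of flag columns rather than the number of rows.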