
I've been using R on and off for two years now and have been trying to get my head around vectorization. Since I deal a lot with dummy variables from multiple-response sets in surveys, I thought this would be an interesting case to learn with.

The idea is to go from multiple responses to dummy variables (and back). For example: "Of these 8 different chocolates, which are your favorite ones (choose up to 3)?"

Sometimes we code this as dummy variables (1 if the person likes "Cote d'Or", 0 if not), with one variable per option, and sometimes as categorical (1 if the person likes "Cote d'Or", 2 if the person likes "Lindt", and so on), with 3 variables for the 3 choices.

So, basically, I can end up with a matrix whose rows look like

1,0,0,1,0,0,1,0

Or a matrix with rows like

1,4,7

And the idea, as mentioned, is to go from one to the other. So far I have a loop solution for each case and a vectorized solution for going from dummy to categorical. I would appreciate any further insight into this matter, and a vectorized solution for the categorical-to-dummy step.

DUMMY TO NOT DUMMY

vecOrig<-matrix(0,nrow=18,ncol=8)  # From this one
vecDest<-matrix(0,nrow=18,ncol=3)  # To this one

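# Populating the original matrix.
# I'm pretty sure this could have been added to the definition of the matrix,
# but I kept getting repeated numbers.
# How would you vectorize this?
# 'vec' was not defined in the original post; a template row with
# three 1s and five 0s is assumed here (vecDest has 3 columns).
vec <- c(1, 1, 1, 0, 0, 0, 0, 0)
for (i in 1:length(vecOrig[,1])){
  vecOrig[i,]<-sample(vec)
}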

# Now, how would you vectorize this following step... 
for(i in 1:length(vecOrig[,1])){            
  vecDest[i,]<-grep(1,vecOrig[i,])
}

# Vectorized solution, I had to transpose it for some reason.
vecDest2<-t(apply(vecOrig,1,function(x) grep(1,x)))   

NOT DUMMY TO DUMMY

matOrig<-matrix(0,nrow=18,ncol=3)  # From this one
matDest<-matrix(0,nrow=18,ncol=8)  # To this one.

# We populate the origin matrix. Same thing as the other case. 
for (i in 1:length(matOrig[,1])){         
  matOrig[i,]<-sample(1:8,3,FALSE)
}

# this works, but how to make it vectorized?
for(i in 1:length(matOrig[,1])){          
  for(j in matOrig[i,]){
    matDest[i,j]<-1
  }
}

# Not a clue of how to vectorize this one. 
# The 'model.matrix' solution doesn't look neat.
Jose Luis
fioghual
  • Question: Why do this at all? What is the end goal? – Brandon Bertelsen Dec 18 '12 at 15:34
  • Haha, first answer: to learn. Next: munge data to needs. Also: develop R competencies! – fioghual Dec 18 '12 at 15:42
  • In this particular case, I have a db with 239 variables and 2000+ cases. Some variables are coded as dummy some others as categorical. I use R but as a team we work with SPSS. Many times we need to get the "other" version for some calculations in SPSS (cluster analysis, MCA, etc...). – fioghual Dec 18 '12 at 15:54

2 Answers


Vectorized solutions:

Dummy to not dummy

vecDest <- t(apply(vecOrig == 1, 1, which))
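A small sketch of what this line does, on a made-up matrix (the names `dummy`, `positions` and `ragged` are illustrative, not from the post). It also shows a caveat worth keeping in mind: the result is only a matrix when every row contains the same number of 1s; otherwise `apply` cannot simplify and returns a list.

```r
# Illustrative data: every row has exactly two 1s
dummy <- matrix(c(1, 0, 1, 0,
                  0, 1, 0, 1,
                  1, 1, 0, 0),
                nrow = 3, byrow = TRUE)

# Equal counts per row: apply() returns a matrix with one column per
# input row, so t() puts the rows back where they belong
positions <- t(apply(dummy == 1, 1, which))
positions
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
# [3,]    1    2

# Unequal counts per row: apply() falls back to a list
ragged <- dummy
ragged[3, 2] <- 0
is.list(apply(ragged == 1, 1, which))   # TRUE
```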

Not dummy to dummy (back to the original)

nCol <- 8

vecOrig <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
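As a possible alternative that avoids `apply` altogether, the same conversion can be sketched with matrix indexing: a two-column matrix of (row, column) pairs selects exactly the cells to set to 1. The data and the names `choices`, `idx`, `dummyIdx` and `dummyApply` are made up for illustration; the sketch assumes there are no missing values in the categorical matrix.

```r
nCol <- 8

# Illustrative categorical data: each row lists the chosen options
choices <- matrix(c(1, 4, 7,
                    2, 3, 8),
                  nrow = 2, byrow = TRUE)

# Start from all zeros, then set the chosen cells in one assignment.
# row(choices) and choices are both flattened column by column,
# so the (row, column) pairs line up.
dummyIdx <- matrix(0, nrow = nrow(choices), ncol = nCol)
idx <- cbind(as.vector(row(choices)), as.vector(choices))
dummyIdx[idx] <- 1
dummyIdx
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]    1    0    0    1    0    0    1    0
# [2,]    0    1    1    0    0    0    0    1

# Same result as the apply/replace version above
dummyApply <- t(apply(choices, 1, replace, x = rep(0, nCol), values = 1))
all(dummyIdx == dummyApply)   # TRUE
```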
Sven Hohenstein
  • Thanks, this looks like what I wanted... The first one looks similar to the one I had, but with a more complex syntax. The second one is the one I was looking for! Cheers. – fioghual Dec 18 '12 at 15:49
  • Can anyone provide insight as to why it has to be transposed? – fioghual Dec 18 '12 at 18:46
  • It has to be transposed since `apply` automatically uses the returned vectors as columns of the new matrix. – Sven Hohenstein Dec 18 '12 at 19:02
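The transposition point can be checked on a tiny matrix. The sketch below uses made-up names (`m`, `res`); the second half assumes a template row `vec` with three 1s and five 0s, and shows that `replicate` stacks its results as columns in the same way, so filling the 18 x 8 matrix without a loop also needs `t()`.

```r
m <- matrix(1:6, nrow = 2)               # 2 rows, 3 columns
res <- apply(m, 1, function(x) x * 10)   # one result vector per row
dim(res)      # 3 2: each row's result became a column
dim(t(res))   # 2 3: transposing restores the original orientation

# replicate() simplifies the same way as apply().
# 'vec' is an assumed template row, not taken from the original post.
vec <- c(1, 1, 1, 0, 0, 0, 0, 0)
vecOrig <- t(replicate(18, sample(vec)))
dim(vecOrig)  # 18 8
```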

This might provide some insight for the first part:

#Create example data
set.seed(42)
vecOrig<-matrix(rbinom(20,1,0.2),nrow=5,ncol=4)

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    1    0    0    1
[3,]    0    0    1    0
[4,]    1    0    0    0
[5,]    0    0    0    0

Note that this does not assume that the number of ones is equal in each row (e.g., you wrote "choose up to 3").

#use algebra to create position numbers
vecDest <- t(t(vecOrig)*1:ncol(vecOrig))

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    4
[2,]    1    0    0    4
[3,]    0    0    3    0
[4,]    1    0    0    0
[5,]    0    0    0    0
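The double transpose works because R recycles `1:ncol(vecOrig)` down the columns of `t(vecOrig)`, so column i of the original gets multiplied by i. A possibly more readable equivalent, sketched here with the illustrative names `viaTranspose` and `viaCol`, multiplies by `col()`, which returns a matrix holding each cell's column index.

```r
set.seed(42)
vecOrig <- matrix(rbinom(20, 1, 0.2), nrow = 5, ncol = 4)

# The double-transpose form from the answer
viaTranspose <- t(t(vecOrig) * 1:ncol(vecOrig))

# Element-wise product with the matrix of column indices
viaCol <- vecOrig * col(vecOrig)

all(viaTranspose == viaCol)   # TRUE
```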

Now, we remove the zeros. Thus, we have to turn the object into a list.

vecDest <- split(t(vecDest), rep(1:nrow(vecDest), each = ncol(vecDest)))
lapply(vecDest,function(x) x[x>0])

$`1`
[1] 1 4

$`2`
[1] 1 4

$`3`
[1] 3

$`4`
[1] 1

$`5`
numeric(0)
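If a rectangular categorical matrix is needed afterwards (for instance, one variable per choice), one option is to pad the shorter entries with `NA`. This is only a sketch: the list below is typed in by hand to mirror the output above, and the names `positions`, `maxChoices` and `padded` are illustrative. It assumes at least one row has two or more choices, since otherwise `sapply` returns a plain vector rather than a matrix.

```r
# Hand-typed copy of the list shown above
positions <- list(`1` = c(1, 4), `2` = c(1, 4), `3` = 3, `4` = 1,
                  `5` = numeric(0))

# Pad every entry with NA up to the longest one, then stack as rows
maxChoices <- max(lengths(positions))
padded <- t(sapply(positions,
                   function(x) c(x, rep(NA, maxChoices - length(x)))))
padded
#   [,1] [,2]
# 1    1    4
# 2    1    4
# 3    3   NA
# 4    1   NA
# 5   NA   NA
```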
Roland