0

I have a dataframe, call it A, where the columns are Question 1, Question 2, Question 3 and so on, and the rows are indexed by the person who answered the questions, Person 1, Person 2, and so on. The questions have multiple choice answers, with varying lengths. So for instance, Question 50 may have 9 possible answers (each person can only choose 1 answer). So for instance, the entries in the column under Question 50 are numbers ranging from 1 to 9.

In order to do some PCA on this dataset, I need to convert these columns to binary form. So for instance, column Question 50 will be converted to 9 different columns: Q501, Q502, Q503....Q509. Then, Q50i,row K, will be 1 if Person K answered i to question 50, and 0 otherwise. In other words, I am making my columns indicator vectors for which response was given by which person.

I want to write a function that takes as input a column and does this binary coding of my dataset. I can do this for specifically one column, but when I try to convert the same syntax to a function (so I can apply the function to a range of columns) R can't seem to evaluate my variable. Since I have 122 columns to convert, a function really is necessary.

Here is what worked, for a specific column (50 in this case):

for (i in 1:max(A["Q50"])) {
  A[paste0("Q50",i)] <- ifelse( A["Q50"]==i,1,0)
}

Here is the function that I tried to write but didn't work:

binarize <- function(column) {
  for (i in 1:max(A["column"])) {
    A[paste0("column",i)] <- ifelse( A["column"]==i,1,0)
  }
}

The error I get is:

Error in `[.data.frame`(zip.lingLoc, "column") : 
  undefined columns selected 
Traceback:
4 stop("undefined columns selected") 
3 `[.data.frame`(zip.lingLoc, "column") 
2 zip.lingLoc["column"] 
1 binarize("Q053") 

Here is an example:

A is the following dataframe.

   ID Q050
1   1    4
2   2    4
3   3    4
4   4    7
5   5    8
6   6    8
7   7    7
8   8    4
9   9    7
10 10    7

Now I apply the thing that works:

for (i in 1:max(A["Q050"])) {
  A[paste0("Q050",i)] <- ifelse( A["Q050"]==i,1,0)
}

And A becomes:

   ID Q050 Q050 Q050 Q050 Q050 Q050 Q050 Q050 Q050
1   1    4    0    0    0    1    0    0    0    0
2   2    4    0    0    0    1    0    0    0    0
3   3    4    0    0    0    1    0    0    0    0
4   4    7    0    0    0    0    0    0    1    0
5   5    8    0    0    0    0    0    0    0    1
6   6    8    0    0    0    0    0    0    0    1
7   7    7    0    0    0    0    0    0    1    0
8   8    4    0    0    0    1    0    0    0    0
9   9    7    0    0    0    0    0    0    1    0
10 10    7    0    0    0    0    0    0    1    0

Which is great, but if I apply my previous functions binarize to it, I just get the same error as I noted above.

My questions are, what is wrong with my function binarize. And is this the best way for me to do this? Thank you!

Jaap
  • 81,064
  • 34
  • 182
  • 193
Lisha
  • 23
  • 4
  • 2
    Make your problem [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Add some sample input and desired output. you definitely have problems with things like `max(A["column"])` where I think you mean `max(A[[column]])` but without test data it's difficult to verify. – MrFlick Oct 03 '14 at 05:49
  • See ``?model.matrix`` – orizon Oct 15 '14 at 09:02

4 Answers4

1

'dcast' function of 'reshape2' package may be useful here:

> dd = data.frame(ID=letters[1:10], Q050=sample(1:10, 10))
> dd
   ID Q050
1   a    5
2   b    2
3   c    9
4   d    8
5   e    1
6   f    7
7   g    4
8   h    3
9   i   10
10  j    6
> 
> library(reshape2)
> dcast(dd, ID~Q050)
Using Q050 as value column: use value.var to override.
   ID  1  2  3  4  5  6  7  8  9 10
1   a NA NA NA NA  5 NA NA NA NA NA
2   b NA  2 NA NA NA NA NA NA NA NA
3   c NA NA NA NA NA NA NA NA  9 NA
4   d NA NA NA NA NA NA NA  8 NA NA
5   e  1 NA NA NA NA NA NA NA NA NA
6   f NA NA NA NA NA NA  7 NA NA NA
7   g NA NA NA  4 NA NA NA NA NA NA
8   h NA NA  3 NA NA NA NA NA NA NA
9   i NA NA NA NA NA NA NA NA NA 10
10  j NA NA NA NA NA  6 NA NA NA NA
> 

Convert all positive values to 1 for binary:

> outdf[outdf>0]=1
Warning message:
In Ops.factor(left, right) : > not meaningful for factors
> outdf
   ID  1  2  3  4  5  6  7  8  9 10
1   a NA NA NA NA  1 NA NA NA NA NA
2   b NA  1 NA NA NA NA NA NA NA NA
3   c NA NA NA NA NA NA NA NA  1 NA
4   d NA NA NA NA NA NA NA  1 NA NA
5   e  1 NA NA NA NA NA NA NA NA NA
6   f NA NA NA NA NA NA  1 NA NA NA
7   g NA NA NA  1 NA NA NA NA NA NA
8   h NA NA  1 NA NA NA NA NA NA NA
9   i NA NA NA NA NA NA NA NA NA  1
10  j NA NA NA NA NA  1 NA NA NA NA

If needed, convert all NA to 0:

> outdf[is.na(outdf)]=0
> outdf
   ID 1 2 3 4 5 6 7 8 9 10
1   a 0 0 0 0 1 0 0 0 0  0
2   b 0 1 0 0 0 0 0 0 0  0
3   c 0 0 0 0 0 0 0 0 1  0
4   d 0 0 0 0 0 0 0 1 0  0
5   e 1 0 0 0 0 0 0 0 0  0
6   f 0 0 0 0 0 0 1 0 0  0
7   g 0 0 0 1 0 0 0 0 0  0
8   h 0 0 1 0 0 0 0 0 0  0
9   i 0 0 0 0 0 0 0 0 0  1
10  j 0 0 0 0 0 1 0 0 0  0
rnso
  • 23,686
  • 25
  • 112
  • 234
  • Thanks rnso! How can I do this for many columns at once and get one dataframe in the end. I need to do this for about 100 columns. – Lisha Oct 03 '14 at 06:35
0

You may try (for multiple columns using base R)

 set.seed(45)
 dat1 <- setNames(as.data.frame(matrix(sample(1:9, 10*20, replace=TRUE),
              ncol=10)), paste0("Q", 1:10))
 dat1$ID <- 1:20


 lst <- lapply(dat1[, grep("^Q", colnames(dat1))], function(x) {
          x1 <- data.frame(x, ID = factor(dat1[, "ID"], levels = unique(dat1[, "ID"])))
          x2 <- as.data.frame.matrix(table(x1[, 2:1]))
          x3 <- data.frame(ID = rownames(x2), x, x2, check.names = FALSE)
  })



  lst2 <- Map(function(x, y) {
                  colnames(x)[-1] <- paste0(y, colnames(x)[-1])
                        x
                }, lst, names(lst))


  lst2[[1]]
  #   ID Q1x Q11 Q12 Q13 Q14 Q15 Q16 Q18 Q19
  #1   1   6   0   0   0   0   0   1   0   0
  #2   2   3   0   0   1   0   0   0   0   0
  #3   3   3   0   0   1   0   0   0   0   0
  #4   4   4   0   0   0   1   0   0   0   0
  #5   5   4   0   0   0   1   0   0   0   0
  #6   6   3   0   0   1   0   0   0   0   0
  #7   7   3   0   0   1   0   0   0   0   0
  #8   8   5   0   0   0   0   1   0   0   0
  #9   9   2   0   1   0   0   0   0   0   0
  #10 10   1   1   0   0   0   0   0   0   0
  #11 11   4   0   0   0   1   0   0   0   0
  #12 12   8   0   0   0   0   0   0   1   0
  #13 13   3   0   0   1   0   0   0   0   0
  #14 14   5   0   0   0   0   1   0   0   0
  #15 15   4   0   0   0   1   0   0   0   0
  #16 16   4   0   0   0   1   0   0   0   0
  #17 17   9   0   0   0   0   0   0   0   1
  #18 18   2   0   1   0   0   0   0   0   0
  #19 19   4   0   0   0   1   0   0   0   0
  #20 20   3   0   0   1   0   0   0   0   0

Update

If you want to get a single data.frame

Try:

  res <-  data.frame(lst2[[1]][,1, drop=FALSE],
              do.call(`cbind`, lapply(lst2, `[`, -1)))
  colnames(res) <- gsub(".*\\.|[x]", "", colnames(res))
  res[1:3,1:3]
  # ID Q1 Q11
  #1  1  6   0
  #2  2  3   0
  #3  3  3   0

 dim(res)
 #[1] 20 91
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks Akrun! One more question, I want to have one dataframe in the end with all of these columns, instead of doing a join of all the lst entries, is there a way to do this with your original function? – Lisha Oct 03 '14 at 06:17
0

For multiple columns:

> dd = data.frame(ID=letters[1:15], Q050=sample(1:10, 15, replace=T), Q010=sample(1:10, 15, replace=T), Q020=sample(1:10, 15, replace=T))

Use 'melt' of reshape2 to get long form:

> library(reshape2)
> mm = melt(dd, id='ID')
> head(mm)
  ID variable value
1  a     Q050     2
2  b     Q050     2
3  c     Q050     8
4  d     Q050     1
5  e     Q050     3
6  f     Q050     4

Split this long form into data frame with different questions:

> ll =split(mm, mm$variable)
> ll
$Q050
   ID variable value
1   a     Q050     2
2   b     Q050     2
3   c     Q050     8
4   d     Q050     1
5   e     Q050     3
6   f     Q050     4
7   g     Q050     8
8   h     Q050     5
9   i     Q050     4
10  j     Q050     1
11  k     Q050     5
12  l     Q050     4
13  m     Q050    10
14  n     Q050     2
15  o     Q050     2

$Q010
   ID variable value
16  a     Q010    10
17  b     Q010     6
18  c     Q010     6
19  d     Q010     6
20  e     Q010     2
21  f     Q010     2
22  g     Q010     6
23  h     Q010     4
24  i     Q010    10
25  j     Q010     4
26  k     Q010     9
27  l     Q010     8
28  m     Q010     7
29  n     Q010    10
30  o     Q010     6

$Q020
   ID variable value
31  a     Q020     7
32  b     Q020     7
33  c     Q020     9
34  d     Q020    10
35  e     Q020     6
36  f     Q020     6
37  g     Q020     2
38  h     Q020     4
39  i     Q020     7
40  j     Q020     1
41  k     Q020     4
42  l     Q020     4
43  m     Q020     5
44  n     Q020     8
45  o     Q020     5

Use 'sapply' function to perform 'dcast' on every data frame/ question:

sapply(ll, function(x) dcast(x, ID~value))

> sapply(ll, function(x) dcast(x, ID~value))
$Q050
   ID  1  2  3  4  5  8 10
1   a NA  2 NA NA NA NA NA
2   b NA  2 NA NA NA NA NA
3   c NA NA NA NA NA  8 NA
4   d  1 NA NA NA NA NA NA
5   e NA NA  3 NA NA NA NA
6   f NA NA NA  4 NA NA NA
7   g NA NA NA NA NA  8 NA
8   h NA NA NA NA  5 NA NA
9   i NA NA NA  4 NA NA NA
10  j  1 NA NA NA NA NA NA
11  k NA NA NA NA  5 NA NA
12  l NA NA NA  4 NA NA NA
13  m NA NA NA NA NA NA 10
14  n NA  2 NA NA NA NA NA
15  o NA  2 NA NA NA NA NA

$Q010
   ID  2  4  6  7  8  9 10
1   a NA NA NA NA NA NA 10
2   b NA NA  6 NA NA NA NA
3   c NA NA  6 NA NA NA NA
4   d NA NA  6 NA NA NA NA
5   e  2 NA NA NA NA NA NA
6   f  2 NA NA NA NA NA NA
7   g NA NA  6 NA NA NA NA
8   h NA  4 NA NA NA NA NA
9   i NA NA NA NA NA NA 10
10  j NA  4 NA NA NA NA NA
11  k NA NA NA NA NA  9 NA
12  l NA NA NA NA  8 NA NA
13  m NA NA NA  7 NA NA NA
14  n NA NA NA NA NA NA 10
15  o NA NA  6 NA NA NA NA

$Q020
   ID  1  2  4  5  6  7  8  9 10
1   a NA NA NA NA NA  7 NA NA NA
2   b NA NA NA NA NA  7 NA NA NA
3   c NA NA NA NA NA NA NA  9 NA
4   d NA NA NA NA NA NA NA NA 10
5   e NA NA NA NA  6 NA NA NA NA
6   f NA NA NA NA  6 NA NA NA NA
7   g NA  2 NA NA NA NA NA NA NA
8   h NA NA  4 NA NA NA NA NA NA
9   i NA NA NA NA NA  7 NA NA NA
10  j  1 NA NA NA NA NA NA NA NA
11  k NA NA  4 NA NA NA NA NA NA
12  l NA NA  4 NA NA NA NA NA NA
13  m NA NA NA  5 NA NA NA NA NA
14  n NA NA NA NA NA NA  8 NA NA
15  o NA NA NA  5 NA NA NA NA NA
rnso
  • 23,686
  • 25
  • 112
  • 234
0

With explanations of each step and using arguments to avoid calling extra methods.

1) Use grep to get all item from column names in df that start, hence the "^", with Q.
We pass the argument value=TRUE such that we get the exact column names, not their indices.

   questions.labels <- grep("^Q", colnames(df), value=TRUE)


2) Use melt, from the library reshape2, to create a "tall" data frame, that is, columns will be
ID(1,2,3...), id argument
variable(Q050,Q050,Q051...), measure.vars argument
value(3,4,5,6)

library("reshape2")
    df.melt <- melt(df, 
                    id='ID', 
                    measure.vars=questions.labels)


3) Split the data frame by variable (previously measure.vars) such that you can traverse them individually.

    df.split <- split(df.melt, df.melt$variable)


4) Use sapply to cast the values of the variables on each split data frame.
For casting the values, use dcast from reshape2.

function(x) 1 in dcast is set such that each existing value is replaced by 1
fill = 1 is set such that each non-existing value is replaced by 0

    df.bin <- sapply(df.split, function(x) dcast(x, ID~value, function(x) 1, fill=0))

This is considerably slow and I would love to see faster implementations!
Hope this helps!

rafaelvalle
  • 6,683
  • 3
  • 34
  • 36