I have a large dataset in which one column holds several values for each subject (row). Here is a simplified sample data frame:

data <- data.frame(subject = c(1:8), sex = c(1, 2, 2, 1, 2, 1, 1, 2), 
              age = c(35, 29, 31, 46, 64, 57, 49, 58), 
              v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C"))

> data
  subject sex age          v1
1       1   1  35           2
2       2   2  29           0
3       3   2  31         3,5  # separated by a comma
4       4   1  46         2 1  # separated by a blank space
5       5   2  64         A,4
6       6   1  57       B,1,C
7       7   1  49   A and B,3
8       8   2  58 5, 6 A or C

I first want to remove the letters (A, B, A and B, …) in the fourth column (v1), and then split the fourth column into multiple columns just like this:

  subject sex age x1 x2 x3 x4 x5 x6
1       1   1  35  0  1  0  0  0  0        
2       2   2  29  0  0  0  0  0  0
3       3   2  31  0  0  1  0  1  0  
4       4   1  46  1  1  0  0  0  0
5       5   2  64  0  0  0  1  0  0
6       6   1  57  1  0  0  0  0  0
7       7   1  49  0  0  1  0  0  0
8       8   2  58  0  0  0  0  1  1

where the 1st subject takes 1 at x2 because it takes 2 at v1 in the original dataset, the 3rd subject takes 1 at both x3 and x5 because it takes 3 and 5 at v1 in the original dataset, and so on.

I would appreciate any help on this question. Thanks a lot.

Arun
user187454

2 Answers

You can cbind this result to data[-4] and get what you need:

 0+t(sapply(as.character(data$v1), function(line) 
        sapply(1:6, function(x) x %in% unlist(strsplit(line, split="\\s|\\,"))) ))
#----------------
            [,1] [,2] [,3] [,4] [,5] [,6]
2              0    1    0    0    0    0
0              0    0    0    0    0    0
3,5            0    0    1    0    1    0
2 1            1    1    0    0    0    0
A,4            0    0    0    1    0    0
B,1,C          1    0    0    0    0    0
A and B,3      0    0    1    0    0    0
5, 6 A or C    0    0    0    0    1    1
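
As a rough sketch of the cbind step mentioned above, the indicator matrix can be relabelled and attached to the remaining columns. The names `ind` and `result` and the `x1` to `x6` column labels are assumptions made for illustration, chosen to match the layout requested in the question.

```r
# Sample data from the question
data <- data.frame(subject = 1:8, sex = c(1, 2, 2, 1, 2, 1, 1, 2),
                   age = c(35, 29, 31, 46, 64, 57, 49, 58),
                   v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C",
                          "A and B,3", "5, 6 A or C"),
                   stringsAsFactors = FALSE)

# Indicator matrix: one row per subject, one column per value 1..6
ind <- 0 + t(sapply(as.character(data$v1), function(line)
  1:6 %in% unlist(strsplit(line, split = "\\s|,"))))

# Drop the v1-based row names and label the columns x1..x6 (illustrative names)
dimnames(ind) <- list(NULL, paste0("x", 1:6))

# Drop the 4th column (v1) and bind the indicator columns on
result <- cbind(data[-4], ind)
```

Clearing the row names before binding avoids carrying the raw v1 strings into the final data frame.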
IRTFM

One solution:

# Split on runs of non-digits; lapply always returns a list, one element per row
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), as.numeric)
m <- as.data.frame(t(sapply(r, function(x) {
        y <- rep(0, 6)
        y[x[!is.na(x)]] <- 1
        y
     })))
data <- cbind(data[, c("subject", "sex", "age")], m)

#   subject sex age V1 V2 V3 V4 V5 V6
# 1       1   1  35  0  1  0  0  0  0
# 2       2   2  29  0  0  0  0  0  0
# 3       3   2  31  0  0  1  0  1  0
# 4       4   1  46  1  1  0  0  0  0
# 5       5   2  64  0  0  0  1  0  0
# 6       6   1  57  1  0  0  0  0  0
# 7       7   1  49  0  0  1  0  0  0
# 8       8   2  58  0  0  0  0  1  1

Following DWin's awesome solution, m could be modified as:

m <- as.data.frame(t(sapply(r, function(x) {
        0 + 1:6 %in% x[!is.na(x)]
     })))
Arun
  • Thanks for your update. I have an additional question. The sample dataset here is simple, with a maximum value of 6 in v1. Since my real data is huge, is there an easy way to determine the maximum value of v1 automatically instead of specifying it myself? Thanks. – user187454 Feb 07 '13 at 21:37
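
One possible way to address the follow-up question is to extract all the numbers first and take their maximum before building the indicator columns. This is only a sketch under the assumption that every number in v1 is a positive integer code; `nums`, `max_val`, and the `x` column prefix are illustrative names, not part of either answer.

```r
v1 <- c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C")

# Pull every run of digits out of each entry (letters and separators are ignored)
nums <- lapply(regmatches(v1, gregexpr("[0-9]+", v1)), as.integer)

# Largest value seen anywhere in the column
max_val <- max(unlist(nums))

# One indicator column per value 1..max_val; rbind keeps a matrix even if max_val is 1
m <- do.call(rbind, lapply(nums, function(x) 0 + seq_len(max_val) %in% x))
colnames(m) <- paste0("x", seq_len(max_val))
```

If no entry contains a digit at all, `max()` returns `-Inf` with a warning, so that case would need a separate check.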