0

I have read this file into a data.frame in R, and as you can see the 5th column contains some values separated with ";". Is it possible to turn this data.frame to a much larger data.frame and expand the 5th column into a binary vector?

> head(uinfo)
      V1   V2 V3  V4                             V5
1 100044 1899  1   5    831;55;198;8;450;7;39;5;111
2 100054 1987  2   6                              0
3 100065 1989  1  57                              0
4 100080 1986  1  31 113;41;44;48;91;96;42;79;92;35
5 100086 1986  1 129                              0
6 100097 1981  1  75                              0

So, as a simpler example, if my first two rows are:

1 100044 1899  1   5    1;2;4;7
2 100054 1987  2   6    3;8

I want to get:

1 100044 1899  1   5    1 1 0 1 0 0 1 0 0 0
2 100054 1987  2   6    0 0 1 0 0 0 0 1 0 0

Do I have to use another program such as python for preprocessing of the data, or is it possible to do so by some apply function?

Thanks

Vahid Mirjalili
  • 6,211
  • 15
  • 57
  • 80
  • Hello, a reproducible example (ie, something that can be cut+paste'd) would be helpful. You can use `reproduce( )`. Have a look at http://bit.ly/SORepro for instructions – Ricardo Saporta Oct 06 '13 at 17:01

3 Answers3

4

You can try the concat.split.expanded function from my "splitstackshape" package:

library(splitstackshape)
mydf
#       V1   V2 V3 V4      V5
# 1 100044 1899  1  5 1;2;4;7
# 2 100054 1987  2  6     3;8
concat.split.expanded(mydf, "V5", sep=";", fill = 0)
#       V1   V2 V3 V4      V5 V5_1 V5_2 V5_3 V5_4 V5_5 V5_6 V5_7 V5_8
# 1 100044 1899  1  5 1;2;4;7    1    1    0    1    0    0    1    0
# 2 100054 1987  2  6     3;8    0    0    1    0    0    0    0    1

Add drop = TRUE to get rid of the original column.


Here, "mydf" is defined as:

mydf <- structure(list(V1 = c(100044L, 100054L), V2 = c(1899L, 1987L), 
    V3 = 1:2, V4 = 5:6, V5 = c("1;2;4;7", "3;8")), .Names = c("V1", 
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -2L))
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
1

The idea here is that there are two requirements:

  1. split the data on the semicolon`
  2. create the columns, filling empty cols with zero/FALSE

# 1 is straightforward: use strsplit.
# 2 can be accomplished by sequencing along the number of new columns, and checking if they're in the newly splatted string.

  library(data.table)

  largest <- 1e3  # (Whatever your largest expected value)
  newColNames <- as.character(seq(largest))
  dat[,  (newColNames) := as.data.table(t(sapply(strsplit(V5, ";"), "%in%", x=seq(largest))))]

  # if you really want numeric (as opposed to logical)
  dat[, (newColNames) := lapply(.SD, as.numeric), .SDcols=newColNames]
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
1

Using base functions (too many steps I think)

> df <- read.table(text=" 100044 1899  1   5    1;2;4;7
+  100054 1987  2   6    3;8", header=F, stringsAsFactors=F) # data.frame


> pos <- sapply(strsplit(as.character(df[,5]), ";"), as.numeric)
> x <-rep(0, max(unlist(pos)))
> cbind(df, t(sapply(pos, function(y) replace(x, y, 1))))
      V1   V2 V3 V4      V5 1 2 3 4 5 6 7 8
1 100044 1899  1  5 1;2;4;7 1 1 0 1 0 0 1 0
2 100054 1987  2  6     3;8 0 0 1 0 0 0 0 1
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138