Starting from data imported with

dati <- read.csv(file='C:...csv', header=TRUE, sep=";")

I've chosen two variables

id<-dati$post_visid_low
item<-dati$event_list

then

id<-as.character(id)
item<-as.character(item)

dataT <- data.table(id, item)

The structure of dataT is

id   item
1    102, 104, 108,401
2    405, 103, 650, 555, 450
3    305, 109

I want to obtain this frequency matrix with sorted columns:

id  102  103  104  108  109  305  401  405  450  555  650
1     1         1    1              1
2          1                             1    1    1    1
3                         1    1

How can I do this? I tried with

library(Matrix)
id<-as.character(id)
item<-as.character(item)
dataT <- data.table(id, item)
lst <- strsplit(dataT$item, '\\s*,\\s*')
Un1 <- sort(unique(unlist(lst)))
sM <-  sparseMatrix(rep(dataT$id, length(lst)), 
                    match(unlist(lst), Un1), x= 1, 
                    dimnames=list(dataT$id, Un1))

But I receive this error

Error in i + (!(m.i || i1)) : non-numeric argument to binary operator

How can I do that?

  • Extending your approach of splitting item, you can do `idx <- with(d, sort(unique(as.numeric(unlist(strsplit(item, ",")))))); s <- sapply(idx, function(x) grepl(x, d$item)) + 0L ; colnames(s) <- idx` [Which is *nice* as it uses just about every function in base R] – user20650 Feb 06 '16 at 12:20
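The splitting idea from the comment above can also be sketched with `table()`, which counts exact item values instead of pattern-matching with `grepl()`. This is only a minimal sketch: it assumes the example data from the question rebuilt as a data frame called `dat`, and that every item is numeric so the columns can be sorted numerically.

```r
# Example data, assumed to mirror the structure shown in the question
dat <- data.frame(id = 1:3,
                  item = c("102, 104, 108,401", "405, 103, 650, 555, 450", "305, 109"),
                  stringsAsFactors = FALSE)

# Split each string on commas, dropping whitespace around the separators
lst <- strsplit(dat$item, "\\s*,\\s*")

# Repeat each id once per item it owns, then cross-tabulate id against item;
# converting the items to numeric makes table() order the columns numerically
tab <- table(id = rep(dat$id, lengths(lst)),
             item = as.numeric(unlist(lst)))
tab
```

Because `table()` counts occurrences, an item listed twice for the same id would show up as 2 rather than 1.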

1 Answer

We can use the splitstackshape package to help with the splitting, and then a combination of melting and dcasting to get the data into the format you specified (note that it's not always practical to have numerical column names).

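library(splitstackshape)
library(data.table)  # melt() and dcast() below are the data.table versions; attach it explicitly in case splitstackshape does not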

# split the data
step1 <- cSplit(dat, splitCols="item")
step1
#    id item_1 item_2 item_3 item_4 item_5
# 1:  1    102    104    108    401     NA
# 2:  2    405    103    650    555    450
# 3:  3    305    109     NA     NA     NA

# reshape it and remove missings
step2 <- melt(step1, id.vars="id")[!is.na(value),]

# turn to wide
output <- dcast(step2, id~value, fun.aggregate = length)

# or in one line

output <- dcast(melt(cSplit(dat, splitCols="item"), id.vars="id")[!is.na(value),], 
                id~value, fun.aggregate = length)

output
#    id 102 103 104 108 109 305 401 405 450 555 650
# 1:  1   1   0   1   1   0   0   1   0   0   0   0
# 2:  2   0   1   0   0   0   0   0   1   1   1   1
# 3:  3   0   0   0   0   1   1   0   0   0   0   0

Alternatively, you can use cSplit_e from the same package:

cSplit_e(dat, "item", ",", type = "character", fill = 0, drop = TRUE)
#   id item_102 item_103 item_104 item_108 item_109 item_305 item_401 item_405 item_450 item_555 item_650
# 1  1        1        0        1        1        0        0        1        0        0        0        0
# 2  2        0        1        0        0        0        0        0        1        1        1        1
# 3  3        0        0        0        0        1        1        0        0        0        0        0

Data used:

dat <- data.frame(id=1:3, item=c("102, 104, 108,401","405, 103, 650, 555, 450","305, 109"))
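As for the error in the question: `sparseMatrix()` expects integer positions for its row index, and the attempt passes the character ids instead. The `rep()` call also repeats the whole id vector `length(lst)` times rather than repeating each row once per item. A minimal sketch of the corrected attempt, assuming the example data above and that all item codes have the same number of digits (so the character sort matches numeric order):

```r
library(Matrix)

# Example data, same as the "Data used" block
dat <- data.frame(id = 1:3,
                  item = c("102, 104, 108,401", "405, 103, 650, 555, 450", "305, 109"),
                  stringsAsFactors = FALSE)

lst <- strsplit(dat$item, "\\s*,\\s*")
Un1 <- sort(unique(unlist(lst)))

sM <- sparseMatrix(i = rep(seq_along(lst), lengths(lst)),  # integer row position, once per item
                   j = match(unlist(lst), Un1),            # column position of each item
                   x = 1,
                   dims = c(length(lst), length(Un1)),
                   dimnames = list(as.character(dat$id), Un1))

# Dense view for inspection; repeated (row, column) pairs are summed, giving frequencies
as.matrix(sM)
```

The sparse form mainly pays off when there are many ids and many distinct items; for small data the dcast or cSplit_e versions are easier to read.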
A5C1D2H2I1M1N2O1R2T1
Heroka