1

I have a data set like this, with a variable ("item") that contains comma-separated codes:

id  item
1    102, 103,401,
2    108,102,301
3    103, 108 , 405, 505, 708

For each id, I would like to get the frequencies of each separate item, like this:

id  102  103   104   108  301 401 ...
1    1    1                    1
2    1                 1    1
3         1            1

How can I do that?

peak
  • 105,803
  • 17
  • 152
  • 177

3 Answers3

3

We can do this with mtabulate from qdapTools

library(qdapTools)
cbind(dat['id'], mtabulate(strsplit(dat$item, '\\s*,\\s*')))
#  id 102 103 108 301 401 405 505 708
#1  1   1   1   0   0   1   0   0   0
#2  2   1   0   1   1   0   0   0   0
#3  3   0   1   1   0   0   1   1   1

NOTE: Data taken from @thelatemail's post.


Or another option (if we need a sparseMatrix)

library(Matrix)
#split the 'item' column to `list`
lst <- strsplit(dat$item, '\\s*,\\s*')
#get the `unique` elements after `unlist`ing.
Un1 <- sort(unique(unlist(lst)))
#create a `sparseMatrix` by specifying the row
#column index along with dim names (if needed)
sM <-  sparseMatrix(rep(dat$id, lengths(lst)), 
            match(unlist(lst), Un1), x= 1, 
             dimnames=list(dat$id, Un1))
sM
#    3 x 8 sparse Matrix of class "dgCMatrix"
#   102 103 108 301 401 405 505 708
#1   1   1   .   .   1   .   .   .
#2   1   .   1   1   .   .   .   .
#3   .   1   1   .   .   1   1   1

It can be converted to matrix by wrapping with as.matrix

as.matrix(sM)
#   102 103 108 301 401 405 505 708
#1   1   1   0   0   1   0   0   0
#2   1   0   1   1   0   0   0   0
#3   0   1   1   0   0   1   1   1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    thank you, but when I run the code q<-cbind(text['id'], mtabulate(strsplit(text$item, '\\s*,\\s*'))) I obtain this error Error in strsplit(text$item, "\\s*,\\s*") : non-character argument – user2609451 Feb 06 '16 at 09:10
  • 1
    @user2609451 In the dataset that I took from thelatemail's post, the 'item' column is `character`. If you have a `factor` class, convert it to `character` and it should work fine (as `strsplit` needs a `character` class column) ie. `mtabulate(strsplit(as.character(text$item),...` – akrun Feb 06 '16 at 09:13
  • Sorry, starting from my url dati<- ( read.csv(file='C:.....csv', header=TRUE, sep=";")) I've chosen my variabiles the first, id<-dati$post_visid_low...than...the second... item<-dati$event_list ...than..I combine those in dat.... dat<-as.character(c(id, item)) than I run your command lst <- strsplit(dat$item, '\\s*,\\s*') but I obtain this: Error in dat$item : $ operator is invalid for atomic vectors – user2609451 Feb 06 '16 at 09:40
  • I've resolved the above problem, but, when run the function sM , I receive this Error in i + (!(m.i || i1)) : non-numeric argument to binary operator – user2609451 Feb 06 '16 at 10:44
  • @user2609451 As I mentioned above, I was using thelatemail's data and it worked fine. Regarding the `$` operator problem, check whether you have a `matrix` as a dataset. – akrun Feb 06 '16 at 10:52
1

Use strsplit and then take advantage of a factor to make sure all the column counts are included.

spl <- strsplit(dat$item,"\\s*,\\s*")
ulevs <- sort(unique(unlist(spl)))
tab <- t(vapply(
  spl, 
  function(x) table(factor(x,levels=ulevs)),
  FUN.VALUE=numeric(length(ulevs))
))
cbind(dat["id"],tab)

#  id 102 103 108 301 401 405 505 708
#1  1   1   1   0   0   1   0   0   0
#2  2   1   0   1   1   0   0   0   0
#3  3   0   1   1   0   0   1   1   1

Data used:

dat <- read.table(text="id;item
1;102, 103,401,
2;108,102,301
3;103, 108 , 405, 505, 708",header=TRUE,sep=";",stringsAsFactors=FALSE)
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • 1
    Or a modification of your method will be `table(stack(setNames(spl, dat$id))[2:1])` – akrun Feb 06 '16 at 09:04
  • thank you..but when I run the first line I obtain this message Error in dat$item : $ operator is invalid for atomic vectors. I've imported my data set with this command dati<- ( read.csv(file='C....csv', header=TRUE, sep=";")), than I've selected my first variable id<-dati$myfirst and the second with item<-dati$event_list...than I've combined togheter with dat<-as.character(c(id, item)) – user2609451 Feb 06 '16 at 09:56
0

You can do this using strsplit function. Below is my solution

library(data.table)
id <- c(1:3)
item <- c("102, 103,401",  "108,102,301", "103, 108 , 405, 505, 708")
dataT <- data.table(id, item)

reqCol <- unlist(strsplit(dataT$item, split=","))
reqCol <- gsub(" ", "", reqCol)
reqCol <-  unique(reqCol)
reqColNames <- paste0("Col_", reqCol)

for(i in 1:length(reqCol)){
    expr1 <- parse(text = paste0(reqColNames[i], ":=0"))
    expr2 <- parse(text = paste0(reqColNames[i], ":=1"))
    dataT[, eval(expr1)]
    rowIndex <- grep(reqCol[i], dataT$item)
    dataT[rowIndex, eval(expr2)] 
}

I have used data.table instead of data.frame, because data.table is very fast as compared to data.frame.

Kumar Manglam
  • 2,780
  • 1
  • 19
  • 28
  • I would like to know, if the for loop can be removed using vectorized code.. – Kumar Manglam Feb 06 '16 at 06:38
  • thank you very much...but, when I run this comand reqCol <- unlist(strsplit(dataT$item, split=",")) I obtain this error: Error in strsplit(text$item, split = ",") : non-character argument – user2609451 Feb 06 '16 at 09:05