3

I have a dataset like this from a file read into R with read.table():

Nr Result1
 1 "A203,A305,A409,B309,B424,B545"
 2 "A190,A203,A305,B309,B425,B545"
 3 "A203,A305,A410,B280,B309,B425,B545"

Result1 is a string that I would like to split at the "," to count the occurrence of each element in each row. I would like to count the different elements and write out the result in this format:

A190 A203 A305 A409 A410 B280 B309 B424 B425 B545
1    3    3    1    1    1    3    1    2    3

My first thought is to loop over every row, split the string into single elements, create a vector with the first set of elements and for the second row, check if matching element already exists (count+1) or append unknown element to vector with count=1.

I am quite new to R and would appreciate some example code or hints how to implement the single steps with R functions! Many Thanks

user3634356
  • 33
  • 1
  • 3

3 Answers3

7

I think this is what you are looking for:

newvector <- table(unlist(strsplit(as.character(df$Result1), ",")))

Result (stored in newvector):

#>newvector
#A190 A203 A305 A409 A410 B280 B309 B424 B425 B545 
#   1    3    3    1    1    1    3    1    2    3 

The strsplit function splits the vector of strings (Result1) at each comma. The result is a list for each row of your data.frame (df in my example). To turn this list into a vector you use unlist. The table function then creates a table with the frequencies.

talat
  • 68,970
  • 21
  • 126
  • 157
1

Here's an alternative to consider:

The development version of my "splitstackshape" package has a function called concat.split.expanded that "expands" values to their relevant column position in binary representation. You can then drop the first column and take the column sums (using colSums).

Here's some sample data:

mydf <- data.frame(Nr = 1:3,
                   Result1 = c("A190,A203,A305,B309,B425,B545", 
                               "A203,A305,A409,B309,B424,B545", 
                               "A203,A305,A410,B280,B309,B425,B545"))

And here's the implementation I would recommend:

## library(devtools)
## install_github("splitstackshape", "mrdwab", ref = "devel")
library(splitstackshape) ## Install "devel" version from github
colSums(concat.split.expanded(mydf, "Result1", ",", type="character", 
                              fill=0, drop = TRUE)[-1])
# Result1_A190 Result1_A203 Result1_A305 Result1_A409 Result1_A410 Result1_B280 
#            1            3            3            1            1            1 
# Result1_B309 Result1_B424 Result1_B425 Result1_B545 
#            3            1            2            3 

In the long run, this will actually be faster than table:

mydf <- do.call(rbind, replicate(30000, mydf, FALSE))
dim(mydf)
# [1] 90000     2

fun1 <- function() {
  colSums(concat.split.expanded(mydf, "Result1", ",", 
                                type="character", fill=0, drop = TRUE)[-1])
} 
fun2 <- function() table(unlist(strsplit(as.character(mydf$Result1), ",")))
system.time(out1 <- fun1())
#    user  system elapsed 
#    6.10    0.00    6.09 
system.time(out2 <- fun2())
#    user  system elapsed 
#    0.77    0.00    0.76 
all.equal(as.vector(out1), as.vector(out2))
# [1] TRUE
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Thanks for your alternative, I will definitely come back to this solution once my R scripting becomes more serious! – user3634356 May 16 '14 at 21:23
0

use table(unlist(strsplit(yourstring, ","))) Example: if your string is example = "A203,A305,A409,B309,B424,B545", the code below will produce:

R:> example = "A203,A305,A409,B309,B424,B545"
R:> table(unlist(strsplit(example, ",")))

A203 A305 A409 B309 B424 B545 
 1    1    1    1    1    1 

A more complete example would be:

R:> ex <- data.frame(rbind(c(1, "A203,A305,A409,B309,B424,B545"), 
                           c(2, "A190,A203,A305,B309,B425,B545"),  
                           c(3, "A203,A305,A410,B280,B309,B425,B545")))
R:> ex
  X1                                 X2
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> names(ex) <- c("Nr", "Result1")
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545

R:> typeof(ex$Nr)
[1] "integer"
R:> typeof(ex$Result1)
[1] "integer"
R:> ex$Result1 <- as.character(ex$Result1)
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> table(unlist(strsplit(ex$Result[1], ",")))

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1 

Use sapply() for the entire column (Result1), all at once. See bellow.

R:> sapply(ex$Result1, function(x) {table(unlist(strsplit(x, ",")))})
$`A203,A305,A409,B309,B424,B545`

A203 A305 A409 B309 B424 B545 
  1    1    1    1    1    1 

$`A190,A203,A305,B309,B425,B545`

A190 A203 A305 B309 B425 B545 
  1    1    1    1    1    1 

$`A203,A305,A410,B280,B309,B425,B545`

A203 A305 A410 B280 B309 B425 B545 
  1    1    1    1    1    1    1 
flamenco
  • 2,702
  • 5
  • 30
  • 46