Split strings in a matrix column and count the single elements in a new vector

Question

I have a dataset like this from a file read into R with read.table():

Nr Result1
 1 "A203,A305,A409,B309,B424,B545"
 2 "A190,A203,A305,B309,B425,B545"
 3 "A203,A305,A410,B280,B309,B425,B545"

Result1 is a string that I would like to split at the "," to count the occurrence of each element in each row. I would like to count the different elements and write out the result in this format:

A190 A203 A305 A409 A410 B280 B309 B424 B425 B545
1    3    3    1    1    1    3    1    2    3

My first thought is to loop over every row, split the string into single elements, create a vector with the first set of elements and for the second row, check if matching element already exists (count+1) or append unknown element to vector with count=1.

I am quite new to R and would appreciate some example code or hints how to implement the single steps with R functions! Many Thanks

talat · Accepted Answer · 2014-05-13T22:33:02.107

7

I think this is what you are looking for:

newvector <- table(unlist(strsplit(as.character(df$Result1), ",")))

Result (stored in newvector):

#>newvector
#A190 A203 A305 A409 A410 B280 B309 B424 B425 B545 
#   1    3    3    1    1    1    3    1    2    3

The strsplit function splits the vector of strings (Result1) at each comma. The result is a list for each row of your data.frame (df in my example). To turn this list into a vector you use unlist. The table function then creates a table with the frequencies.

edited May 13 '14 at 22:33

answered May 13 '14 at 22:26

talat

68,970
21
126
157

My upvote is only provisional. You _should_ wrap `as.character` around `df$Result1` – IRTFM May 13 '14 at 22:35
@BondedDust thanks, I updated the answer accordingly. – talat May 13 '14 at 22:37
Might be time to change your username to *inteRmediate* ;) – Rich Scriven May 13 '14 at 23:22
@RichardScriven I'll consider that :) – talat May 14 '14 at 04:25
Thanks for this quick answer, it worked perfectly for me. Nice to see how responsive this forum is! – user3634356 May 16 '14 at 21:20

score 1 · Answer 2 · answered May 14 '14 at 05:29

Here's an alternative to consider:

The development version of my "splitstackshape" package has a function called concat.split.expanded that "expands" values to their relevant column position in binary representation. You can then drop the first column and take the column sums (using colSums).

Here's some sample data:

mydf <- data.frame(Nr = 1:3,
                   Result1 = c("A190,A203,A305,B309,B425,B545", 
                               "A203,A305,A409,B309,B424,B545", 
                               "A203,A305,A410,B280,B309,B425,B545"))

And here's the implementation I would recommend:

## library(devtools)
## install_github("splitstackshape", "mrdwab", ref = "devel")
library(splitstackshape) ## Install "devel" version from github
colSums(concat.split.expanded(mydf, "Result1", ",", type="character", 
                              fill=0, drop = TRUE)[-1])
# Result1_A190 Result1_A203 Result1_A305 Result1_A409 Result1_A410 Result1_B280 
#            1            3            3            1            1            1 
# Result1_B309 Result1_B424 Result1_B425 Result1_B545 
#            3            1            2            3

In the long run, this will actually be faster than table:

mydf <- do.call(rbind, replicate(30000, mydf, FALSE))
dim(mydf)
# [1] 90000     2

fun1 <- function() {
  colSums(concat.split.expanded(mydf, "Result1", ",", 
                                type="character", fill=0, drop = TRUE)[-1])
} 
fun2 <- function() table(unlist(strsplit(as.character(mydf$Result1), ",")))
system.time(out1 <- fun1())
#    user  system elapsed 
#    6.10    0.00    6.09 
system.time(out2 <- fun2())
#    user  system elapsed 
#    0.77    0.00    0.76 
all.equal(as.vector(out1), as.vector(out2))
# [1] TRUE

Thanks for your alternative, I will definitely come back to this solution once my R scripting becomes more serious! — user3634356, May 16 '14 at 21:23

flamenco · Answer 3 · 2014-05-13T23:50:47.310

use table(unlist(strsplit(yourstring, ","))) Example: if your string is example = "A203,A305,A409,B309,B424,B545", the code below will produce:

R:> example = "A203,A305,A409,B309,B424,B545"
R:> table(unlist(strsplit(example, ",")))

A203 A305 A409 B309 B424 B545 
 1    1    1    1    1    1

A more complete example would be:

R:> ex <- data.frame(rbind(c(1, "A203,A305,A409,B309,B424,B545"), 
                           c(2, "A190,A203,A305,B309,B425,B545"),  
                           c(3, "A203,A305,A410,B280,B309,B425,B545")))
R:> ex
  X1                                 X2
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> names(ex) <- c("Nr", "Result1")
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545

R:> typeof(ex$Nr)
[1] "integer"
R:> typeof(ex$Result1)
[1] "integer"
R:> ex$Result1 <- as.character(ex$Result1)
R:> ex
  Nr                            Result1
1  1      A203,A305,A409,B309,B424,B545
2  2      A190,A203,A305,B309,B425,B545
3  3 A203,A305,A410,B280,B309,B425,B545
R:> table(unlist(strsplit(ex$Result[1], ",")))

A203 A305 A409 B309 B424 B545 
   1    1    1    1    1    1

Use sapply() for the entire column (Result1), all at once. See bellow.

R:> sapply(ex$Result1, function(x) {table(unlist(strsplit(x, ",")))})
$`A203,A305,A409,B309,B424,B545`

A203 A305 A409 B309 B424 B545 
  1    1    1    1    1    1 

$`A190,A203,A305,B309,B425,B545`

A190 A203 A305 B309 B425 B545 
  1    1    1    1    1    1 

$`A203,A305,A410,B280,B309,B425,B545`

A203 A305 A410 B280 B309 B425 B545 
  1    1    1    1    1    1    1

Split strings in a matrix column and count the single elements in a new vector

3 Answers3

Linked

Related