Extract pattern from set of strings in R

Question

I am looking to parse a dataset and match it up with a phylogenetic tree I have already made in R. To do that, I am trying to simplify the tip labels so they match the ones in my tree.

For instance, I want to take "gi|399148998|gb|JN638572|" and simplify it down to just "JN638572" (the accession number), and I need to do this 61 times (61 samples). Each accession number also starts at the same position.

## thanks for the data serban
set.seed(1)

mydat <- replicate(61, paste0(paste0(sample(letters,2), collapse=""),"|",
                              round(runif(1,1e8,1e9-1)),"|",
                              paste0(sample(letters,2), collapse=""),"|",
                              paste0(sample(LETTERS,2), collapse=""),
                              round(runif(1,1e6,1e7-1)),"|"))
head(mydat)
# [1] "gj|615568026|xf|XZ6947179|" "qb|285377117|er|JT5479293|" "sy|442031661|ux|FQ2129996|"
# [4] "gj|112051300|jv|IM6396092|" "me|844635986|rt|CS4701469|" "vq|804639485|on|UA5295070|"

Instead of posting a screenshot, you could copy and paste the output of `dput(head(listoftiplabels))` — C_Z_, Dec 08 '15 at 20:50
You should include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in the question itself. No pictures of data. — MrFlick, Dec 08 '15 at 20:51
Basically, `unlist(strsplit("gi|399148998|gb|JN638572|", "[|]"))[4]` might work — Severin Pappadeux, Dec 08 '15 at 20:53
this should simply be `gsub('([A-Z]{2}\\d+)|.', '\\1', mydat)` if the desired part is always two capital letters followed by digits — rawr, Dec 08 '15 at 21:46

Serban Tanasa · Answer 1 · 2015-12-08T21:33:13.880

I would recommend against using for loops in R when you can avoid them, since R can operate on whole vectors at once. For your particular case, this ought to do it:

library(stringr)

# Generate some data:
mydat <- replicate(61, paste0(paste0(sample(letters,2), collapse=""),"|",
                              round(runif(1,1e8,1e9-1)),"|",
                              paste0(sample(letters,2), collapse=""),"|",
                              paste0(sample(LETTERS,2), collapse=""),
                              round(runif(1,1e6,1e7-1)),"|"))
head(mydat)
# [1] "pg|451576916|kj|FV9562908|" "dt|707843618|sj|KZ3658708|" "lb|507989738|lc|ML2309736|"
# [4] "nb|448725577|fo|DW1950100|" "iv|337265231|us|CR5163970|" "ew|254260770|rw|LB2404167|"

# Stuff you actually need:
results <- str_match(mydat, ".{2}\\|.*\\|.{2}\\|(.*)\\|")[,2]

# Results:
head(results)
# [1] "FV9562908" "KZ3658708" "ML2309736" "DW1950100" "CR5163970" "LB2404167"

I am using a regex (regular expression). The simpler pattern ".*\\|(.*)\\|" would also work, because .* is "greedy" and matches as much as it can, but I have spelled out the longer pattern to make it easier to explain. .{2} matches exactly two characters of any kind (in general, .{N} matches N characters), and .* matches as many characters as it takes to reach the next part of the pattern, namely \\|. The | is a special character in regex (it means "or"), so it has to be escaped as \\| for the regex engine to treat it as a literal pipe; the backslash is doubled because R string literals also use \ as an escape character. The parentheses form the capture group, i.e. the part you want returned.
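To make the point about greedy matching concrete, here is a minimal sketch in base R (no stringr needed) that compares the short greedy pattern with a stricter one built from [^|]*, which cannot run past a pipe. The `labels` vector is made-up sample data in the same format as the question, and the stricter pattern is only one possible alternative, not something the original pattern requires.

```r
# Hypothetical sample labels in the "xx|digits|xx|ACCESSION|" format from the question
labels <- c("gi|399148998|gb|JN638572|", "gj|615568026|xf|XZ6947179|")

# Greedy version: the first .* takes as much as it can while still leaving
# "|<something>|" at the end for the rest of the pattern
greedy <- sub(".*\\|(.*)\\|", "\\1", labels)

# Stricter version: [^|]* matches any run of non-pipe characters, so each
# field is pinned down explicitly and the whole label must fit the format
explicit <- sub("^[^|]*\\|[^|]*\\|[^|]*\\|([^|]*)\\|$", "\\1", labels)

print(greedy)
print(explicit)
```

Both calls should give the same accession numbers for labels of this shape. The stricter pattern may be the safer choice if some labels could contain extra fields, because `sub` leaves a string unchanged when the pattern does not match.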

str_match is a function in the stringr package (which you may have to install with install.packages("stringr")). It returns a character matrix: the first column holds the whole match, if one is found, and the next column holds the first capture group. The [,2] notation keeps only that second column.
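If you would rather not add a package, the same "whole match first, capture group second" layout can be seen with base R's regexec and regmatches. This is a rough sketch of a base R analogue, not what str_match does internally; the `labels` vector and the `accession` name are made up for illustration, and the second label is deliberately one that does not match.

```r
# Hypothetical input: one well-formed label and one that will not match
labels <- c("gi|399148998|gb|JN638572|", "no pipes here")

# regexec() finds the match and capture-group positions;
# regmatches() turns them into strings, one list element per input
m <- regmatches(labels, regexec(".{2}\\|.*\\|.{2}\\|(.*)\\|", labels))

# For a match, element 1 is the whole match and element 2 is capture group 1;
# for a non-match the element is character(0), so fall back to NA
accession <- vapply(
  m,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1)
)

print(m[[1]])
print(accession)
```

The NA fallback mirrors how str_match reports rows that do not match, which makes unmatched tip labels easy to spot with is.na().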
