How to very efficiently extract specific pattern from characters?

Question

I have big data like this :

> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5        
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5   
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5      
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5     
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5        
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5

what I want to do is that, in every row, I want to select the name after word mature= and also the word after Gene= and then pater them together with

paste(a,b, sep="-")

for example, the expected output from first two rows would be like :

hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9

so, the final implementation is like this:

for(i in 1:nrow(Data)){
    Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
    Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
    Data[i,4] <- as.numeric(sub("pvalue=","",Name))
    print(i)
}

which work well, but it's very slow. the size of Data is very big and it has 200,000,000 rows. this implementation is very slow for that. how can I speed it up ?

We don't have `Data` so when formulating questions like this it is better to show the data like this: `x <- Data[1:7, 1]; dput(x)` — G. Grothendieck, Jan 06 '15 at 14:34
your edit made this a little bit of a moving target -- it wasn't initially clear that you needed a computationally efficient solution. I would encourage you to post your own answer to this question that shows benchmarks for a reasonably large data set (e.g. try it on the first 100,000 rows of your data set) for *all* of the answers provided below, following the format given in the answers to [this question](http://stackoverflow.com/questions/27747426/how-to-efficiently-read-the-first-character-from-each-line-of-a-text-file/). — Ben Bolker, Jan 06 '15 at 20:22
You might also check out the [stringi package](http://cran.r-project.org/web/packages/stringi/index.html), which is intended for fast string processing, and the `data.table` and or `dplyr` packages — Ben Bolker, Jan 06 '15 at 20:22

score 11 · Answer 1 · answered Jan 06 '15 at 13:55

If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the brackets below) everything from the equals sign upto the pipe symbol, and from the Gene= to the end, and paste them together with a minus sign:

sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])

score 5 · Answer 2 · answered Jan 06 '15 at 14:02

Another option is to use read.table with = as a separator then pasting the 2 columns:

res = read.table(text=txt,sep='=')
paste(sub('[|].*','',res$V2),            ## get rid from last part here
      sub('^ +| +$','',res$V4),sep='-')  ## remove extra spaces 

[1] "hsa-miR-5087-OR4F5"     "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"      "hsa-miR-659-3p-OR4F5"  
[5] "hsa-miR-5197-3p-OR4F5"  "hsa-miR-5093-OR4F5"     "hsa-miR-650-OR4F5"

G. Grothendieck · Answer 3 · 2015-01-06T17:22:47.967

The simple sub solution already given looks quite nice but just in case here are some other approaches:

1) read.pattern Using read.pattern in the gsubfn package we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question:

library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")

giving:

[1] "hsa-miR-5087-OR4F5"     "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"     
[4] "hsa-miR-659-3p-OR4F5"   "hsa-miR-5197-3p-OR4F5"  "hsa-miR-5093-OR4F5"    
[7] "hsa-miR-650-OR4F5"

The intermediate data frame, DF, that was produced looks like this:

> DF
      V1               V2         V3      V4   V5    V6
1 mature     hsa-miR-5087 mir_Family       - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family  mir-26 Gene OR4F9
3 mature      hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature   hsa-miR-659-3p mir_Family       - Gene OR4F5
5 mature  hsa-miR-5197-3p mir_Family       - Gene OR4F5
6 mature     hsa-miR-5093 mir_Family       - Gene OR4F5
7 mature      hsa-miR-650 mir_Family mir-650 Gene OR4F5

Here is a visualization of the regular expression we used:

(\w+)=([^|]*)

Regular expression visualization

Debuggex Demo

1a) names We could make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement:

DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))

paste(DF$mature, DF$Gene, sep = "-") # same answer as above

The DF in this section that was produced looks like this. It has 3 instead of 6 columns and remaining columns were used to determine appropriate column names:

> DF
            mature mir_Family  Gene
1     hsa-miR-5087          - OR4F5
2 hsa-miR-26a-1-3p     mir-26 OR4F9
3      hsa-miR-448    mir-448 OR4F5
4   hsa-miR-659-3p          - OR4F5
5  hsa-miR-5197-3p          - OR4F5
6     hsa-miR-5093          - OR4F5
7      hsa-miR-650    mir-650 OR4F5

2) strapplyc

Another approach using the same package. This extracts the fields coming after a = and not containing a | producing a list. We then sapply over that list pasting the first and third fields together:

sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))

giving the same result.

Here is a visualization of the regular expression used:

=([^|]*)

Regular expression visualization

Debuggex Demo

score 4 · Answer 4 · answered Jan 06 '15 at 13:52

Here is one approach:

Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5        
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5   
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5      
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5     
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5        
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5"     "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"      "hsa-miR-659-3p-OR4F5"   "hsa-miR-5197-3p-OR4F5"  "hsa-miR-5093-OR4F5"    
# [7] "hsa-miR-650-OR4F5"

Cath · Answer 5 · 2015-01-06T13:58:36.110

Maybe not the more elegant but you can try :

sapply(Data[,1],function(x){
                   parts<-strsplit(x,"\\|")[[1]]
                   y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
                   return(y)
                })

Example

 Data<-data.frame(col1=c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5","mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),col2=1:2,stringsAsFactors=F)

> Data[,1]
[1] "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5"          "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"

> sapply(Data[,1],function(x){
+                        parts<-strsplit(x,"\\|")[[1]]
+                        y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
+                        return(y)
+                     })
         mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5 mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9 
                                "hsa-miR-5087-OR4F5"                             "hsa-miR-26a-1-3p-OR4F9"

How to very efficiently extract specific pattern from characters?

5 Answers5