Is there pattern matching and capturing functionality in R, as there is in Perl?

Question

I have a text file and am trying to extract patterns from it. In Perl, I used $1 to capture the matched text, as described in "What does $1 mean in Perl?".

I wonder if R has similar capabilities. Here is a sample of the data I mean:

a=readLines('xxx.txt')
"{1:F21CRESUS33XLIQ9590112170}{2:O1030747170228BNPAGB22XCIT95901121701702280629U}{3:{103:GLBH}}{4"
 [914] ":20:PK836J9GD2HI7SWQ"                                                                            
 [915] ":23B:CRED"                                                                                       
 [916] ":32A:170214USD2154,252"                                                                          
 [917] "50:ABNYUS33XXXX"                                                                                 
 [918] "/1Jose Lugo"                                                                                     
 [919] "/2931 Corte De Luna"                                                                             
 [920] "/3 Seattle"                                                                                      
 [921] "/498104, United States"                                                                          
 [922] "59F:BPAHCUHHXXXX"                                                                                
 [923] "/1 Jossef   Goldberg"                                                                            
 [924] "/21220 Bradford Way"                                                                             
 [925] "/3 Seattle"                                                                                      
 [926] "/498104, United States"                                                                          
 [927] ":71A:OUR"                                                                                        
 [928] "-}"                                                                                              
 [929] "{1:L01BARCGB21X05G7115765182}{2:O1030946170226ABBYGB2LXXXX71157651821702261046S}{3:{103:WJHX}}{4"
 [930] ":20:YZDSKFJNXV4BE3MP"                                                                            
 [931] ":23B:CRED"                                                                                       
 [932] ":32A:170214USD63362,31"                                                                          
 [933] "50:ABBYGB2LXXXX"                                                                                 
 [934] "/1Jossef Goldberg"                                                                               
 [935] "/21220 Bradford Way"                                                                             
 [936] "/3 Seattle"

What I would like to do is test whether a line starts with a specific character sequence, then search that line for another pattern, and pull the matches out (as in Perl).

In pseudocode:

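pattern1 <- '^\\{1:'   # "{" is escaped so it is read as a literal brace
pattern2 <- 'CU'
if (!is.null(line)) {
  if (grepl(pattern1, line)) {
    if (grepl(pattern2, line)) {
      # start and stop are placeholders for the positions to extract;
      # other patterns would be pulled out here too if they match the regex
      print(substr(line, start, stop))
    }
  }
}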

I also wonder how to set this up so that it reads through the lines one by one.

Maybe something like `x <- paste0(a, collapse = "\n")` and then `regmatches(x, gregexpr("(?sm:^{1.*?CU)(?-s:.*)", x, perl=TRUE))`? — Wiktor Stribiżew, Feb 14 '17 at 14:52
that works Wiktor. What does the ?sm: and ?s: stand for in the gregexpr? Thanks — frank, Feb 14 '17 at 15:06

score 2 · Accepted Answer · answered Feb 14 '17 at 15:13

First, I suggest joining the lines with \n:

x <- paste0(a, collapse = "\n")

Then you may grab your matches with

regmatches(x, gregexpr("(?sm:^{1.*?CU)(?-s:.*)", x, perl=TRUE))
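To see what this returns, here is a minimal sketch on an abridged sample modelled on the lines in the question. The second step, which uses regexec() to pull out a capture group much like $1 in Perl, is my own illustration: the choice of the 59F field and the variable names m, caps and field are assumptions, not part of the original data layout. It also assumes R 3.3.0 or later, where regexec() accepts perl = TRUE.

```r
# Abridged sample modelled on the question's data
a <- c(
  "{1:F21CRESUS33XLIQ9590112170}{2:O1030747170228BNPAGB22XCIT95901121701702280629U}{3:{103:GLBH}}{4",
  ":20:PK836J9GD2HI7SWQ",
  ":23B:CRED",
  "59F:BPAHCUHHXXXX",
  "/1 Jossef   Goldberg",
  "-}"
)

# Join the lines so that one match can span several of them
x <- paste0(a, collapse = "\n")

# Each match runs from a line starting with "{1" to the end of the
# first line that contains "CU"
m <- regmatches(x, gregexpr("(?sm:^{1.*?CU)(?-s:.*)", x, perl = TRUE))[[1]]

# Illustrative capture step (hypothetical field choice): element 1 of each
# regexec result is the whole match, element 2 is the first capture group
caps <- regmatches(m, regexec("(?m)^59F:(\\w+)$", m, perl = TRUE))
field <- sapply(caps, `[`, 2)
print(field)
```

If you need the matched block as separate lines again, strsplit(m, "\n") splits it back up.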

The (?sm:^{1.*?CU)(?-s:.*) is a regex pattern that matches:

(?sm:^{1.*?CU) - the start of a line (the m flag makes ^ match at line start positions), then the literal character sequence {1, then .*? matches any 0+ characters, as few as possible (*? is a lazy quantifier, and the s flag lets . match any character, including a newline), up to the first CU
(?-s:.*) - the .* matches the rest of the line (the (?-s:...) group turns off the DOTALL modifier that (?s) enabled earlier).
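As for what the s and m flags do on their own, the following small sketch on a two-line toy string may help. The string and the variable names are made up for illustration, and everything is run with perl = TRUE as above.

```r
x <- "ab\ncd"   # toy input: two lines

# s (DOTALL): with it, . also matches the newline
dotall    <- regmatches(x, regexpr("(?s:a.*)", x, perl = TRUE))  # "ab\ncd"
no_dotall <- regmatches(x, regexpr("a.*", x, perl = TRUE))       # "ab"

# m (MULTILINE): with it, ^ matches at the start of every line
multi  <- regmatches(x, gregexpr("(?m:^.)", x, perl = TRUE))[[1]]  # "a" "c"
single <- regmatches(x, gregexpr("^.", x, perl = TRUE))[[1]]       # "a"

print(list(dotall, no_dotall, multi, single))
```

Combining both flags, as in (?sm:...), lets ^ anchor at any line start while . runs across line breaks until the lazy quantifier finds CU.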
