I am stuck on regular expressions yet again but this time in R.

I have a vector, and for each element I would like to extract the string between square brackets [ ]. However, some elements contain more than one [ ] group, so my code recovers every bracketed string in that element. In all cases I need only the first bracketed string, not the second or later ones. An example of the data I have is:

comp541_c0_seq1     gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]
comp5041_c0_seq1    gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]

The code I have been using to extract the strings and store them in a new column of the data frame is:

pattern <- "\\[\\w*\\s\\w*]"
match<- gregexpr(pattern, data$Description)
data$Species <- regmatches(data$Description, match)

The structure of the data frame I am using is:

'data.frame':    67911 obs. of  6 variables:
 $ Column1           : Factor w/ 67911 levels "comp100012_c0_seq1 ",..: 3344 8565 17875 18974 19059 19220 21429 29791 40214 48529 ...
 $ Description     : Factor w/ 26038 levels "0.0","1.13142e-173",..: NA NA NA NA NA NA NA NA 7970 NA ...

So the problem with my pattern match is that it returns a vector (Species) where some of the rows have:

[Glycine max] # this is good
c("[Solanum lycopersicum]", "[Solanum lycopersicum]") # I only need one set returned

What I would like is:

[Glycine max]
[Solanum lycopersicum]

I have been trying every way I can with the regular expression. Would anyone know how to improve what I have to just extract the first instance of the string within [ ]?

Thanks in advance.

DJF
  • Are you looking for something like in [this question](http://stackoverflow.com/questions/30027266) or [this question](http://stackoverflow.com/questions/29681763)? – Sebastian Simon May 25 '15 at 01:18
  • Use `regexpr` instead of `gregexpr` to get a single match. (Your title threw me off, since you clearly know how to handle square brackets already, by the way.) – Frank May 25 '15 at 01:19
  • Frank, I think I did use regmatches; the code I used in R is posted above. Xufox, I'm not sure what you're asking. – DJF May 25 '15 at 01:22
  • Hmm, OK, I'll give that a try. – DJF May 25 '15 at 01:23
  • Well, regexpr is not working. It's throwing an error: Error in `$<-.data.frame`(`*tmp*`, "Species", value = c("[Glycine max]", : replacement has 38383 rows, data has 67911 – DJF May 25 '15 at 01:27
  • @djfreeze - that's an issue with regmatches not returning anything when there's no `[]`, thus you get a different length between result and replacement. Try it without doing any `<-` assignment. You'll see it works. You need to assign to a subset of `data$Description` – thelatemail May 25 '15 at 02:04
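
The length mismatch discussed in these comments can be reproduced on a small vector. The sketch below is only an illustration: the `txt` values and the helper names `first_only`, `all_matches` and `first_aligned` are made up, and the pattern is the one from the question. It also shows one possible way to keep `gregexpr` and still end up with exactly one value per input element.

```r
# Made-up example vector: one match, no match, two matches
txt <- c("[Glycine max]",
         "no species here",
         "[Solanum lycopersicum] x [Solanum lycopersicum]")
pattern <- "\\[\\w*\\s\\w*]"

# regexpr + regmatches silently drops elements without a match,
# so the result is shorter than the input
first_only <- regmatches(txt, regexpr(pattern, txt))
length(first_only)  # 2, while length(txt) is 3

# gregexpr + regmatches returns a list with one entry per input element;
# taking the first match of each entry (or NA if it is empty) keeps lengths aligned
all_matches <- regmatches(txt, gregexpr(pattern, txt))
first_aligned <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[1] else NA_character_,
  character(1)
)
first_aligned
# [1] "[Glycine max]"          NA                       "[Solanum lycopersicum]"
```

Because `first_aligned` has the same length as `txt`, it could be assigned to a data frame column directly.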

2 Answers

I think this example should illuminate your problem:

txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here")
pattern <- "\\[\\w*\\s\\w*]"
mat <- regexpr(pattern,txt)
#[1]  1  1 -1
#attr(,"match.length")
#[1] 14 15 -1
txt[mat != -1] <- regmatches(txt, mat)
txt
#[1] "[Bracket text]"      "[Bracket text1]"     "No brackets in here"

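Or if you want to do it all in one go and return NA values for non-matches, try:

# start from an all-NA vector and fill only the positions that matched
replace(rep(NA_character_, length(txt)), mat != -1, regmatches(txt, mat))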
#[1] "[Bracket text]"  "[Bracket text1]" NA 
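
As a rough sketch of how this might carry over to a data frame like the one in the question, where `Description` is a factor containing NA values: the toy `data` below is invented for illustration, and `desc` and `hit` are hypothetical helper names. The extra `!is.na(mat)` check is there because `regexpr` returns NA for NA input, and NA is not allowed in a logical index used for assignment.

```r
# Invented miniature of the asker's data frame (Description stored as a factor)
data <- data.frame(
  Description = c(
    "PREDICTED: uncharacterized protein LOC100809655 [Glycine max]",
    "[Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]",
    NA,
    "1.13142e-173"
  ),
  stringsAsFactors = TRUE
)

pattern <- "\\[\\w*\\s\\w*]"
desc <- as.character(data$Description)  # work on plain character strings
mat <- regexpr(pattern, desc)           # first match only; -1 if none, NA for NA input

hit <- !is.na(mat) & mat != -1          # rows that really contain a match
data$Species <- NA_character_           # start with NA in every row
data$Species[hit] <- regmatches(desc, mat)  # fill only the matching rows
data$Species
# [1] "[Glycine max]"          "[Solanum lycopersicum]" NA                       NA
```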
thelatemail
  • Hi. I'm sorry, I'm looking at the output generated from my mat and it's not the same as what you're showing. For example: # [1] 82 152 # attr(,"match.length") # [1] 16 16 – DJF May 25 '15 at 12:28
  • @djfreeze - what do you mean? I simply gave an example text and applied your pattern to it to show how the process works. Why would you expect that to give the same result as when applied to your data? – thelatemail May 25 '15 at 22:30

Using the base-R facilities for string manipulation is just making life hard for yourself. Use rebus to create the regular expression, and stringi (or stringr) to get the matches.

library(rebus)
library(stringi)

txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here") # thanks, thelatemail
pattern <- OPEN_BRACKET %R% 
  alnum(1, Inf) %R% 
  space(1, Inf) %R% 
  alnum(1, Inf) %R% 
  "]"
stri_extract_first_regex(txt, pattern)
## [1] "[Bracket text]"  "[Bracket text1]" NA

I suspect that you probably don't want to keep those square brackets. Try this variant:

pattern <- OPEN_BRACKET %R% 
  capture(
    alnum(1, Inf) %R% 
    space(1, Inf) %R% 
    alnum(1, Inf)
  ) %R% 
  "]"
stri_match_first_regex(txt, pattern)[, 2]
## [1] "Bracket text"  "Bracket text1" NA
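
For readers who would rather write the regular expression by hand, the same stringi call also accepts a plain pattern string. The pattern below is a hand-written approximation, not the exact output of rebus (for instance, `\w` also matches underscores, which `alnum` does not), and `plain_pattern` and `species` are hypothetical names.

```r
library(stringi)

txt <- c("[Bracket text]", "[Bracket text1] and [Bracket text2]", "No brackets in here")

# Hand-written pattern (an assumption, not rebus output): one capture group
# around "word, whitespace, word" between literal square brackets
plain_pattern <- "\\[(\\w+\\s+\\w+)\\]"

# Column 1 of the result is the full match, column 2 the first capture group
species <- stri_match_first_regex(txt, plain_pattern)[, 2]
species
## [1] "Bracket text"  "Bracket text1" NA
```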
Richie Cotton
  • Hi, thanks, this is great. I guess the one problem is I want to omit bracket text1, but I'm not sure how to do that. – DJF May 25 '15 at 12:16
  • Oh great! Thanks, I got it to work. I'm not sure I understand the code, but I'll read through this. Thanks a lot! – DJF May 25 '15 at 12:37
  • The stri_extract part is a little simpler, yes, but I really don't see the value in rebus. Why would you abandon using a (fairly) standard logic that can be applied across bash scripts, Perl, Javascript, and other statistics programs like SAS and Stata? – thelatemail May 25 '15 at 22:43
  • @thelatemail The point of rebus is that I struggle to remember regular expression syntax. And even when I can remember it, I find that they very quickly become tricky to read (and trickier to debug) when they get long. So rebus lets you generate regular expressions in a halfway readable manner, which is excellent when trying to remember what you did six months ago, or when sharing code with less technical colleagues. – Richie Cotton May 26 '15 at 06:36
  • Similarly, the code by Richie Cotton helped me fix my problem. If you could suggest how to correct my pattern to extract the data needed, that would be great, but I could not find a solution with my current knowledge. – DJF May 27 '15 at 19:03
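
On the last comment's question about adapting the original pattern in base R: one possible approach, sketched here under the assumption that PCRE lookarounds (enabled with `perl = TRUE`) are acceptable, is to require the brackets without including them in the match. The names `inner` and `mat` are illustrative only.

```r
txt <- c("[Bracket text]", "[Bracket text1] and [Bracket text2]", "No brackets in here")

# Lookbehind (?<=\[) and lookahead (?=\]) require the brackets to be present
# but leave them out of the match; both need perl = TRUE
pattern <- "(?<=\\[)\\w+\\s\\w+(?=\\])"
mat <- regexpr(pattern, txt, perl = TRUE)  # first match per element, -1 if none

inner <- rep(NA_character_, length(txt))   # NA for elements without a match
inner[mat != -1] <- regmatches(txt, mat)   # fill only the matching positions
inner
# [1] "Bracket text"  "Bracket text1" NA
```

Dropping the two lookarounds and writing the brackets literally would return the strings with their brackets, as in the question.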