I am stuck on regular expressions yet again but this time in R.
The problem I am facing is that I have a vector, and for each row I would like to extract the string between two square brackets [ ]. However, some rows contain more than one [ ] group, so I end up recovering every bracketed string in that row. In all cases I only need the first bracketed string, not the second or later ones. The example dataframe I have is:
comp541_c0_seq1 gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]
comp5041_c0_seq1 gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]
The code I have been using, which recovers the matching strings and stores them in a new column of the dataframe, is:
pattern <- "\\[\\w*\\s\\w*]"
match <- gregexpr(pattern, data$Description)
data$Species <- regmatches(data$Description, match)
The structure of the dataframe that I am using is:
'data.frame': 67911 obs. of 6 variables:
$ Column1 : Factor w/ 67911 levels "comp100012_c0_seq1 ",..: 3344 8565 17875 18974 19059 19220 21429 29791 40214 48529 ...
$ Description : Factor w/ 26038 levels "0.0","1.13142e-173",..: NA NA NA NA NA NA NA NA 7970 NA ...
So the problem with my pattern match is that it returns a vector (Species) where some of the rows have:
[Glycine max] # this is good
c("[Solanum lycopersicum]", "[Solanum lycopersicum]") # I only need one set returned
What I would like is:
[Glycine max]
[Solanum lycopersicum]
I have been trying every way I can with the regular expression. Would anyone know how to improve what I have to just extract the first instance of the string within [ ]?
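One direction that might work, as a minimal sketch I have not confirmed on the full data: regexpr() reports only the first match per element, whereas gregexpr() reports all of them. The names desc, hit and species are illustrative only, and the two sample strings are copied from the example rows above. On the real dataframe, I assume Description would first need as.character(), since it is a factor.

```r
# Illustrative sample built from the two example rows above (not the real dataframe)
desc <- c(
  "gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]",
  "gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]"
)

pattern <- "\\[\\w*\\s\\w*]"

# regexpr() returns the position of the first match only (-1 when there is none)
m <- regexpr(pattern, desc)

# regmatches() drops elements without a match, so fill a pre-allocated
# character vector and leave unmatched or NA rows as NA
hit <- !is.na(m) & m != -1
species <- rep(NA_character_, length(desc))
species[hit] <- regmatches(desc, m)

print(species)
```

With the real data, the same steps would presumably be applied to as.character(data$Description) and the result assigned to data$Species.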
Thanks in advance.