
I need to capture the title between the words TITLE and JOURNAL, but skip the case in which the captured string is Direct Submission.
For instance, in the following text,

  TITLE     The Identification of Novel Diagnostic Marker Genes for the
            Detection of Beer Spoiling Pediococcus damnosus Strains Using the
            BlAst Diagnostic Gene findEr
  JOURNAL   PLoS One 11 (3), e0152747 (2016)
   PUBMED   27028007
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 462)
  AUTHORS   Behr,J., Geissler,A.J. and Vogel,R.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-AUG-2015) Technische Mikrobiologie, Technische

the captured string needs to be only
'The Identification of Novel Diagnostic Marker Genes for the Detection of Beer Spoiling Pediococcus damnosus Strains Using the BlAst Diagnostic Gene findEr', either with or without new line characters (preferably without new line characters).
I tried applying regular expressions such as those offered here and here, but couldn't adapt them to my needs.
Thanks.

random

1 Answer


(?<=TITLE)[\S\s]*?(?=JOURNAL)

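This should work. `(?<=TITLE)` is a lookbehind that makes sure the match is preceded by TITLE. `(?=JOURNAL)` is a lookahead that makes sure it is followed by JOURNAL. The lazy `[\S\s]*?` in between matches any characters, including newlines, and stops at the first JOURNAL.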

To exclude Direct Submission, use `(?<=TITLE)(?!\s*Direct Submission)[\S\s]*?(?=JOURNAL)`. Note that the negative lookahead also rejects any title that merely starts with Direct Submission. Here is the result.
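As a rough illustration, here is how the second pattern might be applied in Python. This is only a sketch: the question does not say which language or regex engine is in use, so Python's `re` module is an assumption, and the names `record`, `TITLE_RE` and `extract_titles` are made up for the example. The sample text is the one from the question. Collapsing the whitespace afterwards gives the title without newline characters.

```python
import re

# Sample text copied from the question.
record = """\
  TITLE     The Identification of Novel Diagnostic Marker Genes for the
            Detection of Beer Spoiling Pediococcus damnosus Strains Using the
            BlAst Diagnostic Gene findEr
  JOURNAL   PLoS One 11 (3), e0152747 (2016)
   PUBMED   27028007
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 462)
  AUTHORS   Behr,J., Geissler,A.J. and Vogel,R.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-AUG-2015) Technische Mikrobiologie, Technische
"""

# Lookbehind for TITLE, negative lookahead to skip "Direct Submission",
# lazy match of anything (newlines included) up to the next JOURNAL.
TITLE_RE = re.compile(r"(?<=TITLE)(?!\s*Direct Submission)[\S\s]*?(?=JOURNAL)")


def extract_titles(text):
    """Return each matched title with runs of whitespace collapsed to one space."""
    return [" ".join(m.group().split()) for m in TITLE_RE.finditer(text)]


titles = extract_titles(record)
for title in titles:
    print(title)
```

The `" ".join(m.group().split())` step is what removes the line breaks and the indentation of the continuation lines. Keep in mind that the pattern relies on the plain words TITLE and JOURNAL, so a title that itself contains JOURNAL in capitals would be cut short.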

llesha
  • Consider using `[\W\w]` or `[\S\s]` instead of `[\w\s]`. Currently, this will fail to match strings containing punctuation characters, for example. – 41686d6564 stands w. Palestine Mar 12 '23 at 09:13
  • Thanks @llesha. I added to the question a scenario in which a specific string should be excluded. The string that needs to be excluded is `Direct Submission`. Can you modify your answer accordingly? – random Mar 12 '23 at 09:18
  • @random, I guess, this should work: `(?<=TITLE)(?!\s*Direct Submission)[\S\s]*(?=JOURNAL)` – llesha Mar 12 '23 at 10:49
  • Thanks @llesha. I already tried that and it didn't work: https://regex101.com/r/E0Pj8w/1. – random Mar 12 '23 at 10:57
  • I modified your suggested expression so that it's non-greedy, and it works. Thanks. – random Mar 12 '23 at 11:06
  • @random, I modified my second regex with your suggestion and added a link to show that it works with multiple entries. – llesha Mar 12 '23 at 12:02