I'm trying to extract PMID values, which are PubMed article identifiers. A typical one looks like: <PMID Version=\"1\">30556505</PMID>
I extract that with:
strapplyc(startingString, "<PMID Version=\"1\">(.*?)</PMID>", simplify = c)
I use strapplyc because there can be several of those PMID values in the XML string. However, I do not want some of them, specifically those wrapped in a CommentsCorrectionsList tag, for example:
<CommentsCorrectionsList> <CommentsCorrections RefType=\"CommentIn\"> <RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource> <PMID Version=\"1\">30641052</PMID> </CommentsCorrections> </CommentsCorrectionsList>
How would the regular expression need to be changed to ignore the PMID values inside the CommentsCorrectionsList tag?
The only package involved is gsubfn, which provides strapplyc.
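For reference, here is a minimal sketch of one possible workaround rather than a single-regex solution: remove every CommentsCorrectionsList block first, then run the same extraction on what is left. The sample string is made up from the two fragments above, the name cleaned is only illustrative, and the sketch assumes the blocks are never nested and always have a closing tag.

```r
library(gsubfn)  # provides strapplyc

# Hypothetical sample input: one wanted PMID followed by one inside a
# CommentsCorrectionsList block (assembled from the fragments in the question).
startingString <- paste0(
  "<PMID Version=\"1\">30556505</PMID> ",
  "<CommentsCorrectionsList> <CommentsCorrections RefType=\"CommentIn\"> ",
  "<RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource> ",
  "<PMID Version=\"1\">30641052</PMID> ",
  "</CommentsCorrections> </CommentsCorrectionsList>"
)

# Step 1: delete each CommentsCorrectionsList block.
# (?s) lets "." match newlines in case the XML spans several lines;
# ".*?" is non-greedy so separate blocks are removed one at a time.
cleaned <- gsub(
  "(?s)<CommentsCorrectionsList>.*?</CommentsCorrectionsList>",
  "",
  startingString,
  perl = TRUE
)

# Step 2: the original extraction, applied to the cleaned string.
pmids <- strapplyc(cleaned, "<PMID Version=\"1\">(.*?)</PMID>", simplify = c)
print(pmids)
```

With the sample above, only 30556505 should remain. This sidesteps the need for a lookaround in the PMID pattern itself, at the cost of a second pass over the string.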