R regular expression of XML tag that's NOT in another tag

Question

I'm trying to extract PMID values, which are PubMed journal identifiers. A typical one looks like: <PMID Version=\"1\">30556505</PMID>

I extract that with:

library(gsubfn)  # provides strapplyc

strapplyc(startingString, "<PMID Version=\"1\">(.*?)</PMID>", simplify = c)

I use strapplyc because there could be several of those PMID values in the XML string. However, I do not want some of them, specifically those wrapped in a comments/corrections tag, for example:

<CommentsCorrectionsList> <CommentsCorrections RefType=\"CommentIn\"> <RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource> <PMID Version=\"1\">30641052</PMID> </CommentsCorrections> </CommentsCorrectionsList>

How would the regular expression need to be changed to ignore the PMID values inside the CommentsCorrectionsList tag?

The packages are: gsubfn for strapplyc

It would be helpful if you format the code more legibly and include what libraries you're using (like where `strapplyc` comes from). It's also generally a better/safer idea to use actual xml-parser functions to parse XML instead of trying to brute force it with regex. Also, what's the significance of the backslashes in your tag attributes? — camille, Sep 17 '19 at 20:43
@camille: I've attached the packages. Regarding your XML-parser, I would agree but the ones that came in the package were fussy so I decided to build my own. At this point it's almost the challenge of getting it right that's exciting. — user1357015, Sep 17 '19 at 20:45
Regex isn't the right tool to parse XML/HTML (see https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Try using the `xml2` package with XPath instead. — yusuzech, Sep 17 '19 at 20:51
Use a regex to match _both_. When you match the one you don't want, just ignore it. You can tell based on which group matched. It probably needs a callback. If you want to use regex, the engine has to support this, as Perl, PCRE, Java or JS do at a minimum. Let me know whether both the regex engine and the R language can do this, and I'll show you how. — , Sep 17 '19 at 21:43
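
As a rough sketch of the match-both idea in the last comment, base R's PCRE support (`perl = TRUE`) can match either a whole CommentsCorrectionsList block or a standalone PMID tag, and then keep only the matches in which the capture group took part. The variable names and the sample string below are assumptions for illustration, and the sketch presumes that CommentsCorrectionsList blocks are not nested and that the tag text is exactly as in the question:

```r
# Hypothetical sample input built from the snippets in the question.
xml_string <- paste0(
  "<PMID Version=\"1\">30556505</PMID>",
  "<CommentsCorrectionsList><CommentsCorrections RefType=\"CommentIn\">",
  "<PMID Version=\"1\">30641052</PMID>",
  "</CommentsCorrections></CommentsCorrectionsList>"
)

# First alternative consumes a whole unwanted block (no capture group).
# Second alternative captures the value of a PMID outside such a block.
pattern <- paste0(
  "(?s)<CommentsCorrectionsList>.*?</CommentsCorrectionsList>",
  "|<PMID Version=\"1\">(.*?)</PMID>"
)

m <- gregexpr(pattern, xml_string, perl = TRUE)[[1]]
starts <- attr(m, "capture.start")[, 1]
lens <- attr(m, "capture.length")[, 1]

# The capture group has a positive start only when the second alternative matched.
keep <- starts > 0
pmids <- substring(xml_string, starts[keep], starts[keep] + lens[keep] - 1)
print(pmids)
```

This still breaks as soon as the markup varies (extra attributes, different whitespace), which is the reason the other comments recommend a real XML parser.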

G. Grothendieck · Answer 1 · 2019-09-25T03:57:58.313

If we have a well-formed XML document, we would normally use the XML or xml2 package to parse it. The question shows only snippets, and the actual format matters, but as an example let us assume the input has the format shown in the Note at the end. That is, each PMID tag that we want sits directly under the root, and the unwanted ones are nested more than one level down. Then:

library(magrittr)
library(xml2)

Lines %>%
  read_xml %>%
  xml_find_all("./PMID") %>%
  xml_text
## [1] "30556505"
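
If the wanted PMID tags are not direct children of the root, a variation is to select every PMID element that has no CommentsCorrectionsList ancestor. The following is a minimal sketch under that assumption; the `<Record>` wrapper element and the variable names are hypothetical and not taken from the question:

```r
library(xml2)

# Hypothetical input: the wanted PMID is nested one level down,
# so "./PMID" relative to the root would not find it.
doc <- read_xml(
  "<root>
     <Record>
       <PMID Version=\"1\">30556505</PMID>
       <CommentsCorrectionsList>
         <CommentsCorrections RefType=\"CommentIn\">
           <PMID Version=\"1\">30641052</PMID>
         </CommentsCorrections>
       </CommentsCorrectionsList>
     </Record>
   </root>"
)

# Keep PMID elements at any depth unless they sit inside a CommentsCorrectionsList.
pmids <- xml_text(xml_find_all(doc, "//PMID[not(ancestor::CommentsCorrectionsList)]"))
print(pmids)
## [1] "30556505"
```

The `ancestor::` axis is part of XPath 1.0, so the same expression should carry over to other XPath-based parsers.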

Alternatively, there are a number of R packages for accessing PubMed, including easyPubMed, pubmed.mineR, rentrez and RISmed on CRAN, annotate on Bioconductor, and Rcupcake on GitHub.

Note

Assumed input:

Lines <- 
"<root>
<PMID Version=\"1\">30556505</PMID>
<CommentsCorrectionsList>
<CommentsCorrections RefType=\"CommentIn\">
<RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource>
<PMID Version=\"1\">30641052</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</root>"
