I am new to R. I have downloaded the XML with all Bioprojects from the NCBI. The file is 1GB in size. I started with this:

library(XML)
setwd("C://Users/USER/Desktop/")
xmlfile = xmlParse("bioproject.xml")
root = xmlRoot(xmlfile)
xmlName(root)
[1] "PackageSet"
xmlSize(root)
[1] 357935

So there are 357935 projects in the NCBI. Here I'm looking at project 34:

> root[[34]]
<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA44" archive="NCBI" id="44"/>
      </ProjectID>
      <ProjectDescr>
        <Name>Bartonella quintana str. Toulouse</Name>
        <Title>Causes bacillary angiomatosis</Title>
        <Description>&lt;P&gt;&lt;B&gt;&lt;I&gt;Bartonella quintana&lt;/I&gt; str. Toulouse&lt;/B&gt;. &lt;I&gt;Bartonella quintana&lt;/I&gt; str. Toulouse was isolated from human blood in Toulouse, France in 1993. There is evidence of extensive genome reduction in comparison to other &lt;I&gt;Bartonella&lt;/I&gt; species which may be associated with the limited host range of &lt;I&gt;Bartonella quintana&lt;/I&gt;.</Description>
        <ExternalLink category="Other Databases" label="GOLD">
          <URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
        </ExternalLink>
        <Publication date="2004-06-24T00:00:00Z" id="15210978" status="ePublished">
          <Reference/>
          <DbType>ePubmed</DbType>
        </Publication>
        <ProjectReleaseDate>2004-06-25T00:00:00Z</ProjectReleaseDate>
        <LocusTagPrefix assembly_id="GCA_000046685" biosample_id="SAMEA3138248">BQ</LocusTagPrefix>
      </ProjectDescr>   
      <ProjectType>
        ...
        ...
      </ProjectType>
    </Project>
    <Submission submitted="2003-03-20">
      ...
      ...
    </Submission>
    <ProjectLinks>
      ...
      ...
    </ProjectLinks>
  </Project>
</Package>

What I need is to obtain ALL the <ProjectID> values (in this case, PRJNA44) in the entire XML file, ONLY IF the <Description> within <ProjectDescr> of each project contains the text "isolated from human blood" (or "blood", if this makes the script simpler). Alternatively, if it is simpler, instead of obtaining the ProjectID, I can obtain the <URL> value within <ExternalLink> within <ProjectDescr>.

I don't know how (or whether) to use the xpath function (or xpathApply or getNodeSet or xpathSApply). Thank you for the help.

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. Incomplete data with "..." isn't helpful because R can't parse that. And not having any records that don't match makes it difficult to test whether the code is doing what it's supposed to do. – MrFlick Mar 18 '19 at 17:14

1 Answer

This is a pretty straightforward problem with plenty of examples out there.
I find the syntax of the "xml2" package easier to use than that of the "XML" package.

The sample above has a Project node nested inside another node that is also named Project, which could cause problems when trying to select this node. To find the correct node, I select the Project node that is a child of another Project node.
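To illustrate the nesting issue, here is a minimal sketch on a tiny inline document. The sample XML and the accession "PRJNA_EXAMPLE" are made up for illustration and only mimic the structure shown in the question; the real file may differ.

```r
library(xml2)

# Hypothetical two-package sample mirroring the nesting in the question
sample_xml <- '<PackageSet>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA44"/></ProjectID>
  </Project></Project></Package>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA_EXAMPLE"/></ProjectID>
  </Project></Project></Package>
</PackageSet>'
doc <- read_xml(sample_xml)

# ".//Project" matches both the outer and the inner Project nodes
all_projects <- xml_find_all(doc, ".//Project")

# ".//Project/Project" matches only a Project that is a child of a Project
inner_projects <- xml_find_all(doc, ".//Project/Project")

length(all_projects)    # 4
length(inner_projects)  # 2
```

Selecting with ".//Project" alone would return every project twice (once per nesting level), which is why the more specific path is used.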

library(xml2)
library(dplyr)

# read the XML document
page <- read_xml("bioproject.xml")

# find all of the inner Project nodes (a Project nested inside a Project)
projectnodes <- xml_find_all(page, ".//Project/Project")

# loop through all of the nodes and extract the requested information
dfs <- lapply(projectnodes, function(node) {
   # find the description text (NA if the node has no Description)
   description <- xml_find_first(node, ".//Description") %>% xml_text()
   # find the first URL link
   link <- xml_find_first(node, ".//URL") %>% xml_text()
   # find the project ID, stored in the accession attribute of ArchiveID
   projid <- xml_find_first(node, ".//ArchiveID") %>% xml_attr("accession")
   # return the data as a single-row data frame
   data.frame(projid, link, description, stringsAsFactors = FALSE)
})

# bind all of the rows together into a single final data frame
answer <- bind_rows(dfs)

# keep the rows whose description matches the keyword (grep uses regular expressions)
answer[grep("blood", answer$description), ]
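As a possible alternative to looping and then filtering, the text match can be pushed into the XPath expression itself with the XPath 1.0 `contains()` function. This is only a sketch: the inline sample below is invented (including the non-matching record and the accession "PRJNA_EXAMPLE") and assumes the same element layout as the question, with no XML namespaces.

```r
library(xml2)

# Hypothetical sample: one matching and one non-matching project
sample_xml <- '<PackageSet>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA44" archive="NCBI" id="44"/></ProjectID>
    <ProjectDescr>
      <Description>Strain was isolated from human blood in Toulouse.</Description>
    </ProjectDescr>
  </Project></Project></Package>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA_EXAMPLE" archive="NCBI" id="0"/></ProjectID>
    <ProjectDescr>
      <Description>Strain was isolated from soil.</Description>
    </ProjectDescr>
  </Project></Project></Package>
</PackageSet>'
doc <- read_xml(sample_xml)

# keep only inner Project nodes whose Description contains the phrase
hits <- xml_find_all(
  doc,
  ".//Project/Project[ProjectDescr/Description[contains(., 'isolated from human blood')]]"
)

# pull the accession attribute from each matching project
ids <- xml_attr(xml_find_first(hits, "./ProjectID/ArchiveID"), "accession")
ids
```

On this sample, `ids` contains only "PRJNA44". Note that `contains()` is a case-sensitive plain substring match, unlike `grep()`, which interprets its pattern as a regular expression.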
Dave2e