I am new to R. I have downloaded the XML with all Bioprojects from the NCBI. The file is 1GB in size. I started with this:
library(XML)
setwd("C:/Users/USER/Desktop/")
xmlfile = xmlParse("bioproject.xml")
root = xmlRoot(xmlfile)
xmlName(root)
[1] "PackageSet"
xmlSize(root)
[1] 357935
So there are 357935 projects in the NCBI. Here I'm looking at project 34:
> root[[34]]
<Package>
<Project>
<Project>
<ProjectID>
<ArchiveID accession="PRJNA44" archive="NCBI" id="44"/>
</ProjectID>
<ProjectDescr>
<Name>Bartonella quintana str. Toulouse</Name>
<Title>Causes bacillary angiomatosis</Title>
<Description><P><B><I>Bartonella quintana</I> str. Toulouse</B>. <I>Bartonella quintana</I> str. Toulouse was isolated from human blood in Toulouse, France in 1993. There is evidence of extensive genome reduction in comparison to other <I>Bartonella</I> species which may be associated with the limited host range of <I>Bartonella quintana</I>.</Description>
<ExternalLink category="Other Databases" label="GOLD">
<URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
</ExternalLink>
<Publication date="2004-06-24T00:00:00Z" id="15210978" status="ePublished">
<Reference/>
<DbType>ePubmed</DbType>
</Publication>
<ProjectReleaseDate>2004-06-25T00:00:00Z</ProjectReleaseDate>
<LocusTagPrefix assembly_id="GCA_000046685" biosample_id="SAMEA3138248">BQ</LocusTagPrefix>
</ProjectDescr>
<ProjectType>
...
...
</ProjectType>
</Project>
<Submission submitted="2003-03-20">
...
...
</Submission>
<ProjectLinks>
...
...
</ProjectLinks>
</Project>
</Package>
What I need is to obtain ALL the <ProjectID> values (in this case, PRJNA44) in the entire XML file, but ONLY for projects whose <Description> (within <ProjectDescr>) contains the text "isolated from human blood" (or just "blood", if that makes the script simpler). Alternatively, if it is simpler, instead of the ProjectID I can obtain the <URL> value within <ExternalLink> within <ProjectDescr>.
I don't know how (or whether) to use the xpath function (or xpathApply, getNodeSet, or xpathSApply). Thank you for the help.
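To make the goal concrete, here is a minimal sketch of the kind of XPath query I imagine, run on a tiny stand-in document rather than the real 1 GB file. The stand-in XML (including the second project, PRJNA45, and both description texts) and the variable names are made up for illustration. I am assuming that the accession lives in the accession attribute of <ArchiveID>, as in the project shown above, and that XPath's contains() (which is case-sensitive) is the right test for the description text.

```r
library(XML)

# Made-up miniature of the PackageSet structure shown above (illustration only)
doc_text <- '<PackageSet>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA44" archive="NCBI" id="44"/></ProjectID>
    <ProjectDescr>
      <Description>The strain was isolated from human blood in Toulouse.</Description>
      <ExternalLink category="Other Databases" label="GOLD">
        <URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
      </ExternalLink>
    </ProjectDescr>
  </Project></Project></Package>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA45" archive="NCBI" id="45"/></ProjectID>
    <ProjectDescr><Description>The strain was isolated from soil.</Description></ProjectDescr>
  </Project></Project></Package>
</PackageSet>'

doc <- xmlParse(doc_text, asText = TRUE)

# Select the ArchiveID of every Project whose Description contains the phrase.
# Only the inner <Project> has a <ProjectDescr> child, so the outer one is skipped.
id_query <- paste0(
  "//Project[ProjectDescr/Description[contains(., 'isolated from human blood')]]",
  "/ProjectID/ArchiveID"
)
accessions <- xpathSApply(doc, id_query, xmlGetAttr, "accession")

# Alternative: the external link URL(s) of the same projects
url_query <- paste0(
  "//ProjectDescr[Description[contains(., 'isolated from human blood')]]",
  "/ExternalLink/URL"
)
urls <- xpathSApply(doc, url_query, xmlValue)

print(accessions)
print(urls)
```

On the real file I would presumably replace doc with the xmlfile object parsed above. Is this the right direction, or is there a better way to do it for a file this size?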