This is my first post to Stack, so please let me know if I haven't included enough information. I have looked through many of the other answered questions and tried several of their solutions, which is how I ended up here.
I am having trouble getting the data out of a series of around 800 XML files. I would like to end up with the following data frame:
Model                             Species                      PubChemID (from "rdf:li")
Abiotrophia_defectiva_ATCC_49176  M_10fthf__91__c__93__        122347
Abiotrophia_defectiva_ATCC_49176  M_10m3hddcaACP__91__c__93__  N/A
I can strip the rest of the URL from the PubChemID once it's in a data frame.
The data should come from the following XML example:
<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2" xmlns:groups="http://www.sbml.org/sbml/level3/version1/groups/version1" level="3" version="1" fbc:required="false" groups:required="false">
<model metaid="Abiotrophia_defectiva_ATCC_49176" id="Abiotrophia_defectiva_ATCC_49176" name="Abiotrophia defectiva ATCC 49176" fbc:strict="true">
<notes>
<body xmlns="http://www.w3.org/1999/xhtml">
<div>
<h1>Abiotrophia_defectiva_ATCC_49176</h1>
<h2>Description</h2>
<p>This is a metabolism reconstruction of Abiotrophia defectiva ATCC 49176</p>1.03<p>Authors: Stefania Magnusdottir, Almut Heinken, Laura Kutt, Dmitry A. Ravcheev, Eugen Bauer, Alberto Noronha, Kacy Greenhalgh, Christian Jaeger, Joanna Baginska, Paul Wilmes, Ronan M.T. Fleming, and Ines Thiele.</p>
<h3>Draft information</h3>
<p>
<ul>
<li> PubSEED ID: Abiotrophia defectiva ATCC 49176 (592010.4)</li>
<li> Draft reconstruction ID: Seed592010_4_124632</li>
<li> Draft platform: ModelSEED</li>
<li> Draft created: 7/1/2014</li>
</ul>
</p>
<p>This work is licensed under a <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.</p>
<p>When using this model in your research works, please cite: Magnusdottir et al., Generation of genome-scale metabolic reconstructions for 773 members of the human gut microbiota, Nat Biotechnol, 2016.</p></div>
</body>
</notes>
<annotation>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
<rdf:Description rdf:about="#Abiotrophia_defectiva_ATCC_49176">
<bqbiol:is>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/taxonomy/592010"/>
</rdf:Bag>
</bqbiol:is>
</rdf:Description>
</rdf:RDF>
</annotation>
<listOfUnitDefinitions>
<unitDefinition id="mmol_per_gDW_per_hr">
<listOfUnits>
<unit kind="mole" exponent="1" scale="-3" multiplier="1"/>
<unit kind="gram" exponent="-1" scale="0" multiplier="1"/>
<unit kind="second" exponent="-1" scale="0" multiplier="3600"/>
</listOfUnits>
</unitDefinition>
</listOfUnitDefinitions>
<listOfCompartments>
<compartment metaid="c" id="c" name="Cytoplasm" constant="false"/>
<compartment metaid="e" id="e" name="Extracellular" constant="false"/>
</listOfCompartments>
<listOfSpecies>
<species metaid="M_10fthf__91__c__93__" id="M_10fthf__91__c__93__" name="10-Formyltetrahydrofolate" compartment="c" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-2" fbc:chemicalFormula="C20H21N7O7">
<annotation xmlns:sbml="http://www.sbml.org/sbml/level3/version1/core">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
<rdf:Description rdf:about="#M_10fthf__91__c__93__">
<bqbiol:is>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/hmdb/HMDB00972"/>
<rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00234"/>
<rdf:li rdf:resource="http://identifiers.org/pubchem.compound/122347"/>
</rdf:Bag>
</bqbiol:is>
</rdf:Description>
</rdf:RDF>
</annotation>
</species>
<species metaid="M_10m3hddcaACP__91__c__93__" id="M_10m3hddcaACP__91__c__93__" name="10-methyl-3-hydroxy-dodecanoyl-ACP" compartment="c" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-1" fbc:chemicalFormula="C24H45N2O9PRS"/>
<species metaid="M_10m3hundecACP__91__c__93__" id="M_10m3hundecACP__91__c__93__" name="10-methyl-3-hydroxy-undecanoyl-ACP" compartment="c" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-1" fbc:chemicalFormula="C23H43N2O9PRS"/>
<species metaid="M_10m3oddcaACP__91__c__93__" id="M_10m3oddcaACP__91__c__93__" name="10-methyl-3-oxo-dodecanoyl-ACP" compartment="c" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-1" fbc:chemicalFormula="C24H43N2O9PRS"/>
</listOfSpecies>
</model>
</sbml>
I have managed to convert the file to a list in R and to pull out the first element I need (the model name), which should work the same way for the other 800 XML files:
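library(XML)  # xmlToList() is provided by the XML package, not xml2
list <- xmlToList("StackExample.xml")
# model name, stored in the <h1> of the notes block
list[["model"]][["notes"]][["body"]][["div"]][["h1"]]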
I can also get all of the species out, but some of the nodes contain an extra level of hierarchy (the annotation block), and that has me a bit baffled.
species.list <- list$model$listOfSpecies
specieslist <- lapply(species.list, '[[', 1)
How does one add an if/else type function into "lapply" so that it looks for the "rdf:resource" attribute of the "rdf:li" nodes in the additional hierarchy, and returns N/A when a species has none?
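One possible way to sidestep the if/else altogether is to query the document with XPath through xml2 instead of converting it with xmlToList. The following is only a minimal sketch under a few assumptions: the helper name extract_pubchem and the namespace prefix "s" are made up for illustration, the namespace URIs are copied from the example file, and the inline XML is a trimmed-down stand-in for a real model file. It relies on xml_find_first returning a missing node for species without a matching "rdf:li", which xml_attr then turns into NA.

```r
library(xml2)

# Namespace prefixes for the XPath queries. The URIs come from the example
# file; the prefix "s" is an arbitrary label chosen for this sketch.
ns <- c(
  s   = "http://www.sbml.org/sbml/level3/version1/core",
  rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
)

# Hypothetical helper: returns one row per species, with NA in PubChemID
# when the species has no pubchem.compound link.
extract_pubchem <- function(x) {
  doc     <- read_xml(x)  # x may be a file path or a string of XML
  model   <- xml_attr(xml_find_first(doc, "/s:sbml/s:model", ns), "id")
  species <- xml_find_all(doc, "//s:listOfSpecies/s:species", ns)

  # One result per species; species without a match yield a missing node.
  li <- xml_find_first(
    species,
    ".//rdf:li[contains(@rdf:resource, 'pubchem.compound')]",
    ns
  )
  url <- xml_attr(li, "resource")  # NA for the missing nodes

  data.frame(
    Model     = rep(model, length(species)),
    Species   = xml_attr(species, "id"),
    PubChemID = sub(".*/", "", url),  # keep only the part after the last "/"
    stringsAsFactors = FALSE
  )
}

# Trimmed-down stand-in for one model file (assumption: real files follow
# the same layout as the example above).
example <- '<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
<model id="Abiotrophia_defectiva_ATCC_49176">
<listOfSpecies>
<species id="M_10fthf__91__c__93__">
<annotation>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
<rdf:Description rdf:about="#M_10fthf__91__c__93__">
<bqbiol:is>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00234"/>
<rdf:li rdf:resource="http://identifiers.org/pubchem.compound/122347"/>
</rdf:Bag>
</bqbiol:is>
</rdf:Description>
</rdf:RDF>
</annotation>
</species>
<species id="M_10m3hddcaACP__91__c__93__"/>
</listOfSpecies>
</model>
</sbml>'

result <- extract_pubchem(example)
print(result)
```

For a real file, the same call would take the path instead, for example extract_pubchem("StackExample.xml").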
Lastly, I am fairly sure that applying the resulting script to the remainder of the files should be doable.
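For the batch step, a rough sketch could look like the following. It assumes all files sit in one directory and end in ".xml"; collect_models and model_id are hypothetical names, and model_id is only a stand-in for whatever per-file function ends up doing the real extraction. The two tiny files written to a temporary directory exist purely so the sketch can run on its own.

```r
library(xml2)

# Hypothetical helper: run `parse_one` on every .xml file in `dir` and
# stack the resulting data frames into one.
collect_models <- function(dir, parse_one) {
  files  <- list.files(dir, pattern = "\\.xml$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    # tryCatch keeps one malformed file from aborting the whole batch.
    tryCatch(parse_one(f), error = function(e) {
      warning("Skipping ", basename(f), ": ", conditionMessage(e))
      NULL
    })
  })
  tables <- Filter(Negate(is.null), tables)  # drop the files that failed
  do.call(rbind, tables)
}

# Stand-in per-file function: returns just the file name and the model id.
model_id <- function(f) {
  doc   <- read_xml(f)
  model <- xml_find_first(doc, "//*[local-name()='model']")
  data.frame(
    File  = basename(f),
    Model = xml_attr(model, "id"),
    stringsAsFactors = FALSE
  )
}

# Demo data: two minimal files in a temporary directory.
demo_dir <- tempfile("sbml_demo")
dir.create(demo_dir)
for (id in c("model_A", "model_B")) {
  writeLines(
    sprintf(
      '<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core"><model id="%s"/></sbml>',
      id
    ),
    file.path(demo_dir, paste0(id, ".xml"))
  )
}

all_models <- collect_models(demo_dir, model_id)
print(all_models)
```

With around 800 files, do.call(rbind, ...) on a list of data frames should still be manageable, although that is a guess rather than something measured here.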
Thanks