I have a problem scraping information from an XML document (http://www.bundestag.de/xml/mdb/index.xml).
<mdbUebersicht>
<dokumentInfo>
<dokumentURL/>
<dokumentStand/>
</dokumentInfo>
<deleteRestore>
<deleteFlag>0</deleteFlag>
<deleteDate>20131202170000</deleteDate>
</deleteRestore>
<mdbs>
<mdb fraktion="Die Linke">
<mdbID status="Aktiv">1627</mdbID>
<mdbName status="Aktiv">Aken, Jan van</mdbName>
<mdbBioURL>
http://www.bundestag.de/abgeordnete18/biografien/A/aken_jan/258124
</mdbBioURL>
<mdbInfoXMLURL>
http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
</mdbInfoXMLURL>
<mdbInfoXMLURLMitmischen>/biografien/A/aken_jan.xml</mdbInfoXMLURLMitmischen>
<mdbLand>Hamburg</mdbLand>
<mdbFotoURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/177/265/83abda4f387877a2b5eeedbfd81e8eba/Yc/aken_jan_gross.jpg
</mdbFotoURL>
<mdbFotoGrossURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/316/475/83abda4f387877a2b5eeedbfd81e8eba/Uq/aken_jan_gross.jpg
</mdbFotoGrossURL>
<mdbFotoLastChanged>24.10.2016</mdbFotoLastChanged>
<mdbFotoChangedDateTime>24.10.2016 12:17</mdbFotoChangedDateTime>
<lastChanged>30.09.2016</lastChanged>
<changedDateTime>30.09.2016 12:38</changedDateTime>
</mdb>
The document contains short biographical entries for many different people. Among other things, each entry holds a URL to another XML document that contains a more detailed biography.
This is what I have tried so far to get the information:
First I extract the URLs of all sub-documents from the main document:
mdb_url <- xml_text(xml_find_all(xmlDocu, "//mdbInfoXMLURL"))
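xmlDocu is not defined in the line above. A minimal sketch of that step, assuming the xml2 package and that xmlDocu is the parsed index document. The inline XML is a shortened copy of the excerpt shown earlier, and trim = TRUE is only needed if the URLs really are surrounded by whitespace, as they appear in the excerpt:

```r
library(xml2)

# Shortened inline copy of the index; the real document would presumably be
# read with read_xml("http://www.bundestag.de/xml/mdb/index.xml") (assumption).
xmlDocu <- read_xml(
  "<mdbUebersicht><mdbs><mdb fraktion=\"Die Linke\">
     <mdbInfoXMLURL>
       http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
     </mdbInfoXMLURL>
   </mdb></mdbs></mdbUebersicht>"
)

# trim = TRUE strips leading and trailing whitespace from each node's text,
# so the result can be passed to download.file() and basename() directly.
mdb_url <- xml_text(xml_find_all(xmlDocu, "//mdbInfoXMLURL"), trim = TRUE)
```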
Then I wrote a for loop that downloads all XML files into my working directory:
for (url in mdb_url) {
  download.file(url, destfile = basename(url))
}
Afterwards I want to get a list of the files...
files <- list.files(pattern = "\\.xml$")  # pattern is a regex: escape the dot and anchor at the end
... to get a specific node of every xml doc:
Bio1 <- files[1]
xmlfile <- read_xml(Bio1)
mdb_ausschuss1 <- xml_text(xml_find_all(xmlfile, "//gremiumName"))
My problem now is: how can I do this for all XML files in the list? I haven't been able to write a working loop or script for that task.
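To make the goal concrete, here is a minimal sketch of the kind of loop I am after, assuming the xml2 package. The two small files written to a temporary directory are made-up stand-ins for the downloaded biographies, and extract_gremien is a hypothetical helper name, not part of any library:

```r
library(xml2)

# Made-up stand-ins for two downloaded biography files (assumption).
tmp_dir <- tempfile("mdb_")
dir.create(tmp_dir)
writeLines(
  "<mdb><gremiumName>Ausschuss A</gremiumName><gremiumName>Ausschuss B</gremiumName></mdb>",
  file.path(tmp_dir, "aken_jan.xml")
)
writeLines(
  "<mdb><gremiumName>Ausschuss C</gremiumName></mdb>",
  file.path(tmp_dir, "example_two.xml")
)

# Hypothetical helper: read one file and return the text of its gremiumName nodes.
extract_gremien <- function(path) {
  doc <- read_xml(path)
  xml_text(xml_find_all(doc, "//gremiumName"))
}

# full.names = TRUE keeps the directory in each path so read_xml() can find the file.
files <- list.files(tmp_dir, pattern = "\\.xml$", full.names = TRUE)

# Apply the helper to every file: one character vector per file,
# named after the file it came from.
mdb_ausschuss <- lapply(files, extract_gremien)
names(mdb_ausschuss) <- basename(files)
```

With the real data, files would instead come from list.files() on the download directory, and mdb_ausschuss[["aken_jan.xml"]] would then hold the committee names for that one person.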