Long-time lurker, first post (be gentle). I am trying to make a "tidy" R data frame from a complex XML file. I have had partial success, but I can't figure out one step because of my unfamiliarity with R. I don't think it is complicated, but I can't for the life of me get past it. (I have done multiple Google searches, looked through Stack Overflow many times, and tried many things over 4 days, all without success.)
Extract parts of the XML file:
    library(XML)
    mss <- xmlParse("BITECA.toy.XML")
    xxx <- xmlToDataFrame(
      nodes = getNodeSet(mss, "//*/MsEd/MsEdID | //*/GeoMilestoneInfo/Dates"),
      collectNames = FALSE,
      stringsAsFactors = TRUE
    )
write.table to a text file yields:

    "Bibliography" "Type"  "IDNo" "text"
    "BITECA"       "manid" "1086" NA
    NA             NA      NA     "1351 - 1400 (Bohigas i Riera)"
    NA             NA      NA     "1341 - 1360 (Lola Badia)"
    NA             NA      NA     "1401 - 1450 (Panunzio)"
    "BITECA"       "manid" "2744" NA
    NA             NA      NA     "1701 - 1800"
My problem is how to fill in the NAs with repeats of the node identifiers to obtain the tidier data frame that I need. (Further processing is needed, but I think I know how to do it.)

    "Bibliography" "IDNo" "text"
    "BITECA"       "1086" "1351 - 1400 (Bohigas i Riera)"
    "BITECA"       "1086" "1341 - 1360 (Lola Badia)"
    "BITECA"       "1086" "1401 - 1450 (Panunzio)"
    "BITECA"       "2744" "1701 - 1800"
I wonder if this is one of those things that would require a 5 minute conversation with an R expert? Any help would be greatly appreciated! Thank you - pfs
EDITS
(a) in response to the request below, the file parsed in step 1 (BITECA.toy.XML) is here https://www.dropbox.com/s/6fs0usac2l1m76m/BITECA.toy.xml?dl=0
(b) clarification - the full XML file has thousands of "manid" entries, not just the several shown below