Long-time lurker, first post (be gentle). I am trying to build a "tidy" R data frame from a complex XML file. I have had partial success, but I am stuck on one step because I am unfamiliar with R. I don't think it is complicated, but I can't for the life of me get past it. (I have spent four days on Google searches, Stack Overflow posts, and trial and error, without success.)

  1. Extract parts of the XML file:

    library(XML)
    mss <- xmlParse("BITECA.toy.XML")
    xxx <- xmlToDataFrame(nodes = getNodeSet(mss, "//*/MsEd/MsEdID | //*/GeoMilestoneInfo/Dates"), collectNames=FALSE, stringsAsFactors = TRUE)
    
  2. write.table to a text file yields:

    "Bibliography"  "Type"  "IDNo"  "text"
    "BITECA"    "manid" "1086"  NA
    NA  NA  NA  "1351 - 1400 (Bohigas i Riera)"
    NA  NA  NA  "1341 - 1360 (Lola Badia)"
    NA  NA  NA  "1401 - 1450 (Panunzio)"
    "BITECA"    "manid" "2744"  NA
    NA  NA  NA  "1701 - 1800"
    
  3. My problem is how to fill in the NAs with repeats of the node identifiers to obtain the tidier data frame that I need. (Further processing is needed, but I think I know how to do it.)

    "Bibliography"  "IDNo"  "text"
    "BITECA"    "1086"  "1351 - 1400 (Bohigas i Riera)"
    "BITECA"    "1086"  "1341 - 1360 (Lola Badia)"
    "BITECA"    "1086"  "1401 - 1450 (Panunzio)"
    "BITECA"    "2744"  "1701 - 1800"
    

I wonder if this is one of those things that would require a 5 minute conversation with an R expert? Any help would be greatly appreciated! Thank you - pfs

EDITS
(a) in response to the request below, the file parsed in step 1 (BITECA.toy.XML) is here https://www.dropbox.com/s/6fs0usac2l1m76m/BITECA.toy.xml?dl=0
(b) clarification - the full XML file has thousands of "manid" entries, not just the several shown below

2 Answers

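For the third step, you can use na.locf ("last observation carried forward") from the zoo package on the identifier columns, then drop the rows that carry no date text:

    library(zoo)
    id_cols <- c("Bibliography", "Type", "IDNo")
    # carry each identifier forward into the NA rows below it
    df[id_cols] <- lapply(df[id_cols], na.locf, na.rm = FALSE)
    # keep only the rows that actually hold a date
    df[!is.na(df$text), ]
    #  Bibliography  Type IDNo                          text
    #2       BITECA manid 1086 1351 - 1400 (Bohigas i Riera)
    #3       BITECA manid 1086      1341 - 1360 (Lola Badia)
    #4       BITECA manid 1086        1401 - 1450 (Panunzio)
    #6       BITECA manid 2744                   1701 - 1800

In the first three columns, each NA belongs to the identifier row above it, so those columns are filled forward (na.locf's default direction; na.rm = FALSE keeps every column at its original length). The fourth column is left alone: a row whose text is NA is an identifier row with no date of its own, so it is dropped once its values have been copied down.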

The above works if df is your data.frame and is this object:

    df <- structure(list(
        Bibliography = structure(c(1L, NA, NA, NA, 1L, NA),
                                 .Label = "BITECA", class = "factor"),
        Type = structure(c(1L, NA, NA, NA, 1L, NA),
                         .Label = "manid", class = "factor"),
        IDNo = c(1086L, NA, NA, NA, 2744L, NA),
        text = structure(c(NA, 2L, 1L, 3L, NA, 4L),
                         .Label = c("1341 - 1360 (Lola Badia)",
                                    "1351 - 1400 (Bohigas i Riera)",
                                    "1401 - 1450 (Panunzio)",
                                    "1701 - 1800"),
                         class = "factor")),
        .Names = c("Bibliography", "Type", "IDNo", "text"),
        class = "data.frame", row.names = c(NA, -6L))
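
If you would rather not depend on zoo, the same fill can be sketched in base R. This is only a sketch under stated assumptions: fill_down is a hypothetical helper name (not part of any package), and the toy df below rebuilds the six rows from the question as plain strings rather than factors.

```r
# Toy data mirroring the question's write.table output (strings, not factors).
df <- data.frame(
  Bibliography = c("BITECA", NA, NA, NA, "BITECA", NA),
  Type         = c("manid", NA, NA, NA, "manid", NA),
  IDNo         = c(1086L, NA, NA, NA, 2744L, NA),
  text         = c(NA, "1351 - 1400 (Bohigas i Riera)",
                   "1341 - 1360 (Lola Badia)", "1401 - 1450 (Panunzio)",
                   NA, "1701 - 1800"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last non-NA value forward.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))   # index of the most recent non-NA value
  idx[idx == 0] <- NA        # leading NAs have nothing to inherit
  x[!is.na(x)][idx]
}

# Fill only the identifier columns; leave the text column untouched.
id_cols <- c("Bibliography", "Type", "IDNo")
df[id_cols] <- lapply(df[id_cols], fill_down)

# Keep the rows that hold a date and drop the now-redundant Type column.
tidy <- df[!is.na(df$text), c("Bibliography", "IDNo", "text")]
rownames(tidy) <- NULL
print(tidy)
```

Because the fill is vectorised, it should scale to the thousands of "manid" entries mentioned in the question without an explicit loop.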
nicola

Some existing questions already cover this, for example:

How to transform XML data into a data.frame?

That should help with getting the XML into a data frame. Once you have the data frame, you can use is.na(dataframe) to find the missing values and replace them.
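
As a rough illustration of the is.na idea, the sketch below uses is.na to tell identifier rows from date rows, numbers each record, and joins the two halves back together. The toy df and the names is_header, grp, headers, dates and tidy are assumptions made for this example; they reproduce the six rows shown in the question, not the real XML file.

```r
# Toy data mirroring the question's write.table output.
df <- data.frame(
  Bibliography = c("BITECA", NA, NA, NA, "BITECA", NA),
  Type         = c("manid", NA, NA, NA, "manid", NA),
  IDNo         = c(1086L, NA, NA, NA, 2744L, NA),
  text         = c(NA, "1351 - 1400 (Bohigas i Riera)",
                   "1341 - 1360 (Lola Badia)", "1401 - 1450 (Panunzio)",
                   NA, "1701 - 1800"),
  stringsAsFactors = FALSE
)

# Rows with an IDNo are identifier rows; every other row is a date row.
is_header <- !is.na(df$IDNo)

# Number the records: each identifier row starts a new group.
grp <- cumsum(is_header)

# One table of identifiers and one table of dates, both keyed by group.
headers <- data.frame(grp = grp[is_header],
                      df[is_header, c("Bibliography", "IDNo")],
                      stringsAsFactors = FALSE)
dates <- data.frame(grp = grp[!is_header],
                    text = df$text[!is_header],
                    stringsAsFactors = FALSE)

# Join each date back to the identifiers of its record.
tidy <- merge(headers, dates, by = "grp")[, c("Bibliography", "IDNo", "text")]
print(tidy)
```

The join makes the record structure explicit, which may be convenient if the further processing mentioned in the question works record by record.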

Sam
  • Thanks for the super-fast response. I have seen that post and tried to "coerce" its ideas to fit my problem. After a lengthy struggle, I still couldn't get to Step 3 above. pfs – pfsullivan Jul 19 '15 at 14:18
  • Can you post a sample of the XML file? It is easier to debug with the actual input than to guess at the output. – Sam Jul 20 '15 at 03:52