This is not what I wanted: How to parse XML to R data frame

I don't know XML well and have an XML file like this:

data <- xmlParse("file1.xml")

print(data)

# $`PAMRasterBand`
# <PAMRasterBand band="1">
#  <Metadata>
#    <MDI key="STATISTICS_MAXIMUM">0.43242582678795</MDI>
#    <MDI key="STATISTICS_MEAN">0.11312322099674</MDI>
#    <MDI key="STATISTICS_MINIMUM">-0.019055815413594</MDI>
#    <MDI key="STATISTICS_STDDEV">0.054616362290023</MDI>
#    <MDI key="STATISTICS_VALID_PERCENT">61.25</MDI>
#  </Metadata>
# </PAMRasterBand> 

I want to parse the value of "STATISTICS_MEAN" from it and turn it into a data.table or data.frame in R.

I saw some examples but couldn't work out how to apply them to this case. I want to do this in a loop for 439 files like the one above (from File1 to File439; every file has the same attributes). So, if you could help me loop it, I would be grateful.

Batuhan Kavlak
    As an aside, you should not use "data" as a variable name, as it is a function in R. Ideally, in the example I give below, you would not use function names as column names either (e.g., mean, sd). – dayne Feb 12 '19 at 14:36

2 Answers

2

Consider combining xmlToDataFrame with the internal method xmlAttrsToDataFrame (which is not exported, so it requires the triple-colon operator, :::) to return the attribute value alongside the text value for every MDI node:

library(XML)

doc <- xmlParse('/path/to/input.xml')

xmldataframe <- cbind(xmlToDataFrame(nodes=getNodeSet(doc, "//MDI")),
                      XML:::xmlAttrsToDataFrame(getNodeSet(doc, "//MDI")))
xmldataframe 
#                 text                      key
# 1   0.43242582678795       STATISTICS_MAXIMUM
# 2   0.11312322099674          STATISTICS_MEAN
# 3 -0.019055815413594       STATISTICS_MINIMUM
# 4  0.054616362290023        STATISTICS_STDDEV
# 5              61.25 STATISTICS_VALID_PERCENT
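
If only the STATISTICS_MEAN value is needed, as the question asks, an XPath predicate on the key attribute may be enough. This is a minimal sketch, assuming the MDI layout shown in the question; the inline xml_text string and the names mean_value and mean_df are made up for illustration, and in practice you would pass a file path to xmlParse instead:

```r
library(XML)

# Inline sample mirroring the structure printed in the question
# (an assumption; a real file may wrap this in further elements).
xml_text <- '<PAMRasterBand band="1">
  <Metadata>
    <MDI key="STATISTICS_MAXIMUM">0.43242582678795</MDI>
    <MDI key="STATISTICS_MEAN">0.11312322099674</MDI>
    <MDI key="STATISTICS_MINIMUM">-0.019055815413594</MDI>
    <MDI key="STATISTICS_STDDEV">0.054616362290023</MDI>
    <MDI key="STATISTICS_VALID_PERCENT">61.25</MDI>
  </Metadata>
</PAMRasterBand>'

doc <- xmlParse(xml_text, asText = TRUE)

# Select only the MDI node whose key attribute equals STATISTICS_MEAN
# and read its text content as a number.
mean_value <- as.numeric(
  xpathSApply(doc, "//MDI[@key='STATISTICS_MEAN']", xmlValue)
)

mean_df <- data.frame(key = "STATISTICS_MEAN", value = mean_value)
```

Because the node is selected by its key rather than by position, this does not depend on the order of the MDI elements in the file.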

To loop across many XML files, wrap the above in a function that receives a file path:

proc_xml <- function(f) {
    doc <- xmlParse(f)

    xmldataframe <- transform(cbind(xmlToDataFrame(nodes=getNodeSet(doc, "//MDI")),
                                    XML:::xmlAttrsToDataFrame(getNodeSet(doc, "//MDI"))), 
                              file = f)                      
    return(xmldataframe)
}

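# full.names=TRUE returns complete paths so xmlParse can find each file;
# the escaped, anchored pattern matches only names ending in ".xml"
xml_files <- list.files(path="/folder/to/xml/files", pattern="\\.xml$", full.names=TRUE)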

df_list <- lapply(xml_files, proc_xml)
final_df <- do.call(rbind, df_list)
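
The combined result is in long form (one row per MDI node per file). If one row per file is preferred, base R's reshape can pivot it. A rough sketch, assuming final_df has the text, key and file columns produced above; the small data frame here is a made-up stand-in with invented values so the snippet runs on its own:

```r
# Hypothetical stand-in for final_df with the same columns as above
# (values are invented for illustration only).
final_df <- data.frame(
  text = c("0.43", "0.11", "0.52", "0.20"),
  key  = c("STATISTICS_MAXIMUM", "STATISTICS_MEAN",
           "STATISTICS_MAXIMUM", "STATISTICS_MEAN"),
  file = c("file1.xml", "file1.xml", "file2.xml", "file2.xml"),
  stringsAsFactors = FALSE
)

# The text column may be character or factor; convert it to numeric safely.
final_df$value <- as.numeric(as.character(final_df$text))

# Pivot to one row per file, one column per key.
wide_df <- reshape(final_df[, c("file", "key", "value")],
                   idvar = "file", timevar = "key", direction = "wide")
```

The resulting columns are named value.STATISTICS_MAXIMUM, value.STATISTICS_MEAN and so on, so a single statistic can be pulled out by name.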
Parfait
1
library(XML)
library(data.table)
as.data.table(xmlToDataFrame(xml("file1.xml")))
#                MDI               NA                 NA                NA    NA
#1: 0.43242582678795 0.11312322099674 -0.019055815413594 0.054616362290023 61.25

You can write a simple wrapper, then use lapply, rbindlist, and setnames to load all the files and clean up.

loadXML <- function(x) as.data.table(xmlToDataFrame(xml(x)))
fls <- rep("test.xml", 10)
datLst <- lapply(fls, loadXML)
dat <- rbindlist(datLst)
setnames(dat, c("maximum", "mean", "minimum", "sd", "vald_perc"))

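# as.is = TRUE keeps non-numeric columns as character and avoids the
# warning newer R versions give when as.is is not specified
dat[ , lapply(.SD, type.convert, as.is = TRUE)]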
#       maximum      mean     minimum         sd vald_perc
#  1: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  2: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  3: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  4: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  5: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  6: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  7: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  8: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
#  9: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
# 10: 0.4324258 0.1131232 -0.01905582 0.05461636     61.25
dayne
  • Hi, it puts all the variables into one column when I run the rbindlist code. How can I parse them into 5 columns? – Batuhan Kavlak Feb 12 '19 at 14:57
  • But are you sure the attributes will always align in that order? Do note there is no order rule for attributes in XML per [W3C specifications](https://www.w3.org/TR/REC-xml/#sec-starttags): *the order of attribute specifications in a start-tag or empty-element tag is not significant.* – Parfait Feb 12 '19 at 16:07
  • @BatuhanKavlak I cannot speak to the issue without a reproducible example. Since we cannot play with your specific files, that is hard. I did a little test and cannot reproduce your issue. See edit. – dayne Feb 12 '19 at 16:15
  • @Parfait You are correct. If the OP's files are not identical in order, then this will not work. The wrapper would have to be a little more complicated, using `xmlToList` and parsing out the specific fields of interest. – dayne Feb 12 '19 at 16:21
  • Actually you can retrieve attributes with the internal method, `XML:::xmlAttrsToDataFrame`. – Parfait Feb 12 '19 at 16:25
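
Following up on the ordering concern raised in the comments, one possible order-independent wrapper looks each value up by its key attribute rather than by position. This is only a sketch: read_stats, stat_keys and the as_text flag are hypothetical names, it uses exported XML functions (getNodeSet, xmlValue, xmlGetAttr) instead of xmlToList, and the two inline documents with invented values stand in for real files:

```r
library(XML)

# Keys expected in every file (taken from the question's sample).
stat_keys <- c("STATISTICS_MAXIMUM", "STATISTICS_MEAN", "STATISTICS_MINIMUM",
               "STATISTICS_STDDEV", "STATISTICS_VALID_PERCENT")

# Hypothetical helper: returns a one-row data.frame with one column per key.
# as_text = TRUE lets it read an XML string instead of a file path.
read_stats <- function(x, as_text = FALSE) {
  doc <- xmlParse(x, asText = as_text)
  nodes <- getNodeSet(doc, "//MDI")
  values <- as.numeric(sapply(nodes, xmlValue))
  names(values) <- sapply(nodes, xmlGetAttr, "key")
  # Index by name so node order does not matter; a missing key becomes NA.
  as.data.frame(as.list(setNames(values[stat_keys], stat_keys)))
}

# Two in-memory documents with the MDI nodes in different orders
# (values are invented for illustration only).
xml_a <- '<Metadata>
  <MDI key="STATISTICS_MAXIMUM">0.43</MDI>
  <MDI key="STATISTICS_MEAN">0.11</MDI>
  <MDI key="STATISTICS_MINIMUM">-0.02</MDI>
  <MDI key="STATISTICS_STDDEV">0.05</MDI>
  <MDI key="STATISTICS_VALID_PERCENT">61.25</MDI>
</Metadata>'
xml_b <- '<Metadata>
  <MDI key="STATISTICS_VALID_PERCENT">70</MDI>
  <MDI key="STATISTICS_STDDEV">0.1</MDI>
  <MDI key="STATISTICS_MEAN">0.2</MDI>
  <MDI key="STATISTICS_MINIMUM">0</MDI>
  <MDI key="STATISTICS_MAXIMUM">0.5</MDI>
</Metadata>'

result <- do.call(rbind, lapply(list(xml_a, xml_b), read_stats, as_text = TRUE))
```

For real files, something like lapply(xml_files, read_stats) with full paths from list.files should work the same way, leaving as_text at its default.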