I am trying to create a data frame containing the values of a few fields extracted from an XML file. I am new to XML files and have no idea what I'm doing.

I have tried to follow the instructions posted in "How to parse XML to R data frame", but I haven't been able to get it to work. The XML is formatted like this, with many other fields (not shown) within GENESET:

<?xml version="1.0" encoding="ISO-8859-1"?>   
<MSIGDB BUILD_DATE="Jul 12, 2018" VERSION="6.2" NAME="msigdb">
      <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C3"  EXACT_SOURCE="GOID: 00098" STANDARD_NAME="AAANWWTGC_UNKNOWN"/>

Ideally, I would like each column of the data frame to hold the values of one field within GENESET (i.e. column 1 = CATEGORY_CODE; column 2 = EXACT_SOURCE). I'd also like the data frame to contain NA wherever a field is blank for a specific GENESET.

I have tried this:

require(XML)
doc <- xmlParse("msigdb_v6.2.xml")
exactSource <- as.list(xml_data[["MSIGDB"]][["GENESET"]][["EXACT_SOURCE"]])

but the output of head(exactSource) is

list()

Please help

charlie

1 Answer

Since you only need attribute values, consider the undocumented function xmlAttrsToDataFrame in the XML package.

Assume the following fuller XML example, which includes both missing and empty attributes:

<?xml version="1.0" encoding="ISO-8859-1"?>   
<MSIGDB BUILD_DATE="Jul 12, 2018" VERSION="6.2" NAME="msigdb">
      <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C3"  EXACT_SOURCE="GOID: 00096"/>
      <GENESET VALIDATION_DATASETS="" EXACT_SOURCE="GOID: 00097" STANDARD_NAME="BBBNWWTGC_UNKNOWN"/>
      <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C5"  EXACT_SOURCE="GOID: 00098" STANDARD_NAME="CCCNWWTGC_UNKNOWN"/>
      <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C6"  EXACT_SOURCE="GOID: 00099" STANDARD_NAME=""/>
      <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C7" STANDARD_NAME="EEENWWTGC_UNKNOWN"/>
</MSIGDB>

R

library(XML)

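# Parse the file, then collect the attributes of every GENESET node.
# The ::: operator reaches xmlAttrsToDataFrame inside the XML namespace.
doc <- xmlParse("msigdb_v6.2.xml")
geneset_df <- XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//GENESET'))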

geneset_df

#   VALIDATION_DATASETS CATEGORY_CODE EXACT_SOURCE      STANDARD_NAME
# 1                                C3  GOID: 00096               <NA>
# 2                              <NA>  GOID: 00097  BBBNWWTGC_UNKNOWN
# 3                                C5  GOID: 00098  CCCNWWTGC_UNKNOWN
# 4                                C6  GOID: 00099                   
# 5                                C7         <NA>  EEENWWTGC_UNKNOWN
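
Note that in the output above, attributes that are present but empty come back as empty strings rather than NA. If you want blanks treated as NA too, or would rather not depend on an undocumented function, the following is a minimal sketch that uses only exported XML functions (xpathApply and xmlAttrs). It assumes the sample XML shown earlier, pasted inline so the code runs without the file; the names xml_text, attr_list, fields and rows are just illustrative.

```r
library(XML)

# Inline copy of the sample XML (assumption: same layout as the real file)
xml_text <- '<?xml version="1.0" encoding="ISO-8859-1"?>
<MSIGDB BUILD_DATE="Jul 12, 2018" VERSION="6.2" NAME="msigdb">
  <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C3" EXACT_SOURCE="GOID: 00096"/>
  <GENESET VALIDATION_DATASETS="" EXACT_SOURCE="GOID: 00097" STANDARD_NAME="BBBNWWTGC_UNKNOWN"/>
  <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C5" EXACT_SOURCE="GOID: 00098" STANDARD_NAME="CCCNWWTGC_UNKNOWN"/>
  <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C6" EXACT_SOURCE="GOID: 00099" STANDARD_NAME=""/>
  <GENESET VALIDATION_DATASETS="" CATEGORY_CODE="C7" STANDARD_NAME="EEENWWTGC_UNKNOWN"/>
</MSIGDB>'

doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per GENESET node
attr_list <- xpathApply(doc, "//GENESET", xmlAttrs)

# Union of all attribute names, in order of first appearance
fields <- unique(unlist(lapply(attr_list, names)))

# Look up every field on every node; attributes a node lacks become NA
rows <- lapply(attr_list, function(a) unname(a[fields]))
geneset_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(geneset_df) <- fields

# Treat empty attribute values the same as missing ones
geneset_df[geneset_df == ""] <- NA
```

The last line works the same way on the data frame returned by xmlAttrsToDataFrame, so you can keep the shorter call and only add the blank-to-NA step. For the real file, swap the inline string for xmlParse("msigdb_v6.2.xml").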
Parfait