I am parsing an XML file from a web service then turning it into a data.frame. Here I have listed my sample code, which honestly is a direct copy from.
http://www.r-bloggers.com/r-and-the-web-for-beginners-part-ii-xml-in-r/
I admit I am fairly new to the use of XML files, but I need to parse it into a dataframe.
library(RCurl)
library(XML)
xml.url <-('webservice url that links to an XML document')
xml.file <- xmlTreeParse(xml.url)
xmltop <- xmlRoot(xml.file)
Data <- xmlSApply(xmltop,function(x) xmlSApply(x,xmlValue))
Data <- data.frame(t(Data),row.names=NULL)
Here is a sample of the data I am working with. I have limited this to a few columns because there are over 300 of them.
Data <- structure(list(start = structure(list(row = "05/11/2014 06:59:48 UTC",
row = "05/11/2014 06:45:59 UTC", row = "05/11/2014 06:26:16 UTC",
row = "05/11/2014 06:52:42 UTC"), .Names = c("row", "row",
"row", "row")), end = structure(list(row = "05/11/2014 14:16:23 UTC",
row = "05/11/2014 13:52:10 UTC", row = "05/11/2014 13:38:41 UTC",
row = "05/11/2014 14:34:42 UTC"), .Names = c("row", "row",
"row", "row")), today = structure(list(row = "05/11/2014", row = "05/11/2014",
row = "05/11/2014", row = "05/11/2014"), .Names = c("row",
"row", "row", "row")), Record_Name = structure(list(row = character(0),
row = character(0), row = character(0), row = character(0)), .Names = c("row",
"row", "row", "row")), Watersource_GPS_Cords = structure(list(
row = "22.503822:88.347462:0.0:26.0", row = "22.505717:88.348593:20.044726:16.0",
row = "22.503821:88.34746:0.0:27.0", row = "22.505585:88.347121:-43.040066:12.0"), .Names = c("row",
"row", "row", "row")), Description_of_location = structure(list(
row = character(0), row = "By swisspark nursing home", row = character(0),
row = character(0)), .Names = c("row", "row", "row", "row"
)), Free_chlorine_input = structure(list(row = "2.5", row = "1.36",
row = "1.1", row = character(0)), .Names = c("row", "row",
"row", "row"))), .Names = c("start", "end", "today", "Record_Name",
"Watersource_GPS_Cords", "Description_of_location", "Free_chlorine_input"
), class = "data.frame", row.names = c(NA, -4L))
Here is my sessionInfo()
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_0.6.2 shiny_0.9.1 XML_3.98-1.1 RCurl_1.95-4.1 bitops_1.0-6
loaded via a namespace (and not attached):
[1] caTools_1.17 digest_0.6.4 httpuv_1.3.0 plyr_1.8.1 Rcpp_0.11.1
[6] RJSONIO_1.2-0.2 tools_3.0.1 xtable_1.7-3
All is good except for these 0 character length vectors with in lists with in a data frame. I assume that because the parser for the XML file detects them it has to store them in the form of a list then wrap the list in a data frame or else the vectors would not be the same length and would be erroneous.I am really confused on how to elegantly deal with these things. I would prefer to set an option that turned these into either NA's or just "" and just have a data frame of vectors instead of lists if possible even format each column to the appropriate . Mainly because I need to write logical tests between columns.
I am use to dealing with a structure something like.
> Data[,"Description_of_location"]
[1]""
[2]"By swisspark nursing home"
[3]""
[4]""
Instead I get.
> Data[,"Description_of_location"]
$row
character(0)
$row
[1] "By swisspark nursing home"
$row
character(0)
$row
character(0)
Here is a sample of the XML document.
-<data version="1.0">
-<row>
<start type="JAVA_ROSA_DATETIME">05/11/2014 06:59:48 UTC</start>
<end type="JAVA_ROSA_DATETIME">05/11/2014 14:16:23 UTC</end>
<today type="JAVA_ROSA_DATE">05/11/2014</today>
-<deviceid type="STRING">
<![CDATA[358870052616368]]>
</deviceid>
-<subscriberid type="STRING">
<![CDATA[404310209661081]]>
</subscriberid>
-<simid type="STRING">
<![CDATA[89913100002096610814]]>
</simid>
-<phonenumber type="STRING">
<![CDATA[918420272664]]>
</phonenumber>
-<mobilekey type="STRING">
<![CDATA[ag9zfmRlbGFndWFtb2JpbGVyFwsSCk1vYmlsZVVuaXQYgICAgMD6-wkM]]>
</mobilekey>
-<projectkey type="STRING">
<![CDATA[ag9zfmRlbGFndWFtb2JpbGVyFAsSB1Byb2plY3QYgICAgKD9hQkM]]>
</projectkey>
-<recordid type="STRING">
<![CDATA[mannaenergy$$05082014141658$$Published&&12]]>
</recordid>
<Record_Name type="STRING"/>
<Watersource_GPS_Cords type="GEOPOINT">22.****:88.****:0.0:26.0</Watersource_GPS_Cords>
-<State_Name type="STRING">
<![CDATA[West Bengal]]>
</State_Name>
-<District_Name type="STRING">
<![CDATA[Kolkata]]>
</District_Name>
-<Block_Name type="STRING">
<![CDATA[Bikram]]>
</Block_Name>
-<Panchayat_Name type="STRING">
<![CDATA[Ashok nagar]]>
</Panchayat_Name>
-<Village_Name type="STRING">
<![CDATA[East lake]]>
</Village_Name>
-<Habitation_Name type="STRING">
<![CDATA[Merlin colony]]>
</Habitation_Name>
-<Unique_water_source_ID type="STRING">
<![CDATA[15]]>
</Unique_water_source_ID>
<Description_of_location type="STRING"/>
-<Type_of_Water_Source type="STRING">
<![CDATA[Public_tap]]>
</Type_of_Water_Source>
<Take_a_sample_for_chemical_tes type="STRING"/>
<Turbidity_TU_input type="STRING"/>
<Turbidity_FAU_input type="DECIMAL"/>
<Turbidity_FAU_range type="STRING"/>
<Warning_turb_FAU type="INTEGER"/>
<Turbidity_NTU_input type="DECIMAL">0.95</Turbidity_NTU_input>
-<Turbidity_NTU_range type="STRING">
<![CDATA[In_range]]>
</Turbidity_NTU_range>
<Warning_turb_NTU type="INTEGER"/>
I am sorry if this is information overload I tried to provide everything I could think of that would prove useful.
To summarize. I am trying to parse this XML file such that 0 length character vectors have become blank or NA elements within a data frame. Bonus if I can get the type of each column to be reflected in the structure of each column within the data frame. Hopefully this is clear enough.
Thank you very much for any and all help.