1

For tl;dr I have a simple question at the bottom:

I'm trying to turn XML files into use-able tables in R.

<toes copyright='(C)version='1.1'> <generated date='2017-01-21 
07:45:04'timestamp='1485006304'/>
   <description> Active TOE vehicle levels and adjustments for the current 
   campaign up to the RDP cycle in progress. c0 = the cycle 0 capacity, adj 
   = comma-separated list of cycle:capacity adjustments, cur = current 
   capacity </description>
   <defaults><def att='adj' value=''/></defaults>
       <r toe="deairfor" veh="22" c0="30" cur="30"/>
       <r toe="deairfor" veh="23" c0="40" cur="20" adj="1:35,2:20"/>
       <r toe="deairfor" veh="26" c0="2" cur="2" adj="2:10,3:30"/>
</toes>

My intended format is this:

"TOE" "Veh" "c0" "cur" "adj1" "adj2" "adj3"
"deairfor" 22 30 30 NA NA NA
"deairfor" 23 40 20 35 20 NA
"deairfor" 26 2 2 NA 10 30 

I have zero experience with importing XML files but I think this file is not formatted properly as I haven't encountered any XML example with data inside the tags like in < r toe "...data..."/>. I have been able to extract the data with the following:

library(XML)
source <- "http://wiretap.wwiionline.com/xml/toes.sheet.xml"
xmlfile <- xmlTreeParse(source, useInternalNodes = TRUE)
nodes <- getNodeSet(xmlfile, "/toes//r")
Df1 <- NULL

for(i in 1:length(nodes)) {
Df1 <- t(xmlToList(nodes[[i]]))
Df2 <- smartbind(Df2,Df1[1,])
}

I was only able to extract 1 row at a time, so I use the later code to bind these together. I needed the df1/2 otherwise it would error out on i = 1. Probably much easier in a different way but I couldn't get it working.

This leaves me with a dataframe Df2, with all the variables as "factor" (why?)

"TOE" "Veh" "c0" "cur" "adj"
deairfor 22 30 30 NA
deairfor 23 40 20 35 1:35,2:20
deairfor 26 2 2 2 2:10,3:30 

So the difficulty lies now in this "adj" column. I can separate it with the following:

Df2 <- separate(data = Df2, col = adj, into = c("adj1", adj2","adj3"), sep = "\\,")
Df2 <- separate(data = Df2, col = adj1, into = c("adj1","adj1value"), sep = "\\:")
Df2 <- separate(data = Df2, col = adj2, into = c("adj2","adj2value"), sep = "\\:")
Df2 <- separate(data = Df2, col = adj3, into = c("adj3","adj3value"), sep = "\\:")

But the cells are not in the right columns. Df2 now is as below:

"TOE" "Veh" "c0" "cur" "adj1" "adj1value" "adj2" "adj2value" "adj3" "adj3value"
deairfor 22 30 30 NA NA NA NA NA NA
deairfor 23 40 20 1 35 2 20 NA NA
deairfor 26 2 2 2 10 3 30 NA NA 

While this last row needs to be: (once the adj1values are in the proper columns we can also drop adj1/adj2/adj3)

deairfor 26 2 2 NA NA 2 10 3 30 

I've tried numerous ways to move these cells to the right, but constantly get errors, e.g.: (the adj* columns are characters after separating hence the "1")

Df2$adj3[Df2$adj1 == "1"] <- Df2$adj2
Df2$adj3value[Df2$adj1 == "1"] <- Df2$adj2value
"NAs are not allowed in subscripted assignments"

So question: How do I move these values to the proper column?

"TOE" "Veh" "c0" "cur" "adj"
deairfor 26 2 2 2:10,3:30 

Should become

"TOE" "Veh" "c0" "cur" "adj1" "adj2" "adj3"
deairfor 26 2 2 NA 10 30 

Bonus question: I get the feeling I need to use many lines because the XML import at the beginning isn't quite optimal, anyway to do that better given the goal that I have?

M.R.
  • 36
  • 5
  • Try some of what this post uses to create a frame from xml and see if it works for you. http://stackoverflow.com/questions/17198658/how-to-parse-xml-to-r-data-frame – sconfluentus Jan 26 '17 at 13:09
  • Curious, your posted xml does not match url as webpage has no *adj* attribs. – Parfait Jan 26 '17 at 23:58
  • Yeah webpage updates over time. Adj will only appear in two weeks again unfortunately. – M.R. Jan 28 '17 at 13:35

2 Answers2

1

I would write a function that can add NAs to the prefixed adj string and then use tidyr's separate

add_NAs <- function(x, n=3){
   y <- strsplit(x, ",")
   sapply( y, function(z){  
      n <- match( 1:n, substr(z,1,1))
      paste(substring(z, 3)[n], collapse=",")
   })
}
add_NAs( c(NA, "1:35,2:20", "2:10,3:30", "1:20,3:5") )
[1] "NA,NA,NA" "35,20,NA" "NA,10,30" "20,NA,5" 

You can also use xmlAttrsToDataFrame to parse the attributes.

x <- XML:::xmlAttrsToDataFrame(doc["//r[@toe]"], stringsAsFactors=FALSE)
x$adj <- add_NAs(x$adj)
separate(x, adj, c("adj1", "adj2", "adj3"), sep="," , convert=TRUE)
       toe veh c0 cur adj1 adj2 adj3
1 deairfor  22 30  30   NA   NA   NA
2 deairfor  23 40  20   35   20   NA
3 deairfor  26  2   2   NA   10   30
Chris S.
  • 2,185
  • 1
  • 14
  • 14
0

Thanks Chris for your help, really answered all my questions! Final code shown below for anyone who is interested.

I only had to insert a line that downloaded the xml file first otherwise it wouldn't pick it up. Topic I used for this: (https://stackoverflow.com/questions/24139221/reading-and-understanding-xml-in-r) Additionally for this table I wanted the level to 'continue' after the adjustments, which is what I did with the 5 similar rows at the end. So if c0 =10, adj1=20 and adj2=NA then adj2/Tier2=20.

library(XML)
library(tidyr)
add_NAs <- function(x, n=5){
  y <- strsplit(x, ",")
  sapply( y, function(z){  
  n <- match(1:n, substr(z,1,1))
  paste(substring(z, 3)[n], collapse=",")
  })
}

fileURL <- "http://wiretap.wwiionline.com/xml/toes.sheet.xml"
download.file(fileURL, destfile=tf <- tempfile(fileext=".xml"))
doc <- xmlParse(tf)
Test <- XML:::xmlAttrsToDataFrame(doc["//r[@toe]"], stringsAsFactors=FALSE)
Test$adj <- add_NAs(Test$adj)
Test <- separate(data = Test, col = adj, into = c("Tier1","Tier2","Tier3","Tier4","Tier5"), sep = "\\,")
Test$Tier1 <- ifelse(Test$Tier1=="NA",Test$c0,Test$Tier1)
Test$Tier2 <- ifelse(Test$Tier2=="NA",Test$Tier1,Test$Tier2)
Test$Tier3 <- ifelse(Test$Tier3=="NA",Test$Tier2,Test$Tier3)
Test$Tier4 <- ifelse(Test$Tier4=="NA",Test$Tier3,Test$Tier4)
Test$Tier5 <- ifelse(Test$Tier5=="NA",Test$Tier4,Test$Tier5)
Community
  • 1
  • 1
M.R.
  • 36
  • 5