I am not an R pro, merely a humble user merging different pieces of information into one script that fits my needs. Sadly, I stumbled upon a problem I could not solve, which is why I came here:

I'd like to fit data from many XML files into one data frame. One of the variables/columns (extracted from a specific node) is pretty large (lots of text with many line breaks, etc.). When I parse the XMLs, coerce the extracted information into a data frame, and write it to a tab-delimited .txt file, R does not write this one large variable as a single column, which makes it impossible to work with the output as a data frame.

Now, to solve this conundrum, I'd like to insert something like

gsub("[\t\n]", "", xmlValue)

as an argument of the sixth xpathSApply function to get rid of the linebreaks. How can it be integrated? Or is there another answer?

Here is my code so far:

rm(list = ls())
setwd("L:/.../testfiles")
library(XML)

list.files(path = ".", pattern = NULL, all.files = FALSE,
       full.names = FALSE, recursive = FALSE,
       ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

files <- dir(path = ".", pattern = NULL, all.files = FALSE,
         full.names = FALSE, recursive = FALSE,
         ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE) 

for(i in 1:2){

tryCatch({

motion <- xmlParse(files[i]) 
root <- xmlRoot(motion)


frame <- data.frame(
  "." = xpathSApply(root, "//dokument/datum", xmlValue),
  "." = xpathSApply(root, "//dokument/subtyp", xmlValue),
  "." = xpathSApply(root, "//dokument/titel", xmlValue),
  "." = xpathSApply(root, "//dokument/subtitel", xmlValue),
  "." = xpathSApply(root, "//dokintressent//namn", xmlValue),
  "." = xpathSApply(root, "//dokument/html", xmlValue),       ## <- the huge node
  check.names=FALSE, check.rows=FALSE)

colnames(frame)[1] <- "" 
colnames(frame)[2] <- ""
colnames(frame)[3] <- ""
colnames(frame)[4] <- ""
colnames(frame)[5] <- ""
colnames(frame)[6] <- ""


write.table(frame, "L:/.../Satz.txt", 
            sep="\t", append=TRUE, na="NA", row.names=FALSE)

}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})

}

Sample data: 2 files (I am not allowed to post more - just click the links and select "slow download"). http://speedy.sh/S2JsU/modifiedFile1.xml http://speedy.sh/Ce2Jg/modifiedFile2.xml

I seriously hope the solution to this is as hard as it appears to me, and that I don't have to be too ashamed of asking it in front of this noble community.

Thank you all so much!

chwd

1 Answer

It's difficult when you don't provide a reproducible example with sample input data to test possible solutions. But how about defining a new helper function

# Wrapper around xmlValue() that strips tabs and line breaks from the node text
xmlCleanValue <- function(x) {
    gsub("[\t\n]", "", xmlValue(x))
}

and then using it like

xpathSApply(root, "//dokument/html", xmlCleanValue)
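To see the effect in isolation, here is a minimal sketch that runs the helper against a tiny inline document. The XML string is a made-up stand-in (the real files contain many more nodes), and `raw` and `clean` are just illustrative names.

```r
library(XML)

# Hypothetical miniature document standing in for one of the real files.
# The "\n" and "\t" escapes become real line breaks and tabs in the node text.
doc <- xmlParse(
  "<dokument><titel>Test</titel><html>line one\n\tline two\nline three</html></dokument>",
  asText = TRUE
)
root <- xmlRoot(doc)

# Same helper as above: extract the text, then drop tabs and line breaks.
xmlCleanValue <- function(x) {
  gsub("[\t\n]", "", xmlValue(x))
}

raw   <- xpathSApply(root, "//dokument/html", xmlValue)
clean <- xpathSApply(root, "//dokument/html", xmlCleanValue)

grepl("[\t\n]", raw)    # TRUE: the plain value still contains line breaks
grepl("[\t\n]", clean)  # FALSE: the cleaned value does not
```

Note that replacing with `""` glues the words on either side of a line break together; using `" "` as the replacement would keep them apart, if that suits the data better.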
MrFlick
  • Thank you for your quick response! Of course you are absolutely right; tomorrow I will upload some sample data. Unfortunately, I had already written a helper function like this, but it didn't work (no real changes in the output file). Please forgive me for not mentioning it in the post; I thought maybe there was an obvious reason why this approach cannot work. – chwd Aug 26 '14 at 19:33
  • And thank you for the link - this is my first question and I am happy to learn how it is done right! – chwd Aug 27 '14 at 13:13
  • I tested with the files you posted and this function works as expected. There are no line breaks in the returned data.frame. Thus your error is not reproducible. – MrFlick Aug 27 '14 at 18:55
  • You are right, I found what was wrong: I didn't think about the output being appended to the existing output file. So I always saw the same text from earlier (unsuccessful) trials... that explains a lot. My bad, thank you so much! – chwd Aug 28 '14 at 11:57
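One possible way to avoid the stale-output trap described in the last comment is to delete the old file up front, collect the per-file data frames in a list, and write the combined result once instead of using `append=TRUE`. The sketch below uses only base R; the output path, the `frames` list, and its column names are hypothetical placeholders for what the loop in the question would produce.

```r
# Hypothetical output location; the question writes to its own path instead.
out <- file.path(tempdir(), "Satz.txt")

# Start from a clean slate so text from earlier trials cannot linger.
if (file.exists(out)) file.remove(out)

# Stand-ins for the data frames built for each XML file inside the loop.
frames <- list(
  data.frame(titel = "A", html = "text one", stringsAsFactors = FALSE),
  data.frame(titel = "B", html = "text two", stringsAsFactors = FALSE)
)

# Combine once and write once: a single header row, no append needed.
result <- do.call(rbind, frames)
write.table(result, out, sep = "\t", na = "NA", row.names = FALSE)

# Read the file back to confirm it parses as one row per input frame.
check <- read.delim(out, stringsAsFactors = FALSE)
nrow(check)
```

Writing once also avoids the repeated header lines that `write.table(..., append=TRUE)` would otherwise add on every iteration.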