I have a few thousand XML files that I would like to read into R. The problem is that some of these files have three special characters, "ï»¿", at the beginning of the file, which stop xmlTreeParse from reading them. The error that I get is the following...
Error: 1: Start tag expected, '<' not found
This is due to the first line of the XML file, which is the following...
ï»¿<?xml version="1.0" encoding="utf-8"?>
If I manually remove the characters in Notepad, the file begins as follows and I am able to read it...
<?xml version="1.0" encoding="utf-8"?>
I'd like to be able to remove the characters automatically. The following is the code that I have written currently.
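library(XML)
# pattern is a regular expression, not a glob, so match the ".xml" suffix explicitly
filenames <- list.files("...filepath...", pattern = "\\.xml$", full.names = TRUE)
files <- lapply(filenames, function(f) {
  xmlfile <- tryCatch(xmlTreeParse(file = f), error = function(e) {
    message("Could not parse: ", f)  # report the failing file
    NULL                             # return NULL rather than the printed path
  })
  if (is.null(xmlfile)) return(NULL)  # skip files that failed to parse
  xmltop <- xmlRoot(xmlfile)
  plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  name <- unname(plantcat$EntityNames)
  return(name)
})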
I'm wondering how I can read the XML files into R with the special characters removed. I have tried tryCatch, as you can see above, but I'm not sure how I can edit an XML file without actually reading it in first. Any help would be appreciated!
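One way to get at the file contents without going through the XML parser is to read the raw bytes and keep everything from the first "<" onwards. The following is only a minimal sketch: `read_from_first_tag` is a hypothetical helper name, the temporary file stands in for a real one, and it assumes the files use an ASCII-compatible encoding such as UTF-8 (it would not work for UTF-16).

```r
# Hypothetical helper: return the file's text starting at the first "<",
# dropping whatever bytes come before it (a byte order mark, stray characters, ...).
# Assumes an ASCII-compatible encoding such as UTF-8.
read_from_first_tag <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  lt <- which(bytes == as.raw(0x3C))  # positions of the "<" byte
  if (length(lt) == 0L) stop("no '<' found in ", path)
  rawToChar(bytes[lt[1]:length(bytes)])
}

# Demo on a temporary file that starts with the three UTF-8 BOM bytes
xml_text <- '<?xml version="1.0" encoding="utf-8"?><root><a>1</a></root>'
tmp <- tempfile(fileext = ".xml")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw(xml_text)), tmp)

clean <- read_from_first_tag(tmp)
startsWith(clean, "<?xml")
```

The cleaned string could then be handed to the parser as text, e.g. `xmlTreeParse(clean, asText = TRUE)`, instead of passing the file path.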
Edit: Using the following parsing code fixed the problem. I think that when I opened the XML file in Notepad it was showing "ï»¿", but in reality it was the following string: "Ã¯Â»Â¿". It's possible that this was due to the encoding of the file, but I'm not sure of the specifics. Thank you @Prem.
xmlfile <- xmlTreeParse(gsub("Ã¯Â»Â¿","",readLines(f)), asText=TRUE)
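Since the exact characters depend on how the file is decoded for display, it may be easier to look at the bytes themselves. This is a rough sketch, with `leading_bytes` as a hypothetical helper and a temporary file standing in for a real one. It prints whatever comes before the first "<". As far as I understand, `ef bb bf` would be a plain UTF-8 byte order mark, whereas `c3 af c2 bb c2 bf` would mean the characters "ï»¿" are stored literally in the file.

```r
# Hypothetical helper: return the raw bytes that precede the first "<"
# within the first n bytes of a file.
leading_bytes <- function(path, n = 16L) {
  bytes <- readBin(path, what = "raw", n = n)
  lt <- which(bytes == as.raw(0x3C))  # positions of the "<" byte
  if (length(lt) == 0L) return(bytes) # no "<" in the first n bytes
  bytes[seq_len(lt[1] - 1L)]
}

# Demo on a temporary file that starts with a UTF-8 byte order mark
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
tmp <- tempfile(fileext = ".xml")
writeBin(c(bom, charToRaw('<?xml version="1.0" encoding="utf-8"?><a/>')), tmp)

prefix <- leading_bytes(tmp)
print(prefix)
identical(prefix, bom)
```

Running this over the files that fail to parse should show whether they all share the same prefix, which in turn tells you what pattern needs to be removed.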