1

I have a few thousand XML files that I would like to read into R. The problem is that some of these files have three special characters "ï»¿" at the beginning of the file that stop xmlTreeParse from reading the XML file. The error that I get is the following...

Error: 1: Start tag expected, '<' not found

This is caused by the first line of the XML file, which is the following...

ï»¿<?xml version="1.0" encoding="utf-8"?>

If I manually remove the characters using Notepad, the XML file begins as follows and I am able to read it...

<?xml version="1.0" encoding="utf-8"?>

I'd like to be able to remove the characters automatically. The following is the code that I have written currently.

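library(XML)

# list.files() takes a regular expression, not a glob
filenames <- list.files("...filepath...", pattern = "\\.xml$", full.names = TRUE)

files <- lapply(filenames, function(f) {
  # Report the file that fails to parse and skip it
  xmlfile <- tryCatch(xmlTreeParse(file = f), error = function(e) { print(f); NULL })
  if (is.null(xmlfile)) return(NULL)
  xmltop <- xmlRoot(xmlfile)
  plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  name <- unname(plantcat$EntityNames)
  return(name)
})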

I'm wondering how I can remove the special characters in R so that the XML file can be read. I have tried tryCatch, as you can see above, but I'm not sure how I can edit the XML file without actually reading it in first. Any help would be appreciated!

Edit: Using the following parsing code fixed the problem. I think that when I opened the XML file in Notepad it was showing "ï»¿", but the string actually stored in the file was slightly different. It's possible that this was due to the encoding of the file, but I'm not sure of the specifics. Thank you @Prem.

xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
Atreya Dey

3 Answers

1

The special characters at the beginning might come from the file being saved in a different encoding than the one it is read with, especially if your XML contains non-ASCII characters.

Try to specify the encoding. To identify which encoding is used, open the file in a hex editor and read the first bytes.

My hunch is that your special characters come from a BOM (byte order mark):
http://unicode.org/faq/utf_bom.html
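
If a hex editor is not at hand, the first bytes can also be inspected from R itself. The following is a minimal sketch, not part of the original answer; `has_utf8_bom` is a hypothetical helper name, and it only checks for the UTF-8 BOM (bytes EF BB BF), not for UTF-16 or UTF-32 marks.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Self-contained demo: write a small XML file that begins with a BOM.
bom_file <- tempfile(fileext = ".xml")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw('<?xml version="1.0" encoding="utf-8"?><a/>')),
  bom_file
)

# And one without a BOM, for comparison.
plain_file <- tempfile(fileext = ".xml")
writeBin(charToRaw('<?xml version="1.0" encoding="utf-8"?><a/>'), plain_file)

has_utf8_bom(bom_file)    # TRUE
has_utf8_bom(plain_file)  # FALSE
```

Running `Filter(has_utf8_bom, filenames)` over the list of files would then show which ones are affected.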

alex.pulver
  • I tried the following command on a specific problem XML file and got the same "Start tag expected" error: `xmlfile <- xmlParse(file = "filename.xml", encoding = "UTF-8-BOM")` – Atreya Dey Feb 09 '18 at 12:00
0

Have you tried the gsub function? It is a very convenient function for replacing (and deleting) characters. This works for me:

gsub('ï»¿', '', string, fixed = TRUE)

on a variable string = 'ï»¿<?xml version="1.0" encoding="utf-8"?>'.

EDIT: I would also suggest using the sed tool if you're on a computer with GNU/Linux. It's a very powerful tool that would handle this task perfectly.

elcortegano
  • I've tried this, but I have to be able to read the file in as a string first and then use gsub. Do you know of a way to read an XML file as a string? – Atreya Dey Feb 09 '18 at 11:33
  • What you could do then is to parse your XML file to a dataframe (see [here](https://stackoverflow.com/questions/17198658/how-to-parse-xml-to-r-data-frame)) and use `gsub` over each row for instance. – elcortegano Feb 09 '18 at 11:41
  • Sorry, I think I was a bit unclear. When I use either xmlParse or xmlTreeParse, I get the error "Error: 1: Start tag expected, '<' not found", so I'm unable to put the file into a dataframe. – Atreya Dey Feb 09 '18 at 11:55
  • Mm... in my opinion you could try `sed` in GNU/Linux. Is that possible for you? You could then write something similar to `sed -i -e '1s/^\xEF\xBB\xBF//' your_files.xml`. – elcortegano Feb 09 '18 at 12:08
0

In your code, use readLines to read the file as text; gsub can then be used to remove the junk characters from the string before parsing.

xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
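
How the BOM shows up after readLines depends on the locale and the encoding R assumes, so a pattern that works on one machine may not match on another. As an alternative sketch (not from the original answer), the three BOM bytes can be dropped before the text is ever decoded. `strip_utf8_bom` is a hypothetical helper name, and the code assumes the files are UTF-8 and contain no embedded NUL bytes.

```r
# Hypothetical helper: read a file as raw bytes, drop a leading UTF-8 BOM
# (EF BB BF) if present, and return the rest as a single UTF-8 string.
strip_utf8_bom <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  txt <- rawToChar(bytes)
  Encoding(txt) <- "UTF-8"
  txt
}

# Self-contained demo on a temporary file that starts with a BOM.
demo_file <- tempfile(fileext = ".xml")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw('<?xml version="1.0" encoding="utf-8"?><root><EntityNames>a</EntityNames></root>')),
  demo_file
)

xml_text <- strip_utf8_bom(demo_file)
startsWith(xml_text, "<?xml")  # TRUE

# Parse the cleaned text, if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  xmlfile <- XML::xmlTreeParse(xml_text, asText = TRUE)
}
```

Inside the lapply from the question, `xmlTreeParse(strip_utf8_bom(f), asText = TRUE)` would take the place of `xmlTreeParse(file = f)`; files without a BOM pass through unchanged.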
Prem