I am trying to parse the car reviews dataset from the repository at http://www.kavita-ganesan.com/entity-ranking-data
The data is a series of files, each containing text formatted like this:
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
.....
This looks like XML, but it is not valid XML: among other things, there is no single root element.
I've come up with the idea of forcing it to be valid XML by adding the tags <file> and </file> at the beginning and end of the text.
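library(XML)
# read the file and wrap its lines in a single <file> root element
file <- c("<file>", readLines("2007/2007_nissan_versa"), "</file>")
# strip the characters that were breaking the parser (&, double and single quotes)
file <- gsub(pattern = "[&\"']", replacement = "", x = file)
# a character vector of length > 1 is treated as XML text, not as a file name
xmlParse(file)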
This works, and the result can be parsed by xmlParse. However, I wonder whether there is a more elegant solution out there.
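
For reference, here is a minimal sketch of the same wrapping idea that escapes bare ampersands instead of deleting them, and then pulls the fields into a data frame. The inline sample text and the names `lines`, `doc` and `reviews` are made up for illustration; I am assuming every <DOC> contains all four child tags, as in the format shown above.

```r
library(XML)

# Hypothetical inline sample standing in for readLines() on one review file
lines <- c(
  "<DOC>",
  "<DATE>01/01/2007</DATE>",
  "<AUTHOR>first author</AUTHOR>",
  "<TEXT>Smooth ride & good mileage</TEXT>",
  "<FAVORITE>Seats</FAVORITE>",
  "</DOC>",
  "<DOC>",
  "<DATE>02/01/2007</DATE>",
  "<AUTHOR>second author</AUTHOR>",
  "<TEXT>A bit noisy on the highway</TEXT>",
  "<FAVORITE>Price</FAVORITE>",
  "</DOC>"
)

# Escape ampersands that do not already start an entity or character
# reference, so the original text survives parsing instead of being stripped
lines <- gsub("&(?!(amp|lt|gt|quot|apos|#[0-9]+);)", "&amp;", lines, perl = TRUE)

# Wrap everything in one root element and parse explicitly as text
doc <- xmlParse(paste(c("<file>", lines, "</file>"), collapse = "\n"),
                asText = TRUE)

# One row per <DOC>; assumes each <DOC> has exactly one of each child tag
reviews <- data.frame(
  date     = xpathSApply(doc, "//DOC/DATE", xmlValue),
  author   = xpathSApply(doc, "//DOC/AUTHOR", xmlValue),
  text     = xpathSApply(doc, "//DOC/TEXT", xmlValue),
  favorite = xpathSApply(doc, "//DOC/FAVORITE", xmlValue),
  stringsAsFactors = FALSE
)
```

This still relies on the <file> wrapper, so it is a tidier version of the same workaround rather than a different approach. A literal "<" inside the review text would still break the parse.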