Parsing Invalid XML in R

Question

I was trying to parse the cars review dataset from the repository provided by http://www.kavita-ganesan.com/entity-ranking-data

The data is a series of files containing text formatted as follows:

<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
.....

This is not valid XML although it really looks like XML.

I've come up with the idea of forcing it to be valid XML by adding the tags <file> and </file> at the beginning and end of the text.

library(XML)

# read the file and wrap its lines in a single <file> root element
file <- c("<file>", readLines("2007/2007_nissan_versa"), "</file>")

# remove characters that break parsing (ampersands and quotes)
file <- gsub(pattern = "[&\"']", replacement = "", x = file)

xmlParse(file)

This works, and the result can be parsed by xmlParse. However, I wonder if there is a more elegant solution out there.
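A note on the gsub step: deleting & and the quote characters changes the review text. A minimal sketch of an alternative, assuming the bare & is the only character that actually breaks parsing (a stray < inside the text would need the same treatment), is to escape it instead. The helper name wrap_as_xml and the sample lines are made up for illustration and are not from the dataset.

```r
# Hypothetical helper (not from the original post): escape bare ampersands
# and wrap the lines in a single root element so the text is well-formed.
wrap_as_xml <- function(lines, root = "file") {
  # Replace "&" with "&amp;" unless it already starts an entity such as
  # "&amp;" or "&#39;". perl = TRUE is needed for the negative lookahead.
  escaped <- gsub("&(?!(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)",
                  "&amp;", lines, perl = TRUE)
  paste(c(paste0("<", root, ">"), escaped, paste0("</", root, ">")),
        collapse = "\n")
}

# Made-up sample lines in the same layout as the dataset
sample_lines <- c(
  "<DOC>",
  "<TEXT>Quiet & comfortable, \"great\" value</TEXT>",
  "</DOC>"
)
xml_text <- wrap_as_xml(sample_lines)
cat(xml_text, "\n")
```

The resulting string should then be usable with xmlParse(xml_text, asText = TRUE) from the XML package, and the quotes survive because they are legal inside element content.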

Thanks @Aurèle. But I wonder if there is a more efficient solution that does not require reading through the data twice and using gsub. – comendeiro Aug 31 '17 at 09:44

Aurèle · Answer 1 · 2017-08-31T10:15:39.267

1

Really, what you tried looks fine to me.

This is more of a toy answer with scan() that shows a different way of parsing such files:

data.frame(scan(
  textConnection("<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>"),
  what = list(X1="", DATE="", AUTHOR="", TEXT="", FAVORITE="", X2=""),
  multi.line = TRUE,
  sep = "\n"
), stringsAsFactors = FALSE)

#      X1                   DATE                     AUTHOR                   TEXT                       FAVORITE     X2
# 1 <DOC> <DATE>Some Text</DATE> <AUTHOR>Some Text</AUTHOR> <TEXT>Some Text</TEXT> <FAVORITE>Some text</FAVORITE> </DOC>
# 2 <DOC> <DATE>Some Text</DATE> <AUTHOR>Some Text</AUTHOR> <TEXT>Some Text</TEXT> <FAVORITE>Some text</FAVORITE> </DOC>

edited Aug 31 '17 at 10:15

answered Aug 31 '17 at 10:10

Aurèle


Thank you for this alternative approach. It does the job, but it still requires an additional step to remove the tags from the variables. Isn't there any way of reading it all in a single pass through the data? – comendeiro Aug 31 '17 at 10:38
Apart from preprocessing the data with a tool such as sed, I don't see how... (yet) – Aurèle Aug 31 '17 at 10:40
In the end, your answer was very useful to me. I had some problems trying to parse it as XML so I opted to treat it as text and parse it line by line. – comendeiro Sep 04 '17 at 08:20
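
The line-by-line approach mentioned in the last comment was not posted, so what follows is only a sketch of how it might look in base R. It assumes every field sits on a single line as <TAG>value</TAG>, as in the sample in the question, and that each record starts with a <DOC> line. The function name parse_docs is hypothetical and the sample values are placeholders.

```r
# Hypothetical helper: turn <DOC> records into a data frame by treating the
# lines as plain text rather than XML.
parse_docs <- function(lines) {
  lines <- trimws(lines)
  # Matches lines of the form <TAG>value</TAG>; <DOC> and </DOC> do not match.
  pattern <- "^<([A-Z]+)>(.*)</\\1>$"
  is_field <- grepl(pattern, lines, perl = TRUE)
  tag   <- sub(pattern, "\\1", lines[is_field], perl = TRUE)
  value <- sub(pattern, "\\2", lines[is_field], perl = TRUE)
  # Each <DOC> line starts a new record; cumsum() gives the record index.
  is_doc <- lines == "<DOC>"
  doc_id <- cumsum(is_doc)[is_field]
  fields <- unique(tag)
  # Fill a character matrix (one row per record, one column per tag).
  m <- matrix(NA_character_, nrow = sum(is_doc), ncol = length(fields),
              dimnames = list(NULL, fields))
  m[cbind(doc_id, match(tag, fields))] <- value
  as.data.frame(m, stringsAsFactors = FALSE)
}

# Placeholder lines in the layout shown in the question
sample_lines <- c(
  "<DOC>", "<DATE>Some Text</DATE>", "<AUTHOR>Some Text</AUTHOR>",
  "<TEXT>Quiet & roomy</TEXT>", "<FAVORITE>Some text</FAVORITE>", "</DOC>",
  "<DOC>", "<DATE>Some Text</DATE>", "<AUTHOR>Some Text</AUTHOR>",
  "<TEXT>Some Text</TEXT>", "<FAVORITE>Some text</FAVORITE>", "</DOC>"
)
reviews <- parse_docs(sample_lines)
print(reviews)
```

With a real file, parse_docs(readLines("2007/2007_nissan_versa")) would read the file once, and characters such as & need no special handling because nothing is parsed as XML. Fields that span several lines would be skipped by this sketch.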

score 0 · Answer 2 · answered Aug 31 '17 at 11:20

Create a wrapper document like this:

<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "actual.xml">
]>
<wrapper>&e;</wrapper>

Where "actual.xml" is your current file (in the same directory); and then parse the wrapper document.

Technically, your input is a well-formed external general parsed entity, but it is not a well-formed document entity. Validity doesn't come into it, because there is no schema or DTD.
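
For completeness, here is a rough sketch of generating such a wrapper from R. The helper name write_wrapper and the default file name wrapper.xml are made up for illustration; the sketch only writes the wrapper and does not parse it.

```r
# Hypothetical helper: write a wrapper document next to the data file.
write_wrapper <- function(data_file, wrapper_file = "wrapper.xml") {
  wrapper <- c(
    "<!DOCTYPE wrapper [",
    sprintf("<!ENTITY e SYSTEM \"%s\">", basename(data_file)),
    "]>",
    "<wrapper>&e;</wrapper>"
  )
  # The entity's system identifier is relative, so keep both files together.
  path <- file.path(dirname(data_file), wrapper_file)
  writeLines(wrapper, path)
  invisible(path)
}

# Demonstration with a throwaway file in the session's temp directory
data_file <- file.path(tempdir(), "actual.xml")
writeLines(c("<DOC>", "<TEXT>Some Text</TEXT>", "</DOC>"), data_file)
wrapper_path <- write_wrapper(data_file)
print(readLines(wrapper_path))
```

Whether &e; is actually expanded depends on the parser settings: libxml2-based parsers generally only substitute entities when that is explicitly enabled (the NOENT parser option). Any bare & in the data file would still have to be escaped first, since the entity's content must itself be well-formed.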
