1

I was trying to parse the cars review dataset from the repository provided by http://www.kavita-ganesan.com/entity-ranking-data

The data is a series of files containing text formatted as

<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
.....

This is not valid XML, although it really looks like XML.

I've come up with the idea of forcing it to be valid XML by wrapping the text in <file> and </file> tags, one added at the beginning and one at the end.

library(XML)

# Read the file and wrap its lines in a single root element
file <- c("<file>", readLines("2007/2007_nissan_versa"), "</file>")

# Remove characters that break the parser (&, " and ')
file <- gsub(pattern = "[&\"\']", replacement = "", x = file)

xmlParse(file)

It does work, and the result can be parsed by xmlParse. However, I wonder if there is a more elegant solution out there.

Aurèle
comendeiro

2 Answers

1

Really what you tried looks fine to me.

This is more of a toy answer using scan() that shows a different way of parsing such files:

data.frame(scan(
  textConnection("<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>"),
  what = list(X1="", DATE="", AUTHOR="", TEXT="", FAVORITE="", X2=""),
  multi.line = TRUE,
  sep = "\n"
), stringsAsFactors = FALSE)

#      X1                   DATE                     AUTHOR                   TEXT                       FAVORITE     X2
# 1 <DOC> <DATE>Some Text</DATE> <AUTHOR>Some Text</AUTHOR> <TEXT>Some Text</TEXT> <FAVORITE>Some text</FAVORITE> </DOC>
# 2 <DOC> <DATE>Some Text</DATE> <AUTHOR>Some Text</AUTHOR> <TEXT>Some Text</TEXT> <FAVORITE>Some text</FAVORITE> </DOC>
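
The columns above still carry their tags. As a rough sketch of one way to strip them in base R, assuming every field sits on a single line and every record has the same six lines in the same order (the names `txt`, `parsed` and `fields` are only illustrative):

```r
# Placeholder records in the same shape as the dataset
txt <- "<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>
<DOC>
<DATE>Some Text</DATE>
<AUTHOR>Some Text</AUTHOR>
<TEXT>Some Text</TEXT>
<FAVORITE>Some text</FAVORITE>
</DOC>"

# Same scan() call as above, reading from a string instead of a connection
parsed <- data.frame(scan(
  text = txt,
  what = list(X1 = "", DATE = "", AUTHOR = "", TEXT = "", FAVORITE = "", X2 = ""),
  multi.line = TRUE,
  sep = "\n",
  quiet = TRUE
), stringsAsFactors = FALSE)

# Drop the <DOC> and </DOC> columns, which carry no data
fields <- parsed[c("DATE", "AUTHOR", "TEXT", "FAVORITE")]

# Remove the opening tag at the start and the closing tag at the end of each value
fields[] <- lapply(fields, function(x) gsub("^<[^>]+>|</[^>]+>$", "", x))

fields
```

This is still two steps (read, then clean), and it would need adjusting if a field ever spans several lines.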
Aurèle
  • Thank you for this alternative approach. It does the job, but it still requires an additional step to remove the tags from the variables. Isn't there any way of reading it all in a single pass through the data? – comendeiro Aug 31 '17 at 10:38
  • Apart from preprocessing the data with a tool such as sed, I don't see how... (yet) – Aurèle Aug 31 '17 at 10:40
  • In the end, your answer was very useful to me. I had some problems trying to parse it as XML so I opted to treat it as text and parse it line by line. – comendeiro Sep 04 '17 at 08:20
0

Create a wrapper document like this:

<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "actual.xml">
]>
<wrapper>&e;</wrapper>

Where "actual.xml" is your current file (in the same directory); and then parse the wrapper document.

Technically, your input is a well-formed external general parsed entity, but it is not a well-formed document entity. Validity doesn't come into it, because there is no schema or DTD.
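
To make that distinction concrete, here is a minimal sketch in R, assuming the XML package from the question is installed. It does not use the external entity mechanism; it simply pastes a root element around the text, which has the same effect for the parser. The element name `wrapper` is arbitrary, the records are placeholders, and real files may also contain bare `&` characters that have to be dealt with first.

```r
library(XML)

# Placeholder records in the same shape as the dataset
records <- c(
  "<DOC>",
  "<DATE>Some Text</DATE>",
  "<AUTHOR>Some Text</AUTHOR>",
  "<TEXT>Some Text</TEXT>",
  "<FAVORITE>Some text</FAVORITE>",
  "</DOC>",
  "<DOC>",
  "<DATE>Some Text</DATE>",
  "<AUTHOR>Some Text</AUTHOR>",
  "<TEXT>Some Text</TEXT>",
  "<FAVORITE>Some text</FAVORITE>",
  "</DOC>"
)
raw <- paste(records, collapse = "\n")

# Parsed on its own, the text has two top-level elements and is rejected
bare <- try(xmlParse(raw, asText = TRUE), silent = TRUE)
inherits(bare, "try-error")

# Inside a single root element, the same text is accepted
doc <- xmlParse(paste0("<wrapper>", raw, "</wrapper>"), asText = TRUE)

# Each <DOC> child becomes a row, each of its children a column
reviews <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
reviews
```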

Michael Kay