I am trying to parse XML into R but I am getting this error:

Entity `thinsp` not defined

I found the entity in the file, written as &thinsp;, but I don't know how to deal with it. I would really appreciate your help. I have tried the following:

file1 <- xmlTreeParse("1496019.xml",useInternalNodes = TRUE)
file2 <- xmlParse("1496019.xml",useInternalNodes = TRUE)

A sample of the XML file is below:

<!DOCTYPE om  PUBLIC "" "sm.dtd"><servinfo>
<servinfosub>
<title>Circuit Description</title>
<ptxt>The commanded throttle position (TP) is compared to the actual TP.</ptxt>
</servinfosub>
<servinfosub>
<title>DTC Descriptor</title>
<ptxt>This diagnostic procedure supports the following DTC:</ptxt>
<ptxt>DTC&thinsp;P2101 Throttle Actuator Position Performance</ptxt>
</servinfosub>
<servinfosub>
<title>Diagnostic Aids</title>
<list1 type="unordered-bullet">
<item><ptxt>The throttle valve should be open approximately 20&thinsp;percent. </ptxt></item>
<item><ptxt>If the throttle blade becomes stuck, DTC&thinsp;P1516 and/or P2119 will set. </ptxt></item>
<item>
<important><title>Important</title><ptxt> this function.</ptxt></important>
<ptxt>The scan tool has the ability to operate the throttle control system using Special Functions. </ptxt></item>
<item><ptxt>Inspect for the following conditions:</ptxt></item>
<list2 type="unordered-dash">
<item><ptxt>Use the  <object-link object-id="8917"/> Connector Test Adapter Kit for any test that requires probing the PCM harness connector or a component harness connector.</ptxt></item>
<item><ptxt>Poor connections at the PCM or at the component—Inspect the harness connectors for a poor terminal to wire connection. Refer to  <cell-link cell-id="62112"/> for the proper procedure.</ptxt></item>
<item><ptxt>For intermittents, refer to  <cell-link cell-id="81512"/>.</ptxt></item>
</list2>
</list1>
</servinfosub>
</servinfo>
Karan Pappala
  • Please post reproducible examples http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – lukeA Jun 30 '15 at 09:18

1 Answer

One way to work around this is to preprocess the document and replace the unknown entity before parsing (here it is simply removed):

library(XML)
txt <- '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entry>abc&thinsp;</entry>'
xml <- xmlParse(txt, asText = TRUE)
# Error: 1: Entity 'thinsp' not defined

txt <- gsub("&thinsp;", "", txt, fixed = TRUE)
(xml <- xmlParse(txt, asText = TRUE))
# <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
# <entry>abc</entry>  
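
If the thin space should be kept rather than dropped, the same preprocessing idea can be applied to the file on disk. The following is only a sketch: `replace_thinsp` and `parse_with_thinsp` are hypothetical helper names, not functions from the XML package, and it assumes the file is small enough to read into memory. It swaps the named entity for the numeric character reference `&#8201;` (U+2009, thin space), which should not need a DTD declaration.

```r
# Hypothetical helper (not part of the XML package): replaces the undefined
# named entity with its numeric character reference for U+2009 (thin space).
replace_thinsp <- function(txt) {
  gsub("&thinsp;", "&#8201;", txt, fixed = TRUE)
}

# Hypothetical helper: reads the file contents first, because asText = TRUE
# expects the XML text itself rather than a file name.
parse_with_thinsp <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  XML::xmlParse(replace_thinsp(txt), asText = TRUE)
}

replace_thinsp("<ptxt>DTC&thinsp;P2101</ptxt>")
# [1] "<ptxt>DTC&#8201;P2101</ptxt>"
```

With the XML package installed, `doc <- parse_with_thinsp("1496019.xml")` would then be the call for the file from the question.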
lukeA
  • I tried it and got the following error: [Error: XML content does not seem to be XML: '1496019.xml'] – Karan Pappala Jun 30 '15 at 13:21
  • I cannot reproduce your error. When I fill `txt` with the example you provided (like I did above) and run `txt <- gsub("&thinsp;", "", txt, fixed = TRUE); (xml <- xmlParse(iconv(txt, to = "UTF-8"), asText = TRUE))`, I get the desired XMLInternalDocument object. – lukeA Jun 30 '15 at 14:36
  • @KaranPappala If this answer helps, then you can check it and mark the topic as solved. – lukeA Jul 02 '15 at 20:19