
Create a file with this content:

<xml>yen symbol - ¥</xml>

Open the file in Firefox and you get this error:

XML Parsing Error: not well-formed
Location: file:///test.xml
Line Number 1, Column 19:
<xml>yen symbol - </xml>
------------------^

How can I escape special characters in XML?

Note: I'm using the .NET XmlDocument.OuterXml property to retrieve the XML. For some reason, .NET doesn't escape the yen character automatically.

Update: The real problem is that I construct the XML in .NET code and push it over HTTP to Solr. The Java code inside Solr breaks because it treats the yen character as malformed XML. I set the encoding to UTF-8.

' Requires "Imports System.IO" and "Imports System.Net" at the top of the file.
Public Shared Sub UpdateRecords(p_SolrRecordCollection As SolrRecordCollection, Optional commit As Boolean = True, Optional optimize As Boolean = True)
    Try
        Dim webClientInstance As New WebClient()
        webClientInstance.Headers.Add("Content-Type", "text/xml")
        ' UploadString converts the string to bytes using this encoding.
        webClientInstance.Encoding = System.Text.Encoding.UTF8
        Dim xml = p_SolrRecordCollection.XmlDocument.OuterXml
        Dim params As String = String.Format("?commit={0}&optimize={1}", commit.ToString.ToLower, optimize.ToString.ToLower)
        webClientInstance.UploadString(SolrURL + UpdateRelativeURL + params, xml)
    Catch ex As WebException
        Dim responseText As String = String.Empty
        If ex.Response IsNot Nothing Then
            Using reader = New StreamReader(ex.Response.GetResponseStream())
                ' Append the response body to the " :" prefix instead of overwriting it.
                responseText = " :" & ControlChars.NewLine & reader.ReadToEnd()
            End Using
        End If
        Throw New Exception("Request to Solr failed" & responseText, ex)
    End Try
End Sub

This is the error reported by Solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">135</int></lst><lst name="error"><str name="msg">[com.ctc.wstx.exc.WstxLazyException] Illegal character entity: expansion character (code 0xb) not a valid XML character
 at [row,col {unknown-source}]: [827,871]</str><str name="trace">[com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xb) not a valid XML character
 at [row,col {unknown-source}]: [827,871]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xb) not a valid XML character
 at [row,col {unknown-source}]: [827,871]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
    at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2400)
    at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2346)
    at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1205)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4677)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 36 more
</str><int name="code">500</int></lst>
</response>
MD Luffy
  • Possible duplicate of [String escape into XML](http://stackoverflow.com/questions/1132494/string-escape-into-xml) – Vova Sep 30 '15 at 23:23

3 Answers


The file you are creating is not being saved as UTF-8; it is probably in a legacy single-byte encoding such as ANSI. You can check this by opening it in Notepad or any other text editor that can save files as UTF-8. In Notepad's "Save As..." dialog there is an Encoding drop-down, and its default value is the encoding the file is already in.

You do not need to escape the yen character at all. Once the file is converted to UTF-8, Firefox or any other XML parser should have no issue with it.

Your error messages lead me to believe that the yen character is a red herring.

expansion character (code 0xb) not a valid XML character

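Code 0xb is the vertical tab control character (U+000B), which XML 1.0 does not allow, not even as a character reference. The phrase "character entity" suggests the parser hit a reference such as &#xB; in the document rather than a raw byte. It sounds like there is some corruption in an encoding conversion. I'm not sure what encoding your SolrRecordCollection object is returning, but I'm guessing it's UTF-8. If you can, find out what encoding the XmlDocument is returning.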
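If the data really does contain characters that XML 1.0 forbids, one option is to drop them before they reach the XML. The following is a minimal sketch, not a confirmed fix for this case: `StripInvalidXmlChars` and the `XmlSanitizer` module are hypothetical names, and it assumes .NET 4.0 or later, where `XmlConvert.IsXmlChar` and `XmlConvert.IsXmlSurrogatePair` are available.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Module XmlSanitizer
    ' Hypothetical helper: returns a copy of the input without characters
    ' that XML 1.0 forbids (for example U+000B, the vertical tab).
    Public Function StripInvalidXmlChars(input As String) As String
        If String.IsNullOrEmpty(input) Then Return input

        Dim sb As New StringBuilder(input.Length)
        Dim i As Integer = 0
        While i < input.Length
            Dim c As Char = input(i)
            If i + 1 < input.Length AndAlso XmlConvert.IsXmlSurrogatePair(input(i + 1), c) Then
                ' Keep valid surrogate pairs (characters above U+FFFF) intact.
                sb.Append(c).Append(input(i + 1))
                i += 2
            ElseIf XmlConvert.IsXmlChar(c) Then
                sb.Append(c)
                i += 1
            Else
                ' Skip anything XML 1.0 does not allow.
                i += 1
            End If
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        ' Sample value with an embedded vertical tab and a yen sign (U+00A5).
        Dim dirty As String = "price" & ChrW(&HB) & " in " & ChrW(&HA5)
        Console.WriteLine(StripInvalidXmlChars(dirty))
    End Sub
End Module
```

A helper like this would most likely need to run on the raw field values before they are placed in the XmlDocument, because by the time OuterXml is read the character may already have been written as a reference such as &#xB;, which a character-by-character filter will not catch.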

The WebClient.UploadString Method does an encoding conversion:

Before uploading the string, this method converts it to a Byte array using the encoding specified in the Encoding property.

So my guess is that it takes a UTF-8 string, interprets it as a standard .NET UTF-16 string, and then converts this misinterpreted data to UTF-8. If you convert your XML string variable to UTF-16 before passing it to the method, that might fix your problem. Here's a question that covers how to do that:

How do you convert an xml string with UTF-8 encoding UTF-16?

FYI, this article is an easy read that helps explain text encodings:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Brian Pressler
  • It doesn't look like your error message is complaining about any yen characters. See additions to my answer. – Brian Pressler Oct 01 '15 at 22:41
  • I tracked down the exact character (can't show data here) and found it was the yen symbol. For some stupid reason, it's interpreted as 0xb. – MD Luffy Oct 02 '15 at 01:09

Make sure you save the file with an encoding that properly handles the yen character and that will be recognized by Firefox, e.g. UTF-8. (It seems to me Firefox is expecting Unicode if nothing else is specified, but I didn't verify this.) Then there is no need to escape that character.

Even better, add an XML declaration that states the encoding used:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>yen symbol - ¥</xml>
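Since the question builds its XML with XmlDocument, here is a rough sketch of writing such a file from .NET as UTF-8. The `SaveAsUtf8` module name and the `test.xml` output path are assumptions for illustration only. When writing to a file, XmlWriter emits an XML declaration that names the encoding by default.

```vbnet
Imports System.Text
Imports System.Xml

Module SaveAsUtf8
    Sub Main()
        Dim doc As New XmlDocument()
        ' ChrW(&HA5) is the yen sign; building it this way avoids depending
        ' on the encoding of the source file itself.
        doc.LoadXml("<xml>yen symbol - " & ChrW(&HA5) & "</xml>")

        Dim settings As New XmlWriterSettings()
        settings.Encoding = New UTF8Encoding(False) ' UTF-8 without a byte order mark
        settings.Indent = True

        ' "test.xml" is a hypothetical output path.
        Using writer As XmlWriter = XmlWriter.Create("test.xml", settings)
            doc.Save(writer)
        End Using
    End Sub
End Module
```

The point of going through XmlWriterSettings is that the bytes on disk and the encoding named in the declaration come from the same setting, so they cannot drift apart.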
Martin

I went this route: I recoded my upload logic using JSON, and I handle all the JSON escaping with Newtonsoft's Json library. I know this isn't the right solution to the problem, but it is a working solution after all the XML nightmares I went through.

Ref:

https://wiki.apache.org/solr/UpdateJSON
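For readers who want to try the same route, a minimal sketch of a JSON upload might look like the following. This is not the code I actually run: the `SolrJsonUpload` module, the field names, and the Solr URL are all placeholders, and it assumes the Newtonsoft.Json package is referenced.

```vbnet
Imports System.Collections.Generic
Imports System.Net
Imports System.Text
Imports Newtonsoft.Json

Module SolrJsonUpload
    Sub Main()
        ' Hypothetical document; the field names depend on your Solr schema.
        Dim docs As New List(Of Dictionary(Of String, Object)) From {
            New Dictionary(Of String, Object) From {
                {"id", "1"},
                {"title", "yen symbol - " & ChrW(&HA5)}
            }
        }

        ' Json.NET takes care of escaping quotes, backslashes and control characters.
        Dim json As String = JsonConvert.SerializeObject(docs)

        Using client As New WebClient()
            client.Headers.Add("Content-Type", "application/json")
            client.Encoding = Encoding.UTF8
            ' Placeholder URL: adjust host, port and core for your installation.
            client.UploadString("http://localhost:8983/solr/update/json?commit=true", json)
        End Using
    End Sub
End Module
```

Note that JSON permits control characters such as U+000B when they are written as \u escapes, which is why the character that broke the XML upload does not cause a parse error here.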

MD Luffy