0

I have a Web Service written in Java. I want to send some strings as an XML document, but these strings may contain characters that are illegal in XML. Currently I replace all of them with ?, build the XML, and send it over the network to a Silverlight app. But sometimes all I get are question marks! So I want to encode the strings before sending them and decode them on receipt, so that the client gets the exact original strings. The strings are UTF-8 encoded. I'm using something like this to create the XML:

try{
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

    //root elements
    Document doc = docBuilder.newDocument();
    Element rootElement = doc.createElement("SearchResults");
    rootElement.setAttribute("count", Integer.toString(total));
    doc.appendChild(rootElement);

    for(int i = 0; i < results.size(); i++)
    {
        Result res = results.get(i);
        //title
        Element title = doc.createElement("Title");
        title.appendChild(doc.createTextNode(res.title));
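        // NOTE: searchRes is not declared in this excerpt; presumably it is a
        // per-result element created in code that was left out of the post.
        searchRes.appendChild(title);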

        //...
    }
    //write the content into xml file
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    DOMSource source = new DOMSource(doc);
    StringWriter sw = new StringWriter();
    StreamResult result =  new StreamResult(sw);
    transformer.transform(source, result);
    String ret = sw.toString();
    return ret;
}
catch(ParserConfigurationException pce){
    pce.printStackTrace();
}catch(TransformerException tfe){
    tfe.printStackTrace();
}
return null;

Thank you.

PS: Some people said they didn't understand my question, so let me clarify it with an example. Suppose I have an array of items.
Each item has 3 strings.
These strings are UTF-8 strings (from many languages).
I want to send these strings to the client via a Web Service in Java.
The client part is Silverlight. In the Silverlight app,
I get the XML, parse it, and use LINQ to extract the data that I use in the app.
When I try to escape the characters, the parser in Silverlight throws an exception saying that there's an illegal character in the source (XML) string. After debugging, I found that there actually IS an illegal character, but I don't know how to produce a guaranteed-legal XML string.

Edit: Thank you all for your support. I REALLY appreciate it.
I solved my problem.
Turns out somewhere in my code I was producing an illegal character and appending it to my result string.
The question still remains: how can I produce a legal XML file even when I give it some illegal characters? (I solved my problem by eliminating the illegal character before producing the XML, so I still wonder what I would do if I needed to send it over.) But since my problem is solved and there are plenty of answers here, future readers have a head start on this problem.
I didn't have the time but I'm sure these will help.
There are lots of helpful answers, so I cannot single out one of them as the answer to my specific problem.
But I have to choose one of them.
I sincerely thank everyone who responded.

csharpwinphonexaml
Alireza Noori
  • Just encode the characters correctly in the first place. A good approach is using the -construction. – Thorbjørn Ravn Andersen Apr 15 '11 at 19:22
  • @Thorbjorn (sorry, not an EU keyboard) - that's escaping, not encoding, and it won't help with characters like 0x01, which are not permitted under XML 1.0. – Anon Apr 15 '11 at 20:49
  • @Alireza - I notice that you're converting the output to a String and then presumably writing it to a stream. A better approach (because it avoids possible encoding bugs) is to pass that stream directly to the transformer. – Anon Apr 15 '11 at 20:55
  • @Anon: In my Web Method, I return this string (`ret` in the code above) as the result. Sorry, I didn't quite get what you meant :D – Alireza Noori Apr 15 '11 at 21:38
  • If there's a better way to convert `doc` to an XML string, please let me know. Thanks – Alireza Noori Apr 15 '11 at 21:44
  • @Alireza - what happens after you return the string from your method? I'm assuming that you then write it to an `OutputStream`. If that's the case, then you need to be aware of the encoding used when writing. And if you're not explicitly setting it, chances are good that it's *not* UTF-8. – Anon Apr 15 '11 at 22:11
  • To debug this, I suggest checking the original strings to see if they contain illegal characters before you convert them to XML. If they don't, then the problem is how you write the string to the output. – Anon Apr 15 '11 at 22:12
  • And if you need more information about XML output to streams, the top "related" question should help you: http://stackoverflow.com/questions/443305 – Anon Apr 15 '11 at 22:13
  • @Anon I just return the string from the Web Method function (i.e. `return XMLOps.getXML(...)`). Also, when I try `cleanStringForXml(res.description, '?')`, i.e. removing illegal characters from the source string and replacing them with ?, everything works fine. So I guess the `escapeXml` function somehow cannot convert the illegal characters properly, because the string it returns still contains some illegal characters. In .Net there's a class that creates legal XML for you without any effort. Isn't there something like that in Java? – Alireza Noori Apr 15 '11 at 22:32
  • @Alireza - OK, I'm getting confused. Are you generating the XML with Java or .Net? You describe using `XmlOps`, which as far as I can tell is a .Net class. I've been assuming, from your first sentence, that you're generating the XML in Java. – Anon Apr 16 '11 at 11:42
  • I'm generating XML in Java and using it in .Net (Silverlight) Sorry to cause confusion but XMLOps is my custom written class in Java:D – Alireza Noori Apr 16 '11 at 18:17
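
Anon's suggestion in the comments above (hand the output stream to the transformer instead of going through a `String`) could be sketched roughly as follows. The class and method names (`XmlStreamWriter`, `write`, `demo`) are made up for illustration, and the sketch assumes you have access to the response `OutputStream`, which may not be the case in every Web Service framework.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the original post.
final class XmlStreamWriter {

    // Serializes the DOM straight to bytes, so the encoding named in the
    // XML declaration and the encoding of the bytes cannot drift apart.
    static void write(Document doc, OutputStream out) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    // Small demonstration: builds a one-element document and returns its bytes.
    static byte[] demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("SearchResults");
            // '<' is escaped by the serializer; no manual escaping is needed.
            root.appendChild(doc.createTextNode("caf\u00e9 < th\u00e9"));
            doc.appendChild(root);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            write(doc, out);
            return out.toByteArray();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that this only addresses encoding mismatches. It does not make characters such as 0x01 legal; those still have to be removed or encoded before they reach the DOM.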

5 Answers

3

If you're sending non-character data (binary data, for example) in your XML, you might encode it using Base64. But I'm not sure I've understood your question correctly.

Maybe you just forgot to encode your XML in UTF-8, using `transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8")`.
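
As a rough sketch of the Base64 idea: the helper below turns an arbitrary string into Base64 text, which contains only characters that are legal in XML, and back again. It uses `java.util.Base64`, which only exists in Java 8 and later (on older JDKs a library such as Apache Commons Codec would be needed); the class name `Base64Text` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the original answer.
final class Base64Text {

    // Encodes the UTF-8 bytes of the string as Base64 text (A-Z, a-z, 0-9, +, /, =).
    static String encode(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode(): Base64 text back to the original string.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

The encoded value would go into the text node, e.g. `doc.createTextNode(Base64Text.encode(res.title))`, and the client would have to apply the matching Base64 decode followed by a UTF-8 decode. The cost is that the XML is no longer human-readable and grows by about a third.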

JB Nizet
  • +1. No other form of XML escaping will let you have characters like '\0' in XML. – Alexei Levenkov Apr 15 '11 at 20:33
  • Thanks. These are not binary data (they're some strings clipped from web pages) and I don't know how to encode in Base64. Could you provide me a little tutorial or an example? – Alireza Noori Apr 15 '11 at 21:37
  • One more thing, using `transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");` didn't help. – Alireza Noori Apr 15 '11 at 21:43
  • Alireza, take a look at [Apache Commons Codec](http://commons.apache.org/codec/). – Anthony Accioly Apr 16 '11 at 05:09
  • You can use the [BCodec](http://commons.apache.org/codec/api-release/org/apache/commons/codec/net/BCodec.html) encode method. Or something like [this](http://www.kodejava.org/examples/375.html). – Anthony Accioly Apr 16 '11 at 05:28
0

Not sure I understand your question, but maybe you should wrap the data in a CDATA section so that it's not parsed by the XML parser. Here is the documentation from MSDN.

lobster1234
  • CDATA does not permit "illegal" characters. Here is the documentation from the W3C: http://www.w3.org/TR/xml/#dt-cdsection – Anon Apr 15 '11 at 20:47
0

Wrap the content with <![CDATA[ and ]]>.

More info here: http://www.w3schools.com/xml/xml_cdata.asp

Jonas Kongslund
  • CDATA is a good approach when you don't want the XML to be parsed (that's the tag's original function). But since he is building the XML from scratch to be consumed, a more recommended (and just as simple) way would be to escape the Strings. – Anthony Accioly Apr 15 '11 at 19:35
  • CDATA won't allow you to use "illegal" characters (such as 0x01, SOH). It exists so that you can use characters that would normally need escaping, like `<`. But even then, it's not particularly useful. – Anon Apr 15 '11 at 20:48
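
The point made in these comments can be checked with the JDK's own parser. The sketch below (class name `CdataCheck` is hypothetical) reports whether a string parses as XML; a CDATA section containing `<b>` should parse, while one containing the control character 0x01 should be rejected as not well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original thread.
final class CdataCheck {

    // Returns true if the string is well-formed XML according to the JDK parser.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. an illegal character inside CDATA.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `CdataCheck.parses("<a><![CDATA[<b>]]></a>")` is expected to return `true`, whereas the same document with `(char) 1` inside the CDATA section is expected to return `false`.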
0

From experience, I would recommend escaping / unescaping XML. Take a look at StringEscapeUtils from Apache Commons Lang.

Anthony Accioly
  • I tried it like this: `desc.appendChild(doc.createTextNode(StringEscapeUtils.escapeXml(res.description)));` but in the silverlight part, when I use this: `XDocument xmlStories = XDocument.Parse(xmlContent);` I get an exception saying that there's an illegal character in the XML! – Alireza Noori Apr 15 '11 at 20:24
  • Characters like '\0' are illegal in XML. There is no way to escape them (short of custom encoding; see JB Nizet's answer for using Base64). – Alexei Levenkov Apr 15 '11 at 20:35
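
Since escaping cannot help with such characters, one lossy alternative is to filter them out before building the DOM, which is essentially what the asker's own `cleanStringForXml` seems to do. Below is a minimal sketch based on the `Char` production of XML 1.0; the names `XmlChars`, `isLegal`, and `replaceIllegal` are invented for illustration and are not from any library.

```java
// Hypothetical helper, not part of the original thread.
final class XmlChars {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegal(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 forbids with the given replacement
    // (pass "" to drop it). Iterates by code point so that valid surrogate pairs
    // are kept and unpaired surrogates are treated as illegal.
    static String replaceIllegal(String s, String replacement) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegal(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Usage would be something like `doc.createTextNode(XmlChars.replaceIllegal(res.title, "?"))`. This guarantees a legal text node but loses the original characters; if they must survive the trip, a reversible encoding such as Base64 is the option mentioned above.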
0

You should try StringEscapeUtils from Apache Commons Lang.

rekaszeru