I'm writing a crawler that retrieves Facebook posts and serializes them as XML.
My problem is the following: some messages contain special characters (such as \b), and when I write them to my XML they are serialized as &#8;. If I try to read this XML back with the Java DOM parser (with the &#8; in it), I get an error because the parser cannot handle this character reference.
How can I solve this?
Data examples: http://pastebin.com/3xEK5QbV
The error the parser gives when I load the resulting XML is below (the message is in Spanish; "La referencia de caracteres" means "The character reference"):
[Fatal Error] out.xml:7:59: La referencia de caracteres "&#
org.xml.sax.SAXParseException; systemId: file:/Z:/Programas/Workspace%20Eclipse/workspace/Test/out.xml; lineNumber: 7; columnNumber: 59; La referencia de caracteres "&#
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at Test.loadBadXML(Test.java:43)
    at Test.<init>(Test.java:32)
    at Test.main(Test.java:139)
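The failure can be reproduced without the crawler. This is only a minimal sketch, assuming the offending reference in my file is &#8; (the class name BadCharRepro and the method parses are made up for illustration):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of my crawler.
final class BadCharRepro {

    // Returns true if the DOM parser accepts the given XML string.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors (such as an illegal character reference) end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Plain text content is accepted.
        System.out.println(parses("<post_message>ok</post_message>"));
        // A reference to U+0008 is rejected: it is not a legal XML 1.0 character.
        System.out.println(parses("<post_message>a&#8;b</post_message>"));
    }
}
```

So the problem seems to be the character itself, not anything specific to my file or to the Facebook data.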
There are three relevant pieces of source code:
First: obtaining the "malformed" data (containing \b) from the Facebook JSON:
// post is the object that holds the post
// URL_BASE_GRAPH and TOKEN are String constants needed to build the URL for the Facebook Graph API
// idPost is the ID of the post that I'm retrieving
String urlStr = URL_BASE_GRAPH + idPost + "?access_token=" + TOKEN;
URL url = new URL(urlStr);
ObjectMapper om = new ObjectMapper();
JsonNode root = om.readValue(url.openStream(), JsonNode.class);
...
JsonNode message = root.get("message");
if (message != null) {
    post.setMessage(message.asText());
}
Second: writing this data as XML:
// outFile is the file to be written
File file = new File(outFile);
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
// root elements
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("groups");
doc.appendChild(rootElement);
....
if (post.getMessage() != null) {
    Element messagePost = doc.createElement("post_message");
    // I've also tried this: messagePost.appendChild(doc.createTextNode(StringEscapeUtils.escapeXml(post.getMessage())));
    messagePost.appendChild(doc.createTextNode(post.getMessage()));
    postEl.appendChild(messagePost);
}
....
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(file);
transformer.transform(source, result);
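One workaround I'm considering for this writing step is to drop every character that XML 1.0 does not allow before calling createTextNode. This is only a rough sketch, not a confirmed fix: XmlSanitizer and stripInvalidXmlChars are names I made up, and the character ranges are my reading of the Char production in the XML 1.0 specification. It also discards the offending characters silently, which may not be acceptable if the original text has to be preserved exactly.

```java
// Hypothetical helper class, not part of any library.
final class XmlSanitizer {

    private XmlSanitizer() {
    }

    // True if the code point is allowed in an XML 1.0 document:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of text without the characters that XML 1.0 forbids.
    static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isLegalXml10(codePoint)) {
                cleaned.appendCodePoint(codePoint);
            }
            // Advance by one or two chars, depending on surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return cleaned.toString();
    }
}
```

With this, the text node would be created as doc.createTextNode(XmlSanitizer.stripInvalidXmlChars(post.getMessage())) instead of passing the raw message.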
Third: loading the XML file (malformed, containing &#8;) again:
File fXmlFile = new File(f);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
....
Node pstNode = postNode.item(j);
if (pstNode.getNodeType() == Node.ELEMENT_NODE) {
    Element pstElement = (Element) pstNode;
    String pstMessage = null;
    if (pstElement.getElementsByTagName("post_message").item(0) != null)
        pstMessage = pstElement.getElementsByTagName("post_message").item(0).getTextContent();
Any thoughts?
Thanks!