
I'm coding a crawler that retrieves some Facebook posts and serializes them as XML.

My problem is the following: I've found that some messages contain special characters (such as \b), and when I write them to my XML they are serialized as numeric character references.

If I try to reopen this XML with the Java DOM parser, I get an error because the parser cannot handle this character reference.

How can I solve it?

Data examples: http://pastebin.com/3xEK5QbV

The error given by the parser when I load the resulting XML is (the Spanish text "La referencia de caracteres" means "The character reference"; the message is cut off after "&#"):

[Fatal Error] out.xml:7:59: La referencia de caracteres "&#
org.xml.sax.SAXParseException; systemId: file:/Z:/Programas/Workspace%20Eclipse/workspace/Test/out.xml; lineNumber: 7; columnNumber: 59; La referencia de caracteres "&#
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at Test.loadBadXML(Test.java:43)
    at Test.<init>(Test.java:32)
    at Test.main(Test.java:139)

I have three related pieces of source code.

First one: obtaining the "malformed" data (with \b) from the JSON returned by Facebook:

// post is the object that holds the retrieved post
// URL_BASE_GRAPH and TOKEN are String constants needed to build the URL for the Facebook Graph API
// idPost is the ID of the post that I'm retrieving

String urlStr = URL_BASE_GRAPH + idPost + "?access_token=" + TOKEN;
URL url = new URL(urlStr);
ObjectMapper om = new ObjectMapper();
JsonNode root = om.readValue(url.openStream(), JsonNode.class);
...
JsonNode message = root.get("message");
if (message != null) {
    post.setMessage(message.asText());
}

Second one: writing this data as XML:

// outFile is the file to be written
File file = new File(outFile);
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

// root elements
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("groups");
doc.appendChild(rootElement);

....

if (post.getMessage() != null) {
    Element messagePost = doc.createElement("post_message");
    // I've tried also this: messagePost.appendChild(doc.createTextNode(StringEscapeUtils.escapeXml(post.getMessage())));
    messagePost.appendChild(doc.createTextNode(post.getMessage()));
    postEl.appendChild(messagePost);
}

....

TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(file);
transformer.transform(source, result);

Third one: loading the XML (malformed, with the invalid character reference) back from the file:

File fXmlFile = new File(f);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
....
Node pstNode = postNode.item(j);
if (pstNode.getNodeType() == Node.ELEMENT_NODE) {
    Element pstElement = (Element) pstNode;
    String pstMessage = null;
    if (pstElement.getElementsByTagName("post_message").item(0) != null)
        pstMessage = pstElement.getElementsByTagName("post_message").item(0).getTextContent();

Any thoughts?

Thanks!

alejandrorg

2 Answers


Scraping Facebook is against its automated data collection terms. Besides, there's an API for that.

lars.schwarz

The only answer I've found is to use a regexp to remove the characters that are not valid in XML 1.0.

I attach the link:

removing invalid XML characters from a string in java
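
For reference, here is a minimal sketch of that regexp approach, not a definitive implementation. It assumes the goal is simply to drop every code point outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class name XmlSanitizer and the method name stripInvalidXml10 are hypothetical names chosen for illustration; they are not part of any library.

```java
import java.util.regex.Pattern;

// Hypothetical helper (name chosen for illustration only).
final class XmlSanitizer {

    // Matches any code point that is NOT allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    private XmlSanitizer() {
        // static utility, not meant to be instantiated
    }

    // Returns the input with every invalid XML 1.0 character removed.
    // A null input is returned unchanged.
    static String stripInvalidXml10(String text) {
        if (text == null) {
            return null;
        }
        return INVALID_XML10.matcher(text).replaceAll("");
    }
}
```

In the writing code from the question, this would presumably be applied before the text node is created, for example doc.createTextNode(XmlSanitizer.stripInvalidXml10(post.getMessage())). Note that this silently deletes the offending characters; if they carry meaning, replacing them with a placeholder may be preferable.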

alejandrorg