I am trying to parse XML from an existing URL.

URL address: http://www.sozcu.com.tr/2016/yazarlar/ugur-dundar/rss

try {
    URL url = new URL(urlAddress);
    DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
    InputSource ıo = new InputSource(url.openStream());
    Document document = (Document) dBuilder.parse(ıo);
    document.getDocumentElement().normalize();

    NodeList nodeList = document.getElementsByTagName("item");
    for (int i = 0; i < nodeList.getLength(); i++) {
        Node node = nodeList.item(i);
        Element mainElement = (Element) node;
        String link = getXMLAttributeValue("link", mainElement);
        String title = getXMLAttributeValue("title", mainElement);
        String desc = getXMLAttributeValue("desc", mainElement);

        sozcuArticle.add(new Sozcu(link, title, desc));
    }
} catch (ParserConfigurationException | IOException | SAXException ex) {
    System.out.println(ex.getMessage());
}

Here is my getXMLAttributeValue method:

public String getXMLAttributeValue(String tag, Element element) {
    NodeList nodeElement = element.getElementsByTagName(tag);
    Element firstMatch = (Element) nodeElement.item(0);
    return firstMatch.getChildNodes().item(0).getNodeValue();
}

When I run the program, I get the following exceptions:

[Fatal Error] :51:119: Attribute name "async" associated with an element type "script" must be followed by the ' = ' character.
Attribute name "async" associated with an element type "script" must be followed by the ' = ' character.
[Fatal Error] :5:409: Element type "n.length" must be followed by either attribute specifications, ">" or "/>".
Element type "n.length" must be followed by either attribute specifications, ">" or "/>".

I also searched for it on Google but couldn't find a solution. How can I fix this problem?

Thanks.

Karayel
  • One issue I found is that your statement `InputSource ıo` uses an unusual character (a dotless ı) in the variable name. Change it to io or something else. – pczeus Feb 21 '16 at 20:49
  • I was able to reproduce your issue. It also helps to replace the `ex.getMessage()` printout, which doesn't provide much information, with `ex.printStackTrace()`. – pczeus Feb 21 '16 at 20:53
  • The issue is related to the character encoding of the XML. I'm not sure what the character set is, so you may have to experiment. You can set the encoding on your InputSource if you can find the right one, for example: `io.setEncoding("UTF-16");` – pczeus Feb 21 '16 at 21:10
  • My StackTrace : [Fatal Error] :5:409: Element type "n.length" must be followed by either attribute specifications, ">" or "/>". Element type "n.length" must be followed by either attribute specifications, ">" or "/>".[Fatal Error] :51:119: Attribute name "async" associated with an element type "script" must be followed by the ' = ' character. Attribute name "async" associated with an element type "script" must be followed by the ' = ' character. – Karayel Feb 21 '16 at 21:43
  • When I changed the encoding to UTF-16, I got another exception: [Fatal Error] :1:1: Content is not allowed in prolog. Content is not allowed in prolog. – Karayel Feb 21 '16 at 21:44
  • Yes, I have seen that. UTF-16 is still not the right encoding, but I'm not sure what is. – pczeus Feb 21 '16 at 22:25
  • Open http://www.freeformatter.com/xml-formatter.html#ad-output and enter the URL. It reports that the encoding is UTF-8 and the RSS version is 0.92. When I change the encoding to UTF-8, I get the same error. – Karayel Feb 21 '16 at 22:32
  • See http://stackoverflow.com/questions/11577420/fatal-error-11-content-is-not-allowed-in-prolog for the UTF-16 exception. @MvG answered that question, but I am not sure how to fix my problem with it. – Karayel Feb 21 '16 at 22:34
  • It's also possible there is some extra character at the beginning of the XML, which causes the prolog error. Maybe read the entire response into a String and trim() it. – pczeus Feb 21 '16 at 22:40
  • I will try to solve this problem. If I succeed, I will let you know. Thanks for helping. – Karayel Feb 21 '16 at 22:56

1 Answer

The server blocks or redirects clients that send the default Java user agent (something like Java/1.8.0_71). Set your own User-Agent header and it works just fine:

DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
URLConnection uc = url.openConnection();
uc.setRequestProperty("User-Agent", "Karayel's rss reader");
Document document = (Document) dBuilder.parse(uc.getInputStream());
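The errors in the question mention a script element with an async attribute, which suggests the parser was handed an HTML page rather than the RSS feed. Below is a minimal sketch of the same fix as a self-contained class. The names FeedFetcher, openWithUserAgent, parse and countItems are hypothetical, and the User-Agent value is an arbitrary string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original code.
final class FeedFetcher {

    // Opens the URL with an explicit User-Agent header instead of the default "Java/<version>".
    static InputStream openWithUserAgent(String urlAddress, String userAgent) throws IOException {
        URLConnection connection = new URL(urlAddress).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection.getInputStream();
    }

    // Parses the stream into a DOM document; throws SAXException if the content is not well-formed XML.
    static Document parse(InputStream in)
            throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(in);
        document.getDocumentElement().normalize();
        return document;
    }

    // Counts <item> elements in an XML string; returns -1 if the text cannot be parsed as XML.
    static int countItems(String xml) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            return parse(in).getElementsByTagName("item").getLength();
        } catch (ParserConfigurationException | IOException | SAXException ex) {
            return -1;
        }
    }
}
```

A call such as FeedFetcher.parse(FeedFetcher.openWithUserAgent(urlAddress, "Karayel's rss reader")) would replace the url.openStream() path from the question. countItems only exists so the parsing step can be tried on an in-memory string without network access: an HTML fragment containing a script tag with a bare async attribute is rejected by the XML parser, which matches the kind of error reported in the question.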

Parsing feeds manually is tedious and error-prone (your getXMLAttributeValue method throws NullPointerExceptions). I suggest you use something like rometools instead.
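If you do keep the manual approach, the NullPointerException can be avoided by checking whether the tag exists before reading it. RSS items normally carry a description element rather than desc, so a lookup for "desc" likely matches nothing and item(0) returns null. The following is a sketch only: XmlText, childText and firstItemText are hypothetical names, and returning a fallback value is just one possible way to handle a missing tag.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original code.
final class XmlText {

    // Returns the trimmed text of the first descendant element named `tag`,
    // or `fallback` when the parent has no such element.
    static String childText(Element parent, String tag, String fallback) {
        NodeList matches = parent.getElementsByTagName(tag);
        if (matches.getLength() == 0) {
            return fallback;
        }
        // getTextContent() also covers CDATA sections and empty elements.
        return matches.item(0).getTextContent().trim();
    }

    // Demo helper: parses an XML string and reads `tag` from its first <item>.
    static String firstItemText(String xml, String tag, String fallback) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("item");
            if (items.getLength() == 0) {
                return fallback;
            }
            return childText((Element) items.item(0), tag, fallback);
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }
}
```

In the loop from the question, childText(mainElement, "description", "") could stand in for getXMLAttributeValue("desc", mainElement).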

janih