Cannot extract data from an XML

Question

I'm using the getElementsByTag method to extract data from the following XML document (Yahoo Finance news API, http://finance.yahoo.com/rss/topfinstories).

I'm using the following code. It gets the news items and the titles with no problem using the getElementsByTag method, but for some reason it won't pick up the link when searched by tag. It only picks up the closing tag for the link element. Is it a problem with the XML document or a problem with jsoup?

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class GetNewsXML {
    /**
     * @param args
     */
    public static void main(String args[]) {
        Document doc = null;
        String con = "http://finance.yahoo.com/rss/topfinstories";
        try {
            doc = Jsoup.connect(con).get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the request failed
        }
        Elements collection = doc.getElementsByTag("item"); // Gets each news item
        for (Element c : collection) {
            System.out.println(c.getElementsByTag("title"));
        }
        for (Element c : collection) {
            System.out.println(c.getElementsByTag("link")); // this is the call that fails to return the URL
        }
    }
}

The page source of the link you provided is an XML document. Check the following post, http://stackoverflow.com/questions/9886531/how-to-parse-xml-with-jsoup, or use an XML parser. – , Apr 13 '13 at 22:43
I can parse the data from all the tags in it except, for some strange reason, the contents of the <link> tag. I have used doc.select and getElementsByTag. Both work on all other tags except the <link> tag. When I try to get the contents of the <link> tag, I get this output: – Faber, Apr 14 '13 at 16:46
Based on the link I provided, just replace the line in the try...catch with the following and it will work: doc = Jsoup.parse(new URL(con).openStream(), null, con, Parser.xmlParser()); – , Apr 15 '13 at 14:15
You can define an input stream variable so that you can close it after the operation finishes. – , Apr 15 '13 at 14:16
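The two comments above can be combined into one sketch: parse with jsoup's XML parser, and hold the stream in a variable so it gets closed. This is only an illustration under stated assumptions. The class name XmlParserSketch, its helper methods, and the hand-written SAMPLE feed are hypothetical and not part of the original thread. The stream variant uses the four-argument Jsoup.parse(InputStream, String, String, Parser) overload.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class XmlParserSketch {
    // Hand-written stand-in for the feed (an assumption, not real Yahoo data).
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Example title</title>"
        + "<link>http://example.com/story</link>"
        + "</item></channel></rss>";

    // Parse an XML string with jsoup's XML parser; the base URI is left empty.
    static Document parseXml(String xml) {
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    // Fetch and parse a feed; try-with-resources closes the stream afterwards.
    static Document fetchXml(String feedUrl) throws IOException {
        try (InputStream in = new URL(feedUrl).openStream()) {
            // A null charset leaves encoding detection to jsoup; feedUrl serves as the base URI.
            return Jsoup.parse(in, null, feedUrl, Parser.xmlParser());
        }
    }

    // Text of the first <link> inside the first <item>, or null if there is none.
    static String firstLink(Document doc) {
        Element item = doc.select("item").first();
        if (item == null) {
            return null;
        }
        Element link = item.select("link").first();
        return link == null ? null : link.text();
    }

    public static void main(String[] args) {
        // Uses the inline sample so the sketch runs without network access.
        System.out.println(firstLink(parseXml(SAMPLE)));
    }
}
```

With the XML parser, the URL stays inside the link element, so a plain text() call on it should be enough. To try it against the live feed, fetchXml(con) could be swapped in for parseXml(SAMPLE).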

score 1 · Accepted Answer · answered Apr 14 '13 at 18:01

You get <link /> http://...; the URL is placed after the link tag as a text node.

But this is not a problem:

final String url = "http://finance.yahoo.com/rss/topfinstories";

Document doc = Jsoup.connect(url).get();


for( Element item : doc.select("item") )
{
    final String title = item.select("title").first().text();
    final String description = item.select("description").first().text();
    final String link = item.select("link").first().nextSibling().toString();

    System.out.println(title);
    System.out.println(description);
    System.out.println(link);
    System.out.println("");
}

Explanation:

item.select("link")  // Select the 'link' element of the item
    .first()         // Retrieve the first Element found (since there's only one)
    .nextSibling()   // Get the next sibling after the one found; it's the TextNode with the real URL
    .toString()      // Get it as a String

With your link, this example prints every item like this:

Tax Day Freebies and Deals
You made it through tax season. Reward yourself by taking advantage of some special deals on April 15.
http://us.rd.yahoo.com/finance/news/rss/story/SIG=14eetvku9/*http%3A//us.rd.yahoo.com/finance/news/topfinstories/SIG=12btdp321/*http%3A//finance.yahoo.com/news/tax-day-freebies-and-deals-133544366.html?l=1

(...)
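To see why nextSibling() is needed, the behaviour can be reproduced offline. jsoup's default HTML parser treats link as an empty (void) element, as it is in HTML, so the URL text becomes a sibling node rather than a child. The sketch below assumes a small hand-written feed fragment in place of the real one. The class name LinkSiblingSketch and its helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class LinkSiblingSketch {
    // Hand-written stand-in for one feed item (an assumption, not real Yahoo data).
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Example title</title>"
        + "<link>http://example.com/story</link>"
        + "<description>Example description</description>"
        + "</item></channel></rss>";

    // Text inside the <link> element itself, as the default HTML parser sees it.
    static String linkElementText(String xml) {
        Document doc = Jsoup.parse(xml); // default HTML parser
        Element link = doc.select("item").first().select("link").first();
        return link.text();
    }

    // Text of the node that follows <link>; this is where the URL ends up.
    static String linkFromSibling(String xml) {
        Document doc = Jsoup.parse(xml); // default HTML parser
        Element link = doc.select("item").first().select("link").first();
        Node next = link.nextSibling();
        return (next instanceof TextNode) ? ((TextNode) next).text().trim() : "";
    }

    public static void main(String[] args) {
        // Printing the parsed item shows where the URL was placed.
        System.out.println(Jsoup.parse(SAMPLE).select("item").first());
        System.out.println("inside <link>: '" + linkElementText(SAMPLE) + "'");
        System.out.println("after <link>:  '" + linkFromSibling(SAMPLE) + "'");
    }
}
```

The sketch reads the sibling through TextNode.text() instead of toString(). This is a design choice rather than a requirement: toString() returns the node's outer HTML, so characters such as & may come back escaped.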

answered Apr 14 '13 at 18:01

ollo

Thanks, that worked perfectly. I still don't understand why you need to use the nextSibling() method for the link. As far as I can see there is only one element in the link tag, so .first() should pick it up. Am I missing something here? – Faber Apr 15 '13 at 10:52
I don't know why, but jsoup doesn't parse the `<link>` tag correctly. So it's not `<link>url here</link>` but `<link /> url here`. The link tag is empty and its URL is *after* it. You can parse the website into a document and print it, and you'll see what I mean. However, in my example I select the (empty) link tag and get the text after it with `nextSibling()`. The `first()` method is needed because `select()` returns an instance of `Elements` (= a list of `Element`). – ollo Apr 15 '13 at 17:32
Cheers, that clears it up, thanks. I suppose if you're in doubt you should always parse the website to a document and print it to see what's going on. – Faber Apr 15 '13 at 22:21
@ollo I referred to your answer and got "link" perfectly. Thank you, very good explanation. Please see my question at http://stackoverflow.com/questions/17312544/issue-on-parsing-html-with-jsoup. There, I want the description alone. – Dhasneem Jun 26 '13 at 06:36
Friendly reminder for anybody coming through that jsoup does support proper XML parsing now (including a fix for the issue with `<link>` tags). See: https://stackoverflow.com/a/10158491/6425776 – DragShot Sep 26 '17 at 14:16
