I'm currently working on a Java desktop app for a company, and they have asked me to extract the five latest articles from a web page and display them in the app. To do this I need an HTML parser, of course, and I immediately thought of Jsoup. My problem is how exactly to do it. I found one easy example in this question: How to “scan” a website (or page) for info, and bring it into my program?

It uses this code:

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

This code was written by BalusC and I understand it, but how do I do it when the links are not fixed, which is the case for most newspapers, for example? For the sake of simplicity, how would I go about extracting the five latest articles from this news page: News? I can't use an RSS feed, as my boss wants the complete articles to be displayed.

Wojciech Wirzbicki
Laz22434
  • Can you try with the [Hacker News](https://news.ycombinator.com/) link? By the way, what error are you getting? – Gauthaman Sahadevan Mar 29 '18 at 05:55
  • Scroll to the link that says [RSS](https://globalnews.ca/pages/feeds/) and use [RSS](https://en.wikipedia.org/wiki/RSS). In fact, I should post this as **the answer**. Oh, and [here](https://globalnews.ca/world/feed/) is the world feed. – Elliott Frisch Mar 29 '18 at 05:55
  • Thank you Gauthaman and Elliott, but I already thought about this and my boss doesn't want an RSS feed; he wants all five articles complete, not as previews the way they are displayed in RSS feeds. – Laz22434 Mar 29 '18 at 06:33

1 Answer

First you need to download the main page:

    Document doc = Jsoup.connect("https://globalnews.ca/world/").get();

Then select the links you are interested in, for example with CSS selectors. The selector below picks every a tag whose href contains the text globalnews and that is a direct child of an h3 tag with class story-h. The URLs are in the href attribute of each a tag.

    for(Element e: doc.select("h3.story-h > a[href*=globalnews]")) {
        System.out.println(e.attr("href"));
    }

You can then process the resulting URLs as you wish. For example, download the content of the first five of them using the same syntax as in the first line.
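
Putting the steps together might look roughly like the sketch below. It is only an illustration under a few assumptions: the class and method names (LatestArticles, extractArticleUrls, fetchArticleText) are made up for this example, the listing page is assumed to show the newest stories first, and the article-body selector "article p" is a guess that you would need to replace after inspecting a real article page in your browser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LatestArticles {

    // Link selector from the snippet above; it depends on the site's current markup.
    static final String LINK_SELECTOR = "h3.story-h > a[href*=globalnews]";

    // ASSUMPTION: hypothetical selector for the article text, not taken from the real site.
    static final String BODY_SELECTOR = "article p";

    /** Collects up to 'limit' distinct article URLs, in the order they appear on the page. */
    static List<String> extractArticleUrls(Document listing, int limit) {
        Set<String> urls = new LinkedHashSet<>(); // keeps page order, drops duplicates
        for (Element link : listing.select(LINK_SELECTOR)) {
            if (urls.size() >= limit) {
                break;
            }
            String url = link.absUrl("href"); // empty string if no absolute URL can be built
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return new ArrayList<>(urls);
    }

    /** Downloads one article page and returns the text matched by BODY_SELECTOR. */
    static String fetchArticleText(String url) throws IOException {
        Document article = Jsoup.connect(url).get();
        return article.select(BODY_SELECTOR).text();
    }

    public static void main(String[] args) throws IOException {
        Document listing = Jsoup.connect("https://globalnews.ca/world/").get();
        for (String url : extractArticleUrls(listing, 5)) {
            System.out.println(url);
            System.out.println(fetchArticleText(url));
            System.out.println();
        }
    }
}
```

Collecting the URLs into a LinkedHashSet is a design choice for the case where the same story is linked more than once on the listing page; if that never happens, a plain list with a counter would do as well.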

Luk