Use Jsoup to get all href values from a specific class

Question

I am trying to parse my university's website to get a list of news items (title + link) from the main page. However, because I'm parsing a full page, the links I'm looking for are nested deep inside other classes, tables, etc. Here's the code I tried:

String url = "http://www.portal.pwr.wroc.pl/index,241.dhtml";
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("table.cwrapper .tbody .tr td.ccol2 div.cwrapper_padd div#box_main_page_news.cbox.grey div#dyn_main_news.cbox.padd2 div.nitem table.nitemt .tbody .tr td.nitemcell2 span.title_1");
ArrayList<String> listOfLinks = new ArrayList<String>();
int counter = 0;

for (Element link : links) {
    listOfLinks.add(link.text());
}

But it doesn't work. Is there a better way to get the href values and titles of all those links, given that each of them is wrapped like this:

<span class="title_1">
    <a href="Link Address">Link Title</a>
</span>

Maybe some kind of loop that iterates over all of those tags and takes the values from them?

Thanks for your help :-)

Why not simply do, `doc.select("a[href]");` and then call `.attr("href")` and `.text()` on each Element in the Elements returned by the selection? — Hovercraft Full Of Eels, Sep 03 '16 at 00:58

score 4 · Accepted Answer · answered Sep 03 '16 at 02:10

Your main problem is that the information you're looking for does not exist at the URL you're using; it is at http://www.portal.pwr.wroc.pl/box_main_page_news,241.dhtml?limit=10.
You should first get that page and then use this (it's a combination of Hovercraft's and Andrei Volgin's answers):

String url = "http://www.portal.pwr.wroc.pl/box_main_page_news,241.dhtml?limit=10";
String baseURL = "http://www.portal.pwr.wroc.pl/";
Document doc = Jsoup.connect(url).get();
Elements links = doc.select(".title_1 > a");
for (Element link : links) {
    System.out.println("Title - " + link.text());
    System.out.println(baseURL + link.attr("href"));
}
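
Concatenating `baseURL` with the `href` value assumes every link is relative and has no leading slash. A minimal sketch of a more forgiving approach, assuming only the standard library's `java.net.URI`, is to resolve each `href` against the URL of the page it came from. The class name `ResolveLinks` and the sample `href` values are made up for illustration and are not taken from the real page.

```java
import java.net.URI;

public class ResolveLinks {
    // Resolve an href (relative or absolute) against the URL of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.portal.pwr.wroc.pl/box_main_page_news,241.dhtml?limit=10";
        // Hypothetical href values, used only to show the three cases.
        System.out.println(resolve(page, "example,242.dhtml"));       // relative path
        System.out.println(resolve(page, "/example,242.dhtml"));      // root-relative path
        System.out.println(resolve(page, "http://example.com/news")); // already absolute
    }
}
```

Jsoup can do the same resolution itself: `link.absUrl("href")` resolves the attribute against the document's base URI, which `Jsoup.connect(url).get()` sets to the fetched URL.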

Well, I downloaded the page and saw that it didn't contain a `title_1` element. Then I opened the browser's developer tools and saw that there are multiple `get/post` requests when loading the main page. Luckily, it was the second request. — TDG, Sep 03 '16 at 08:44

score 0 · Answer 2 · answered Sep 03 '16 at 00:47

You need to find the least complex unique selector that selects the right elements. In your case the solution is very simple:

doc.select(".title_1 > a")
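
To see what this selector matches without hitting the network, here is a rough sketch that runs it against an inline HTML string. The fragment, the base URI `http://example.com/index.dhtml`, and the class name `TitleLinks` are made up for illustration; only the `span.title_1 > a` shape comes from the question. It assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class TitleLinks {
    // Map link text -> absolute URL for every <a> that is a direct child of a .title_1 element.
    static Map<String, String> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Map<String, String> result = new LinkedHashMap<>();
        for (Element a : doc.select(".title_1 > a")) {
            result.put(a.text(), a.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the markup in the question.
        String html =
                "<div class=\"nitem\"><span class=\"title_1\"><a href=\"news,1.dhtml\">First</a></span></div>"
              + "<div class=\"nitem\"><span class=\"title_1\"><a href=\"/news,2.dhtml\">Second</a></span></div>"
              + "<a href=\"other.html\">Not a news link</a>";
        extract(html, "http://example.com/index.dhtml")
                .forEach((title, url) -> System.out.println(title + " -> " + url));
    }
}
```

The link outside any `title_1` element is skipped, which is the difference between this selector and the broader `a[href]`.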

Andrei Volgin

score 0 · Answer 3 · answered Sep 03 '16 at 01:01

Why not simply do `doc.select("a[href]");` and then call `.attr("href")` and `.text()` on each Element in the Elements returned by the selection?

For example:

String path = "http://www.portal.pwr.wroc.pl/index,241.dhtml";
int timeoutMillis = 10 * 1000;
try {
    URL url = new URL(path);
    Document doc = Jsoup.parse(url, timeoutMillis);

    Elements selections = doc.select("a[href]");
    String format = "%-40s %s%n";
    for (Element element : selections) {
        System.out.printf(format, element.attr("href"), element.text());
    }

} catch (IOException e) {
    e.printStackTrace();
}

What's the point of going through all the links, when it's so easy to select only those 7 links that the OP wants? — Andrei Volgin, Sep 03 '16 at 04:42
