Jsoup parse link

Question

I want to extract links from HTML using jsoup.

Expected output: absolute link.

I use "abs:href" for that.

This works:

Jsoup.parse("<a \n\r\t  href=\"http://www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");

delivers: http://www.ibm.com/123/?id=abc

This doesn't work:

Jsoup.parse("<a \n\r\t  href=\"www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");

delivers: http://www.ibm.com/www.ibm.com/123/?id=abc

I know it's kind of difficult to tell whether "www.ibm.com" is an absolute or a relative link. It might be a domain, but it could also be a folder name. Are there any proven solutions? The only thing that comes to mind is this hack:

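// Strip the scheme from the base URL, leaving e.g. "www.ibm.com"
String domain = url.replace("http://", "");
// link holds the resolved abs:href value; replace() returns a new String, so assign it
link = link.replace(domain + "/" + domain, domain);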

Technically, a link such as `` is **wrong**. When opened in a web browser, it would only open `http://example.com/current/page.html/www.abc.com` and not `http://www.abc.com`. The original HTML page author definitely has to fix it. — BalusC, Dec 21 '12 at 15:39

score 0 · Answer 1 · answered Dec 17 '12 at 05:52

Your second example is unambiguously a relative URL. An absolute URL, by definition, starts with a protocol (e.g. http or https). All browsers will give the same output for your example.
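To see the same resolution outside jsoup, here is a minimal sketch using the JDK's `java.net.URI`. The class name `RelativeLinkDemo` and the helper `resolve` are made up for illustration. The base URL is written with a trailing slash, because `URI.resolve` does not handle a base with an empty path well.

```java
import java.net.URI;

class RelativeLinkDemo {
    // Resolve href against base using the JDK's standard reference resolution.
    // (Hypothetical helper, only for this illustration.)
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ibm.com/";

        // Has a scheme, so it is absolute and comes back unchanged.
        System.out.println(resolve(base, "http://www.ibm.com/123/?id=abc"));

        // No scheme, so it is treated as a path relative to the base.
        System.out.println(resolve(base, "www.ibm.com/123/?id=abc"));
    }
}
```

The second call should print `http://www.ibm.com/www.ibm.com/123/?id=abc`, matching what jsoup delivers in the question.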

Can you provide an example URL that you're working with? Why does it have these pseudo-absolute URLs?
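If the pages can't be fixed at the source, one possible workaround is to patch such hrefs before resolving them. The sketch below is only a heuristic, not a standard: it assumes that any href starting with `www.` was meant as a host name and that `http://` is the right scheme. The class `HrefNormalizer` and its methods are hypothetical names. A folder that really is called `www.something` would be misread.

```java
import java.net.URI;

class HrefNormalizer {
    // Heuristic (an assumption, not part of any spec): a scheme-less href
    // that starts with "www." is treated as a host name and gets "http://".
    static String normalize(String href) {
        String trimmed = href.trim();
        if (trimmed.regionMatches(true, 0, "www.", 0, 4)) {
            return "http://" + trimmed;
        }
        return trimmed;
    }

    // Normalize first, then resolve against the base as usual.
    static String toAbsolute(String base, String href) {
        return URI.create(base).resolve(normalize(href)).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ibm.com/";
        System.out.println(toAbsolute(base, "www.ibm.com/123/?id=abc"));
        System.out.println(toAbsolute(base, "folder/page.html"));
    }
}
```

With jsoup, the same `normalize` step could be applied to the raw `href` attribute value before resolving it, instead of reading `abs:href` directly.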
