I want to extract links from HTML using jsoup.

Expected output: absolute links.

I use "abs:href" for that.

This works:

Jsoup.parse("<a \n\r\t  href=\"http://www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");

delivers: http://www.ibm.com/123/?id=abc

This doesn't work:

Jsoup.parse("<a \n\r\t  href=\"www.ibm.com/123/?id=abc\">\nhaha</a>", "http://www.ibm.com");

delivers: http://www.ibm.com/www.ibm.com/123/?id=abc

I know it's hard to tell whether "www.ibm.com" is meant as an absolute or a relative link. It could be a host name, but it could also be a folder name. Are there any proven solutions? The only hack that comes to mind is this:

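// baseUrl is the base URI passed to Jsoup.parse, link is the abs:href result
String domain = baseUrl.replace("http://", ""); // "www.ibm.com"
// Strings are immutable, so keep the result; the duplicate is separated by "/"
String fixed = link.replace(domain + "/" + domain, domain);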
user1782357

1 Answer

Your second example is unambiguously a relative URL. An absolute URL, by definition, starts with a protocol (e.g. http or https). All browsers will give the same output for your example.
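
To illustrate the rule without jsoup, here is a minimal sketch using the JDK's java.net.URI. The class name UrlKindDemo and the helper resolve are made up for this example, and the base URL carries a trailing slash, which is an assumption added here because java.net.URI resolves more predictably against a base that has a path.

```java
import java.net.URI;

public class UrlKindDemo {
    // Hypothetical helper: resolve href against base using java.net.URI.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ibm.com/"; // trailing slash is an assumption of this sketch

        // No scheme, so the URI is relative even though it looks like a host name.
        System.out.println(URI.create("www.ibm.com/123/?id=abc").isAbsolute());        // false
        System.out.println(URI.create("http://www.ibm.com/123/?id=abc").isAbsolute()); // true

        // A relative reference is appended to the base path.
        System.out.println(resolve(base, "www.ibm.com/123/?id=abc"));
        // An absolute reference is returned unchanged.
        System.out.println(resolve(base, "http://www.ibm.com/123/?id=abc"));
    }
}
```

The third line printed is http://www.ibm.com/www.ibm.com/123/?id=abc, the same result you get from "abs:href".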

Can you provide an example URL that you're working with? Why does it have these pseudo-absolute URLs?
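
If the pseudo-absolute links cannot be fixed at the source, one possible workaround is to rewrite each href before resolving it. The sketch below is only a heuristic and rests on a loud assumption: any href that starts with "www." is meant as a host name, not a folder. HrefNormalizer and addMissingScheme are hypothetical names, not part of jsoup.

```java
public class HrefNormalizer {
    // Heuristic (assumption): an href starting with "www." was meant to be
    // an absolute http URL. Everything else is returned trimmed but unchanged.
    static String addMissingScheme(String href) {
        String trimmed = href.trim();
        // Case-insensitive check for a leading "www."
        if (trimmed.regionMatches(true, 0, "www.", 0, 4)) {
            return "http://" + trimmed;
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(addMissingScheme("www.ibm.com/123/?id=abc"));
        System.out.println(addMissingScheme("/123/?id=abc"));
        System.out.println(addMissingScheme("http://www.ibm.com/123/?id=abc"));
    }
}
```

This would misfire on a site that really has a folder whose name starts with "www.", so it is only reasonable when you know the source produces such links.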

Jonathan Hedley