I need help extracting the URL value from an HTML META tag using Jsoup. Here is my HTML:

String html = "<HTML><HEAD><...></...><META ......><META ......><META http-equiv=\"refresh\" content=\"1;URL='https://xyz.com/login.html?redirect=www.google.com'\"></HEAD></HTML>";

Expected output: "https://xyz.com/login.html?redirect=www.google.com"

Can anyone please tell me how to do that? Thanks.

Ravs

2 Answers

Assuming it's the first META tag:

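String html = ...

// Parse the markup and take the first META element
Document doc = Jsoup.parse(html);
Element eMETA = doc.select("META").first();
// content looks like "1;URL='...'": keep the part after the delay
String content = eMETA.attr("content");
String urlRedirect = content.split(";")[1];
// Take what follows "URL=" and drop the quotes
String url = urlRedirect.split("=")[1].replace("'","");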
Pete Houston
  • In my opinion this solution does not work: because there are two = characters in the content string, the URL it returns will be https://xyz.com/login.html?redirect. Also, if the redirect URL itself contains a ' character, the replace call deletes every occurrence of it. – mmx73 Jan 19 '16 at 14:57
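
The two problems raised in the comment above (splitting on every = and deleting every ') can be avoided by matching only the first "URL=" and stripping just one pair of surrounding quotes. The following is a minimal sketch using only the standard library; it assumes the string passed in is the value of the tag's content attribute (for example, what Jsoup's attr("content") returns). The class name MetaRefreshUrl and the method extractUrl are hypothetical names chosen for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: parses the value of a
// <meta http-equiv="refresh"> content attribute.
class MetaRefreshUrl {

    // "<delay>;URL=<target>": everything after the first "URL=" is captured,
    // so '=' characters inside the target are left untouched.
    private static final Pattern REFRESH = Pattern.compile(
            "^[^;,]*[;,]\\s*url\\s*=\\s*(.*?)\\s*$",
            Pattern.CASE_INSENSITIVE);

    public static Optional<String> extractUrl(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        String target = m.group(1);
        // Remove one pair of matching outer quotes only; quotes inside the URL stay.
        if (target.length() >= 2) {
            char first = target.charAt(0);
            char last = target.charAt(target.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                target = target.substring(1, target.length() - 1);
            }
        }
        return target.isEmpty() ? Optional.empty() : Optional.of(target);
    }

    public static void main(String[] args) {
        String content = "1;URL='https://xyz.com/login.html?redirect=www.google.com'";
        // prints https://xyz.com/login.html?redirect=www.google.com
        System.out.println(extractUrl(content).orElse("(no redirect URL)"));
    }
}
```

Returning an Optional rather than null makes the "no redirect present" case explicit for the caller; a content value such as "5", which only sets a delay, yields an empty result.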
With Java 8 and Jsoup this solution will work:

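import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document document = Jsoup.parse(yourHTMLString);
String url = document.select("meta[http-equiv=refresh]").stream()
                .findFirst()
                .map(meta -> {
                    try {
                        // content looks like "1;URL='...'": keep the part after the delay
                        String content = meta.attr("content").split(";")[1];
                        Pattern pattern = Pattern.compile(".*URL='?(.*)$", Pattern.CASE_INSENSITIVE);
                        Matcher m = pattern.matcher(content);
                        String redirectUrl = m.matches() ? m.group(1) : null;
                        // Drop only the single trailing quote, if present
                        return redirectUrl == null ? null :
                                redirectUrl.endsWith("'") ? redirectUrl.substring(0, redirectUrl.length() - 1) : redirectUrl;
                    } catch (Exception e) {
                        // Malformed content attribute (e.g. no ";" separator)
                        return null;
                    }
                }).orElse(null);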
mmx73