I am trying to crawl a web page that requires authentication. I can access the page in my browser when I am logged in. I am using the jsoup library (http://jsoup.org/) to parse HTML pages.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {

    public static void main(String[] args) throws IOException {
        // The URL must include the protocol (http://)
        Document doc = Jsoup.connect("http://www.secinfo.com/$/SEC/Filing.asp?T=r643.91Dx_2nx").get();

        // Print the page title
        String title = doc.title();
        System.out.println("title : " + title);

        // Print the href attribute of every link
        Elements links = doc.select("a");
        for (Element link : links) {
            System.out.println("\nlink : " + link.attr("href"));
        }
        System.out.println();
    }
}

Output :

title : SEC Info - Sign In

This returns the content of the sign-in page, not the page at the URL I am passing. I am registered on secinfo.com, and while running this program I am logged in from my default browser, Firefox.

nothing
  • You should check how secinfo requires authentication. Generally, authentication info goes into the HTTP headers. – Juned Ahsan Sep 21 '13 at 05:57
  • You'd have to interact with the login page and log in (fill in the form and press submit). Jsoup does not do that. I suggest HtmlUnit. If that is an option, let me know if you'd like an example using HtmlUnit. – acdcjunior Sep 21 '13 at 17:31

3 Answers

Being logged in through your default browser does not help. Your Java program is a separate process and does not share the browser's session.

On the other hand, secinfo requires authentication, and jsoup lets you pass authentication details.

It works for me when I pass the authentication details:

Please check this answer (Jsoup connection with basic access authentication)
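
If the site really uses HTTP Basic authentication, the credentials travel in an `Authorization` header. Below is a minimal sketch of building that header value with the JDK's `java.util.Base64`. The class name `BasicAuth` and the sample credentials are made up for illustration, and whether secinfo.com accepts Basic authentication at all is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuth {

    // Builds the value of an HTTP "Authorization" header for Basic authentication:
    // "Basic " + base64(username + ":" + password).
    public static String header(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder credentials, not real ones.
        System.out.println(header("myUserName", "myPass"));
    }
}
```

With jsoup, the value would be passed as `Jsoup.connect(url).header("Authorization", BasicAuth.header(user, pass)).get()`. If the server still returns the sign-in page, it most likely uses a form-based login with cookies rather than Basic authentication.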

dharam
  • Thanks for the reply, dharam. I am still getting the sign-in page. This is what I did: `String username = "myUserName"; String password = "myPass"; String login = username + ":" + password; String base64login = new String(Base64.encodeBase64(login.getBytes())); doc = Jsoup.connect("http://www.secinfo.com/$/SEC/Filing.asp?T=r643.91Dx_2nx").header("Authorization", "Basic " + base64login).get();` – nothing Sep 21 '13 at 07:12

Jsoup's connect() also supports post() with method chaining, if your target site's login mechanism works with a POST request:

Document doc = Jsoup.connect("url")
  .data("aUserName", "myUserName")
  .data("aPassword", "myPassword")
  .userAgent("Mozilla")
  .timeout(3000)
  .post();

But what if the page you are trying to get requires a cookie to be sent with each subsequent request? Try using HttpURLConnection with POST and read the cookie from the HTTP response headers. HttpClient will make this task easier for you. Use the library to fetch the web page as a string, then pass the string to Jsoup.parse() to get the document.
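
The cookie handling described above can be sketched with the JDK's own `java.net.http.HttpClient` (Java 11+) in place of HttpURLConnection or Apache HttpClient; that swap is an assumption made here, not something this answer prescribes. A `CookieManager` stores whatever cookies the login response sets and sends them back on later requests. The class name `FormLogin`, the URLs, and the field names `aUserName`/`aPassword` are placeholders; the real ones have to come from the site's login form.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormLogin {

    // Encodes form fields as application/x-www-form-urlencoded, keeping map order.
    public static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Posts the login form, then fetches the protected page with the same client,
    // so cookies set by the login response are sent along automatically.
    public static String fetchAfterLogin(String loginUrl, Map<String, String> fields,
                                         String pageUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest login = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
        client.send(login, HttpResponse.BodyHandlers.ofString());

        HttpRequest page = HttpRequest.newBuilder(URI.create(pageUrl)).GET().build();
        return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: FormLogin <loginUrl> <pageUrl>");
            return;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("aUserName", "myUserName"); // placeholder field names and values
        fields.put("aPassword", "myPassword");
        String html = fetchAfterLogin(args[0], fields, args[1]);
        System.out.println(html.length() + " characters fetched");
    }
}
```

The returned string can then be handed to `Jsoup.parse(html, pageUrl)` to get a `Document`.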

Sage

You have to sign in with a POST request and preserve the cookies you get back. That is where your session info is stored. I wrote an example here: Jsoup can't Login on Page. The website in that example is an exception: it already sets the session cookie on the login page. You can skip that step if logging in works for you without it.

The exact POST request differs from website to website. You have to dig it out of the HTML, or install a browser plugin and intercept the POST requests.
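
To make "preserve the cookies" concrete: the server sends each cookie in a `Set-Cookie` response header, and only the leading `name=value` pair has to be sent back. Below is a minimal sketch of collecting those pairs into a map. The class name `SessionCookies` and the sample header values are invented for illustration, and the parsing deliberately ignores attributes such as `Path` or `Expires`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SessionCookies {

    // Extracts the name=value pair from each Set-Cookie header value,
    // dropping attributes such as Path, Expires or HttpOnly.
    public static Map<String, String> fromSetCookieHeaders(List<String> headers) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : headers) {
            String pair = header.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Invented sample values standing in for a real login response.
        List<String> headers = List.of(
                "SESSIONID=abc123; Path=/; HttpOnly",
                "lang=en; Max-Age=3600");
        System.out.println(fromSetCookieHeaders(headers));
    }
}
```

jsoup does this bookkeeping itself: `Connection.Response.cookies()` returns such a map after `execute()`, and `Connection.cookies(Map)` attaches it to the next request. A helper like this is only needed when the login request is made with a different HTTP client.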

Peter Ambruzs