52

I'm trying to use jsoup to log in to a site and then scrape information, but I'm running into a problem: I can log in successfully and create a Document from index.php, but I cannot get other pages on the site. I know I need to store a cookie after I post and then send it when I open another page on the site, but how do I do this? The following code lets me log in and get index.php:

Document doc = Jsoup.connect("http://www.example.com/login.php")
               .data("username", "myUsername", 
                     "password", "myPassword")
               .post();

I know I can use Apache HttpClient to do this, but I don't want to.

Jochem
Gwindow
  • Did that code work for you to log in and crawl info from a website? In my case it is not working. – lucifer Jan 26 '15 at 14:26
  • You can see my code here: http://stackoverflow.com/questions/28110219/how-to-crawl-a-website-after-login-in-it-with-username-and-password?noredirect=1#comment44615745_28110219 – lucifer Jan 26 '15 at 15:39

6 Answers

111

When you log in to the site, it is probably setting an authorised session cookie that needs to be sent on subsequent requests to maintain the session.

You can get the cookie like this:

Connection.Response res = Jsoup.connect("http://www.example.com/login.php")
    .data("username", "myUsername", "password", "myPassword")
    .method(Method.POST)
    .execute();

Document doc = res.parse();
String sessionId = res.cookie("SESSIONID"); // you will need to check what the right cookie name is

And then send it on the next request like:

Document doc2 = Jsoup.connect("http://www.example.com/otherPage")
    .cookie("SESSIONID", sessionId)
    .get();
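
One way to find the right cookie name is to look at the `Set-Cookie` headers of the login response, for example in the browser's developer tools. The sketch below uses only the JDK's `java.net.HttpCookie` to show what such a header contains. The header value, the class name `SetCookieSketch`, and the helper `parseFirst` are made up for illustration; a real site will use its own names and values. It also shows that `HttpOnly` is just an attribute on the header: it restricts browser JavaScript, and an HTTP client still receives the cookie in the response.

```java
import java.net.HttpCookie;
import java.util.List;

final class SetCookieSketch {
    // Hypothetical header; a real one comes from the site's login response.
    static final String SAMPLE = "Set-Cookie: PHPSESSID=abc123; Path=/; HttpOnly";

    // Parses a Set-Cookie header line and returns the first cookie it defines.
    static HttpCookie parseFirst(String header) {
        List<HttpCookie> cookies = HttpCookie.parse(header);
        return cookies.get(0);
    }

    public static void main(String[] args) {
        HttpCookie cookie = parseFirst(SAMPLE);
        // The name printed here is what would be passed to .cookie(name, value).
        System.out.println(cookie.getName() + " = " + cookie.getValue()
                + " (HttpOnly: " + cookie.isHttpOnly() + ")");
    }
}
```

The name found this way replaces the `"SESSIONID"` placeholder used above.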
Jonathan Hedley
  • @Jonathan Hedley, since you created jsoup and it is very helpful, please help me with this: http://stackoverflow.com/questions/20908946/jsoup-adding-extra-encoded-stuff-for-an-html There are additional &lt &gt encodings at the end of the iframe no matter what I do. Thanks, Swaraj – Swaraj Chhatre Jan 03 '14 at 19:05
  • but how to get HttpOnly cookies? – iAmLearning Jan 06 '14 at 10:27
  • Please clarify how to check "// you will need to check what the right cookie name is" – vikramvi Sep 02 '22 at 07:10
  • res.cookies(); gave me {pgv=2, alp=VROL, aa=364476%7C813843989%7C196006251, arl=205118481, currency=INR, magnitude=LC, PERMA-ALERT=0, alo=deleted}, but when I pass this as a parameter to another URL, it is not getting logged in – vikramvi Sep 02 '22 at 08:47
  • Your solution is not working; can you please check https://stackoverflow.com/questions/73572751/jsoup-login-cookies-are-not-working-with-sub-pages-but-only-with-home-page – vikramvi Sep 02 '22 at 08:48
  • I'm getting 2 sets of cookies as per https://stackoverflow.com/questions/35000911/login-to-a-website-using-jsoup-and-stay-on-the-site; which one should I use for the sub-page? {PHPSESSID=iulmm77ir2ckid32euviubb0ad, currency=INR, magnitude=LC, ad=d8eedce649453f70f2ae83d02ef48a32426bfc9d, wec=291312787, nobtlgn=626283907, pgv=1} {pgv=2, alp=VROL, aa=364476%7C244856479%7C600763075, arl=847554676, currency=INR, magnitude=LC, PERMA-ALERT=0, alo=deleted} – vikramvi Sep 02 '22 at 10:55
19
// This will get you the response.
Connection.Response res = Jsoup
    .connect("loginPageUrl")
    .data("loginField", "login@login.com", "passField", "pass1234")
    .method(Method.POST)
    .execute();

//This will get you cookies
Map<String, String> loginCookies = res.cookies();

//And this is the easiest way I've found to remain in session
Document doc = Jsoup.connect("urlYouNeedToBeLoggedInToAccess")
      .cookies(loginCookies)
      .get();
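
Some sites appear to set cookies on more than one response, for example a session cookie when the login page is first loaded and further cookies after the login POST. A minimal sketch of combining two such cookie maps is shown below, using only the JDK. The class name `CookieMergeSketch`, its helper methods, and the cookie values are placeholders for illustration, and whether a given site actually needs cookies from both responses is an assumption to verify against the site itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class CookieMergeSketch {
    // Combines two cookie maps; on a name clash the value from 'later' wins.
    static Map<String, String> merge(Map<String, String> earlier, Map<String, String> later) {
        Map<String, String> merged = new LinkedHashMap<>(earlier);
        merged.putAll(later);
        return merged;
    }

    // Renders the map the way it travels in an HTTP "Cookie" request header.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Placeholder values standing in for two res.cookies() results.
        Map<String, String> fromLoginPage = new LinkedHashMap<>();
        fromLoginPage.put("PHPSESSID", "abc123");
        fromLoginPage.put("pgv", "1");

        Map<String, String> fromLoginPost = new LinkedHashMap<>();
        fromLoginPost.put("pgv", "2");
        fromLoginPost.put("alp", "VROL");

        // prints "PHPSESSID=abc123; pgv=2; alp=VROL"
        System.out.println(toCookieHeader(merge(fromLoginPage, fromLoginPost)));
    }
}
```

The merged map has the same `Map<String, String>` type as `loginCookies`, so it could be passed to `.cookies(...)` in the same way.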
Jonathan Hedley
  • It's not working now. I am struggling to log in and scrape a Facebook account. Facebook has now introduced some more parameters: lsd:AVptuGRS email:*** pass:*** default_persistent:0 timezone:-120 lgnrnd:043627_eQnN lgnjs:1383914188 locale:en_US. Check this link: http://stackoverflow.com/questions/19851747/login-facebook-via-jsoup – Vishwajit R. Shinde Sep 10 '15 at 15:03
  • Hey man, I did it like you said, but I am not getting the web page of "urlYouNeedToBeLoggedInToAccess". Please answer me. – Kumaresan Perumal May 05 '16 at 13:34
  • Not working for me. `org.jsoup.HttpStatusException: HTTP error fetching URL. Status=400,` – Avinash Dec 13 '17 at 08:08
  • Your solution is not working; can you please check https://stackoverflow.com/questions/73572751/jsoup-login-cookies-are-not-working-with-sub-pages-but-only-with-home-page – vikramvi Sep 02 '22 at 07:11
1

Where the code was:

Document doc = Jsoup.connect("urlYouNeedToBeLoggedInToAccess").cookies().get(); 

I was having difficulties until I changed it to:

Document doc = Jsoup.connect("urlYouNeedToBeLoggedInToAccess").cookies(cookies).get();

Now it is working flawlessly.

Lee Taylor
0

Here is what you can try...

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Post the login form and keep the response so its cookies can be read later
Connection.Response res = null;
try {
    res = Jsoup
            .connect("http://www.example.com/login.php")
            .data("username", "your login id", "password", "your password")
            .method(Connection.Method.POST)
            .execute();
} catch (IOException e) {
    e.printStackTrace();
}

Now save all your cookies and make a request to the other page you want.

// Store cookies (res is still null here if the login request above failed)
Map<String, String> cookies = res.cookies();

Making request to another page.

try {
    Document doc = Jsoup.connect("your-second-page-link").cookies(cookies).get();
}
catch(Exception e){
    e.printStackTrace();
}

Ask if further help needed.

iamvinitk
  • Your solution is not working; can you please check https://stackoverflow.com/questions/73572751/jsoup-login-cookies-are-not-working-with-sub-pages-but-only-with-home-page – vikramvi Sep 02 '22 at 08:44
0
// Connect to the server with the login details
Connection.Response res = Jsoup.connect("http://www.example.com/login.php")
    .data("username", "myUsername")
    .data("password", "myPassword")
    .method(Connection.Method.POST)
    .execute();
// Parse the page returned after login (the redirect target, if any)
Document doc = res.parse();
// Store the cookies from the login response in cooki
Map<String, String> cooki = res.cookies();
// Fetch the required page, sending the stored cookies
Document docs = Jsoup.connect("http://www.example.com/otherPage")
    .cookies(cooki)
    .get();
Sandesh
  • Welcome to SO. Please read [how-to-answer](https://stackoverflow.com/help/how-to-answer) before posting an answer. What does that block of code mean? – fjsv Jun 12 '20 at 14:30
  • While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – β.εηοιτ.βε Jun 12 '20 at 19:32
0

Why reconnect? If the site sets cookies that are needed to avoid a 403 status, I reuse the same connection like this:

Document doc = null;
int statusCode = -1;
String statusMessage = null;
String strHTML = null;

try {
    // connect one time.
    Connection con = Jsoup.connect(urlString);
    // get response.
    Connection.Response res = con.execute();
    // get cookies
    Map<String, String> loginCookies = res.cookies();

    // print cookie content and status message
    if (loginCookies != null) {
        for (Map.Entry<String, String> entry : loginCookies.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue() + "\n");
        }
    }

    statusCode = res.statusCode();
    statusMessage = res.statusMessage();
    System.out.print("Status CODE\n" + statusCode + "\n\n");
    System.out.print("Status Message\n" + statusMessage + "\n\n");

    // set login cookies to connection here
    con.cookies(loginCookies).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0");

    // now do whatever you want, get document for example
    doc = con.get();
    // get HTML
    strHTML = doc.head().html();

} catch (org.jsoup.HttpStatusException hse) {
    hse.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}