42

I'm trying to use NodeJS to scrape a website that requires a login by POST. Then once I'm logged in I can access a separate webpage by GET.

The first problem right now is logging in. I've tried to use request to POST the login information, but the response I get does not appear to be logged in.

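var request = require('request');

exports.getstats = function (req, res) {
    request.post({url : requesturl, form: lform}, function(err, response, body) {
        if (err) {
            res.writeHead(502, {"Content-Type": "text/plain"});
            return res.end("Login request failed");
        }
        // Forward the page returned by the login POST
        res.writeHead(200, {"Content-Type": "text/html"});
        res.write(body);
        res.end();
    });
};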

Here I'm just forwarding the page I get back, but the page I get back still shows the login form, and if I try to access another page it says I'm not logged in.

I think I need to maintain the client side session and cookie data, but I can find no resources to help me understand how to do that.


As a follow-up: I ended up using zombiejs to get the functionality I needed.

Bill the Lizard
Ryan
  • Can you post your example using zombiejs? I'm stuck with your same problem, session is not kept by node.js using request. – Ñhosko Aug 01 '18 at 14:51

3 Answers

48

You need to make a cookie jar and use the same jar for all related requests.

 var cookieJar = request.jar();
 request.post({url : requesturl, jar: cookieJar, form: lform}, ...

That should, in theory, let you scrape pages with GET as a logged-in user, but only once the login itself works. Based on your description of the response to your login POST, the login may not be succeeding yet, so the cookie jar won't help until that is fixed.
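A minimal sketch of the whole flow, assuming the site accepts a plain form POST for login. The `request` module is passed in as a parameter only so the sketch can be tried with a stub; `loginThenFetch`, `loginUrl`, `pageUrl` and the shape of `opts` are illustrative names, not part of the library.

```javascript
// Sketch: log in with POST, then GET a protected page, sharing one cookie jar.
// In real use, `request` would be require('request').
function loginThenFetch(request, opts, callback) {
    var cookieJar = request.jar(); // one jar for the whole session

    request.post({ url: opts.loginUrl, jar: cookieJar, form: opts.form },
        function (err, response) {
            if (err) { return callback(err); }

            // Cookies set by the login response are stored in cookieJar and
            // are sent again because the same jar is passed to the next call.
            request.get({ url: opts.pageUrl, jar: cookieJar },
                function (err, response, body) {
                    if (err) { return callback(err); }
                    callback(null, body);
                });
        });
}
```

It could be called as `loginThenFetch(require('request'), {loginUrl: requesturl, pageUrl: statsurl, form: lform}, function (err, body) { ... })`, where `statsurl` is a hypothetical name for the page fetched after login.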

Peter Lyons
  • This helped a lot. I'm still unsuccessful at logging in, like you said. I can't seem to figure out what's missing. I'm using POST and adding the form data, but for some reason the PHP server I connect to never replies with the "logged in" page. I'm going to keep working on it – Ryan Nov 12 '13 at 20:39
  • 2
    Yes that depends on their server side code which you are basically reverse engineering. Are you sure you are handling CSRF tokens if they use them? Do they perhaps sniff user agent? Do you have to GET / to get a session cookie first then make sure your POST to login includes that cookie? Etc. – Peter Lyons Nov 12 '13 at 20:45
  • I added headers for user agent, but I think they use client side scripts for logging in. In one of my responses I read this "Please Enable Scripts To Log In". I'm stuck now :( – Ryan Nov 12 '13 at 21:00
  • Use chrome Dev tools to examine the http requests directly and try to simulate that in your code. – Peter Lyons Nov 12 '13 at 21:11
  • That's what I've been doing so far, and it's been really useful, but I think that there's some javascript shenanigans for the login system. I'm going to rethink my approach. – Ryan Nov 12 '13 at 21:21
  • One might need to look at https://github.com/request/request#requestjar to understand this answer. @PeterLyons seems not to have explained request jar accurately but after studying the link above and converting my request to var query = request.defaults({jar: true}); then using query to make my post solved things for me. – Kennedy Nyaga Jan 19 '17 at 19:35
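One of the checks raised in the comments above is the CSRF token: GET the login page first, read the hidden token field from its HTML, then include that value in the login POST while reusing the same cookie jar. A rough sketch of the extraction step follows. The function name `extractHiddenField` and the field name `_csrf` are hypothetical, and the regular expressions only handle simple quoted attributes, so a real HTML parser may be the safer choice.

```javascript
// Sketch: read a hidden form field (for example a CSRF token) from page HTML.
// Returns the field's value, '' if it has no value attribute, or null if the
// field is not found.
function extractHiddenField(html, fieldName) {
    // Collect every <input ...> tag, then inspect its name and value attributes.
    var inputs = html.match(/<input\b[^>]*>/gi) || [];
    for (var i = 0; i < inputs.length; i++) {
        var name = /\bname\s*=\s*["']([^"']*)["']/i.exec(inputs[i]);
        var value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(inputs[i]);
        if (name && name[1] === fieldName) {
            return value ? value[1] : '';
        }
    }
    return null;
}
```

Assuming the site uses such a field, the returned value would be added to the login form, for example `lform._csrf = extractHiddenField(body, '_csrf')`, before the POST is sent.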
15

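request.jar() didn't work for me, so I take the cookies from the login response headers and send them with the next request, like this:

request.post({
    url: 'https://exampleurl.com/login',
    form: {"login":"xxxx", "password":"xxxx"}
}, function(error, response, body){
    if (error) { return console.error(error); }

    // Forward only the cookies from the login response. The option is
    // `headers` (plural); Set-Cookie attributes such as Path are dropped.
    var cookies = (response.headers['set-cookie'] || []).map(function (c) {
        return c.split(';')[0];
    }).join('; ');

    request.get({
        url:"https://exampleurl.com/logged",
        headers: { Cookie: cookies }
    },function(error, response, body){
        // The full html of the authenticated page
        console.log(body);
    });
});

Actually, this way is working fine. =D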

sampathsris
Henrique Rotava
  • Great tip. Solved my similar case where just sending the cookie was not enough. – Bob Aug 08 '15 at 17:28
1

Request manages cookies between requests if you enable it:

Cookies are disabled by default (else, they would be used in subsequent requests). To enable cookies, set jar to true (either in defaults or options).

const request = require('request').defaults({jar: true})
request('http://www.google.com', function () {
  request('http://images.google.com')
});
Stephane