Note: I've replaced the last 5 chars of the session IDs with 'x's for obvious reasons

I'm scraping a web site. I can see, in the browser, that logging in sets a cookie value called PHPSESSID. No problem, I can scrape that:

superagent
    .post(loginUrl)
    .send(loginDetails)
    .end(function(err, res){
        // Take the first Set-Cookie header and pull PHPSESSID out of it
        var setCookieValue = res.headers['set-cookie'][0]
        var sessionID = cookieParser.parse(setCookieValue).PHPSESSID
        console.log(sessionID)
    })

Returns:

37c3bog3tf6erp2i6ss5vxxxxx

Which looks like a PHP session ID. Great! Now to use the session ID:

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID'=sessionID)
.end(err, res)

Redirects me to the login page. But the session ID I got manually from the browser, in the exact same format, works fine:

var fakeSessionID = 'a1oslk341uoht8p6009q5xxxxx'
superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID='+fakeSessionID)

Will return the loggedInURL, with the full HTML of a logged in user.

Why isn't the session ID I'm scraping working?

  • The format is identical
  • The character count is the same (26 characters)

There is nothing aside from the session ID that's different between the working and non-working code.

What could be making the difference?

mikemaccana

3 Answers

PHP has some dubious extra security options for sessions, such as checking the Referer header (the `session.referer_check` setting).

Some sites may additionally check User-Agent.
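As a rough sketch of what this means for the scraper: the headers a browser sends alongside the cookie can be reproduced on the superagent request. The helper name `buildBrowserLikeHeaders` is made up for illustration, and whether this particular site checks either header is an assumption, not something confirmed here.

```javascript
// Hypothetical helper: builds the headers a browser would normally send
// together with the session cookie. Which of these the site actually
// checks is an assumption; compare against the browser's network tab.
function buildBrowserLikeHeaders(sessionID, referer, userAgent) {
    var headers = { 'Cookie': 'PHPSESSID=' + sessionID }
    if (referer) headers['Referer'] = referer
    if (userAgent) headers['User-Agent'] = userAgent
    return headers
}

// Usage (superagent's .set() also accepts an object of header fields):
// superagent.get(loggedInURL).set(buildBrowserLikeHeaders(sessionID, loginUrl, browserUA)).end(...)
```

Here `browserUA` would be the User-Agent string copied from the browser in which the manual session ID worked.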

Kornel
  • I have switched to using superagent's persistent option, e.g. `var agent = superagent.agent()`. This takes care of referers, cookies, and other persistence-related matters for me. Since the answer above was the closest to that solution I'm marking it as correct and giving you this emoji corn: – mikemaccana Jun 22 '15 at 14:14
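The agent approach described in this comment could look roughly like the sketch below. The function name `fetchLoggedInPage` and its parameters are hypothetical; the `agent` argument is assumed to be the object returned by `superagent.agent()`, which keeps cookies from earlier responses and sends them on later requests.

```javascript
// Sketch: log in and fetch the protected page through one agent, so that
// cookies set by the login response are reused on the next request.
// `agent` is expected to be the result of superagent.agent().
function fetchLoggedInPage(agent, loginUrl, loginDetails, loggedInURL, callback) {
    agent
        .post(loginUrl)
        .send(loginDetails)
        .end(function (err) {
            if (err) return callback(err)
            // Same agent, so no manual Cookie header is needed here
            agent
                .get(loggedInURL)
                .end(function (err, res) {
                    if (err) return callback(err)
                    callback(null, res.text)
                })
        })
}

// Usage, assuming superagent is installed:
// var superagent = require('superagent')
// fetchLoggedInPage(superagent.agent(), loginUrl, loginDetails, loggedInURL, function (err, html) { ... })
```

Passing the agent in as an argument keeps the function easy to exercise with a stub, and avoids parsing `set-cookie` by hand.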

You might try throwing a different user-agent attribute in the header in the call to superagent for both GET and POST:

  .set('User-Agent','Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0')
Michael Blankenship

Your code looks like you aren't replacing the string "sessionID" with the actual sessionID value...

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID=sessionID')
.end(err, res)

Shouldn't it be something like this?

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID='+sessionID)
.end(err, res)

I think...

Matt Fellows
  • You're right, but that was an error made when making a test case for StackOverflow. My real code has the quote in the right place. Thanks for pointing this out though. – mikemaccana Jun 21 '15 at 22:40