$username = 'emp';
$pass = 'emp';

$login = array(
    'username' => $username,
    'password' => $pass
);

$loginUrl = 'http://demo.smartjobboard.com/login';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $loginUrl);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($login));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$content = curl_exec($ch);


echo $content;

I used smartjobboard.com as an example to test my code, but all I get back is the login page. Why can't I get the logged-in page? I want to scrape content that requires the user to log in. The username and password are correct, but I have no idea why I can't get through.

Amy Johnson
  • You need to tell cURL to store and send the cookies. Right now, each of your requests starts with a brand new non-logged-in session. – ceejayoz Feb 06 '15 at 15:43

1 Answer


Log in manually on the website and check exactly what is posted using the browser's network monitor. Maybe there is a simple typo in your parameters? You can open the network monitor with F12 (Google Chrome or IE). Start logging by pressing the appropriate button (make sure it preserves the log when a new page is loaded) and watch the entries roll by. Then log in and open the detailed view of the login request to inspect its headers and response.

It is important that you start logging the HTTP requests before loading the login page. Sometimes a cookie is created before you log in. That could give you a hint of what to send.

Remember that cookies need to be sent manually when not using a browser. So once you are logged in, remember to send the cookies (and any other required information) along with each cURL request.
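To make "sending cookies manually" concrete, here is a minimal sketch of what a browser does for you: it collects the Set-Cookie lines from the response headers and echoes them back in a Cookie header. The helper name buildCookieHeader() and the cookie values are made up for illustration, and the scalar type hints assume PHP 7 or later. cURL can also do this bookkeeping itself through its CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options.

```php
<?php
/**
 * Collect cookies from raw HTTP response headers and build the value of a
 * Cookie request header. buildCookieHeader() is a hypothetical helper,
 * not part of PHP or cURL.
 */
function buildCookieHeader(string $rawHeaders): string
{
    $cookies = [];
    // Capture the name=value pair at the start of every Set-Cookie line.
    preg_match_all('/^Set-Cookie:[ \t]*([^=;\s]+)=([^;\r\n]*)/mi', $rawHeaders, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // A later Set-Cookie with the same name replaces the earlier value.
        $cookies[$m[1]] = $m[2];
    }

    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Sample headers with placeholder session IDs (not real values).
$headers = "HTTP/1.1 303 See Other\r\n"
    . "Set-Cookie: PHPSESSID=first; path=/\r\n"
    . "Set-Cookie: PHPSESSID=second; path=/\r\n"
    . "Location: http://demo.smartjobboard.com/my-account/\r\n\r\n";

$cookieHeader = buildCookieHeader($headers);
echo $cookieHeader, "\n"; // PHPSESSID=second
```

The resulting string could then be passed to the next request with curl_setopt($ch, CURLOPT_COOKIE, $cookieHeader). This sketch ignores cookie attributes such as path and expiry, which a real cookie store would honor.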

Cookies are created, but the network monitor shows that the browser sends more parameters than your code does: return_url=&action=login&username=emp&password=emp
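As a quick sanity check, the body the browser posts can be reproduced with http_build_query(), which makes it easy to compare against what your own script sends. This is only a sketch: the array below copies the field names and values seen in the network monitor, and it assumes PHP's default "&" argument separator.

```php
<?php
// Field names and values as observed in the browser's network monitor.
$observed = array(
    'return_url' => '',
    'action'     => 'login',
    'username'   => 'emp',
    'password'   => 'emp',
);

// http_build_query() URL-encodes the fields into a form body.
// Empty strings are kept as "name="; null values would be dropped.
$body = http_build_query($observed);
echo $body, "\n"; // return_url=&action=login&username=emp&password=emp
```

If the string your script builds lacks any of these fields, the server may treat the request as a plain page view rather than a login attempt.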

Try this:

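<?php
$username = 'emp';
$pass = 'emp';

// Send every field the browser posts, not just the credentials.
$login = array(
    'username' => $username,
    'password' => $pass,
    'action' => 'login',
    'return_url' => '/my-account/'
);

$loginUrl = 'http://demo.smartjobboard.com/login';

$ch = curl_init();
// No effect on plain http; do not disable verification for real https sites.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $loginUrl);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($login));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1); // log the request/response exchange to stderr
curl_setopt($ch, CURLOPT_HEADER, 1);  // include response headers in the returned string
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');  // file the received cookies are saved to
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt'); // enables cookie handling and loads stored cookies

// First request: log in. The response is a 303 redirect that sets the session cookie.
$content1 = curl_exec($ch);

// Second request: fetch the logged-in page on the same handle, which resends the cookie.
curl_setopt($ch, CURLOPT_URL, "http://demo.smartjobboard.com/my-account/");
curl_setopt($ch, CURLOPT_HTTPGET, 1); // switch back from POST to GET

$content2 = curl_exec($ch);

curl_close($ch);

echo $content2;

?>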

This works; try it from a command line if you can. Note that the login request itself returns status 303 (See Other), which is a redirect to the account page. Cookies can be stored and sent back using cURL's CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options. Have a look at the manual.

So you probably need to make a second cURL call manually, sending back the received cookie.

Notice the extra options to retrieve the full verbose headers to learn what's happening!

My response:

HTTP/1.1 303 See Other
Server: nginx
Date: Fri, 06 Feb 2015 15:53:16 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 0
Connection: keep-alive
Keep-Alive: timeout=35
X-Powered-By: PHP/5.3.28
Set-Cookie: PHPSESSID=b33b1a0bd7a3bcd50e5e73671c383182; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: PHPSESSID=baf0d249c8fd7795fa1234cbaf16995e; path=/
Location: http://demo.smartjobboard.com/my-account/

and

* Hostname was NOT found in DNS cache
*   Trying 96.30.31.40...
* Connected to demo.smartjobboard.com (96.30.31.40) port 80 (#0)
> POST /login HTTP/1.1
Host: demo.smartjobboard.com
Accept: */*
Content-Length: 66
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 66 out of 66 bytes
< HTTP/1.1 303 See Other
* Server nginx is not blacklisted
< Server: nginx
< Date: Fri, 06 Feb 2015 15:53:16 GMT
< Content-Type: text/html;charset=utf-8
< Content-Length: 0
< Connection: keep-alive
< Keep-Alive: timeout=35
< X-Powered-By: PHP/5.3.28
< Set-Cookie: PHPSESSID=b33b1a0bd7a3bcd50e5e73671c383182; path=/
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=baf0d249c8fd7795fa1234cbaf16995e; path=/
< Location: http://demo.smartjobboard.com/my-account/
< 
* Connection #0 to host demo.smartjobboard.com left intact

(The Location line looked a bit garbled in my output, but I don't know why.) The redirect location is http://demo.smartjobboard.com/my-account/. You should parse the output to detect this address, though, so it works for other locations as well.
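One way to detect the redirect address is to pull the Location header out of the raw response that cURL returns when CURLOPT_HEADER is enabled. The following is a minimal sketch, not tested against the live site: extractRedirectLocation() is a hypothetical helper name, the sample headers are abbreviated from the response above, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
/**
 * Return the value of the Location header found in raw response headers,
 * or null when there is none. extractRedirectLocation() is a hypothetical
 * helper, not part of PHP or cURL.
 */
function extractRedirectLocation(string $rawHeaders): ?string
{
    // Header names are case-insensitive, so match with the "i" flag;
    // the "m" flag lets ^ match at the start of each header line.
    if (preg_match('/^Location:[ \t]*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Abbreviated copy of the 303 response shown above.
$response = "HTTP/1.1 303 See Other\r\n"
    . "Content-Length: 0\r\n"
    . "Location: http://demo.smartjobboard.com/my-account/\r\n\r\n";

$next = extractRedirectLocation($response);
echo $next, "\n"; // http://demo.smartjobboard.com/my-account/
```

The returned URL can then be used as CURLOPT_URL for the second request. As alternatives, curl_getinfo($ch, CURLINFO_REDIRECT_URL) reports the redirect target without manual parsing, and setting CURLOPT_FOLLOWLOCATION lets cURL follow the redirect within a single curl_exec() call.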

And I learned something as well ;).

Pianoman
  • I did everything except for the cookies right I think, can u test my code? I really have no idea what's wrong here. – Amy Johnson Feb 06 '15 at 15:07
  • I've tested your code (from Ubuntu command line using php executable) and I get a HUGE HTML file. When running it from the browser I see a few cookies appearing after login. But you're missing some parameters. Looking at the network log it says: `return_url=&action=login&username=emp&password=emp` – Pianoman Feb 06 '15 at 15:37
  • Amended my answer; hope this helps. – Pianoman Feb 06 '15 at 15:49
  • so am I missing return url and action? So what should be the solution? – Amy Johnson Feb 06 '15 at 15:56
  • Improved answer again. Please have a look at the new source code and the result of the CURL call. – Pianoman Feb 06 '15 at 15:58
  • I can see those in my chrome inspect element, what am I missing? – Amy Johnson Feb 06 '15 at 16:01
  • You need to do TWO calls through CURL. The first is to log in and retrieve the cookie. The second is to open the page indicated in the response to your first call and send the cookie back. – Pianoman Feb 06 '15 at 16:06
  • How? sorry I'm new to php. – Amy Johnson Feb 06 '15 at 16:07
  • OK, final try. I adapted the answer to a working solution now (tested it on my Ubuntu machine). So please mark as answer. Enjoy! – Pianoman Feb 06 '15 at 19:28
  • Your answer makes sense, but I saw someone who didn't execute twice to get the logged-in content. – Amy Johnson Feb 06 '15 at 23:43
  • question, cookies.txt, where does that come from? – Amy Johnson Feb 06 '15 at 23:43
  • The cookie file (cookie.txt in the code above) is created by cURL to store the received cookies. In the next call, the cookies are read from this file and sent back, just like a browser does. So please have a look at the manual for PHP cURL. – Pianoman Feb 07 '15 at 19:55