
I'm looking for a way to get the cookies from a website using a web crawler. I have looked into it, but I can't quite find the right library or approach for my problem. I'm not looking to get the cookies from the browser; I'm trying to get them from the website itself (any website).

I hope someone can provide me with the right solution / library! Thank you in advance!

Kind regards, Mike

Edit: I know there is a similar post about crawler libraries, but that post is outdated; it's from 2011.

mvandiepen
  • You simply parse the Set-Cookie response header. Questions for tools/libraries are off-topic though. Pick one yourself, and come back if you have trouble with it. – Peter Dec 18 '18 at 16:59
  • All right, thank you @Peter ! I'll look into it =) – mvandiepen Dec 18 '18 at 17:03

1 Answer


You can get cookies with php-curl using a script like this:

<?php    

// The url to visit
$url = "https://www.google.com";

// Where to read cookies from and where to write them
$cookiesFile = "cookies.txt";

// Setup
$handle = curl_init();

curl_setopt( $handle, CURLOPT_URL,              $url );
curl_setopt( $handle, CURLOPT_RETURNTRANSFER,   true );
curl_setopt( $handle, CURLOPT_FOLLOWLOCATION,   true );

// Send cookies upon request and update them as per response
curl_setopt( $handle, CURLOPT_COOKIEFILE,       $cookiesFile );
curl_setopt( $handle, CURLOPT_COOKIEJAR,        $cookiesFile );

// Send request, get response
$response = curl_exec( $handle );

// Done with curl
curl_close( $handle );

The resulting cookies.txt file looks like this:

# Netscape HTTP Cookie File
# http://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

.google.com TRUE    /   FALSE   1547833930  1P_JAR  2018-12-19-17
#HttpOnly_.google.com   TRUE    /   FALSE   1561053130  NID 150=OF8rpPblfIZCnga6aoN_Zo6_H9nv87Th7ggQZDijf76GJ11ZDkWXmQXEQ9cUOBC3z7vY_Ea0-NtGcK5wi8Qo3myU1nnNksfgTreuIHJRiI0-pEqN9v4H7YGafp6r0RFHFueUbJ9IWo3Bu83Sh3akVW6bXzY2I-rJvaIIGoW9Fdg

Cookies are stored in a specific format called the Netscape HTTP Cookie File format; you may look at this question and related answers for more details.
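To work with the stored cookies in PHP rather than just inspect the file, the format can be parsed line by line. The following is a minimal sketch, assuming the seven tab-separated fields that libcurl writes (domain, subdomain flag, path, secure flag, expiry, name, value) and the `#HttpOnly_` prefix visible in the sample above; `parseNetscapeCookies` is a hypothetical helper name, not part of php-curl.

```php
<?php

/**
 * Parse the contents of a Netscape-format cookie file into an array of cookies.
 * Hypothetical helper for illustration; not part of php-curl.
 */
function parseNetscapeCookies(string $contents): array
{
    $cookies = [];

    foreach (preg_split('/\r\n|\r|\n/', $contents) as $line) {
        $httpOnly = false;

        // libcurl marks HttpOnly cookies with a "#HttpOnly_" prefix;
        // such lines carry data and are not comments.
        if (strncmp($line, '#HttpOnly_', 10) === 0) {
            $httpOnly = true;
            $line = substr($line, 10);
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or genuine comment
        }

        // A cookie line has exactly seven tab-separated fields.
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue;
        }

        [$domain, $subdomains, $path, $secure, $expires, $name, $value] = $fields;

        $cookies[] = [
            'domain'            => $domain,
            'includeSubdomains' => $subdomains === 'TRUE',
            'path'              => $path,
            'secure'            => $secure === 'TRUE',
            'expires'           => (int) $expires, // Unix timestamp, 0 for session cookies
            'httpOnly'          => $httpOnly,
            'name'              => $name,
            'value'             => $value,
        ];
    }

    return $cookies;
}
```

It could be called as `parseNetscapeCookies(file_get_contents('cookies.txt'))` once the curl handle has been closed, since that is when libcurl writes the cookie jar.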


When subsequent requests are made with the above code, the cookies previously received are read from the cookies file and sent with the request. When a response updates a cookie, the file is updated too (libcurl writes it when the handle is closed).

This is important because, as you visit more pages of the same website with php-curl, the cookie storage stays consistent. Session cookies are a typical example.

The above code stores the response body (the HTML of the page visited) in $response, not the HTTP status code.
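Because cookies arrive in Set-Cookie response headers, they can also be collected inside the script instead of being read back from the cookies file. The sketch below registers a header callback with `CURLOPT_HEADERFUNCTION` and parses each Set-Cookie line; `parseSetCookieLine` and `fetchCookies` are hypothetical helper names, and the parsing is deliberately simple (it does not validate attributes or handle every edge case of the cookie specification).

```php
<?php

/**
 * Parse one raw header line. Returns the cookie name, value and attributes
 * if the line is a Set-Cookie header, or null otherwise.
 * Hypothetical helper for illustration.
 */
function parseSetCookieLine(string $headerLine): ?array
{
    // Header names are case-insensitive (HTTP/2 sends them in lowercase).
    if (stripos($headerLine, 'Set-Cookie:') !== 0) {
        return null;
    }

    $parts = explode(';', trim(substr($headerLine, strlen('Set-Cookie:'))));

    // The first segment is always "name=value".
    $pair = explode('=', array_shift($parts), 2);
    if (count($pair) !== 2 || trim($pair[0]) === '') {
        return null;
    }

    $cookie = ['name' => trim($pair[0]), 'value' => trim($pair[1]), 'attributes' => []];

    // Remaining segments are attributes such as Path=/ or the HttpOnly flag.
    foreach ($parts as $part) {
        $kv  = explode('=', trim($part), 2);
        $key = strtolower(trim($kv[0]));
        if ($key !== '') {
            $cookie['attributes'][$key] = $kv[1] ?? true;
        }
    }

    return $cookie;
}

/**
 * Fetch a URL and return the cookies set by every response in the redirect chain.
 */
function fetchCookies(string $url): array
{
    $cookies = [];
    $handle  = curl_init($url);

    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($handle, CURLOPT_HEADERFUNCTION,
        function ($ch, string $line) use (&$cookies): int {
            $cookie = parseSetCookieLine($line);
            if ($cookie !== null) {
                $cookies[] = $cookie;
            }
            return strlen($line); // curl expects the number of bytes handled
        });

    curl_exec($handle);
    curl_close($handle);

    return $cookies;
}
```

A call such as `fetchCookies('https://www.google.com')` would return one entry per Set-Cookie header received. If only the final cookie state is needed, `curl_getinfo($handle, CURLINFO_COOKIELIST)` is an alternative when the cookie engine is enabled on the handle.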


Note that if you just need to visit a couple of pages and get their cookies, that is an easy job that can be accomplished with just the code shown at the beginning.

It can easily be adjusted to make a POST request in case you need to send data, as when a user fills in a form and submits it.
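A POST variant might look like the following sketch. It only adds `CURLOPT_POST` and `CURLOPT_POSTFIELDS` to the options already used above; the helper names `buildPostOptions` and `postForm`, as well as any URL and form field names passed to them, are assumptions for illustration.

```php
<?php

/**
 * Build the curl options for a form POST that shares the same cookies file.
 * Hypothetical helper for illustration.
 */
function buildPostOptions(string $url, array $fields, string $cookiesFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Send the fields URL-encoded, as a browser does for a simple form.
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        // Send stored cookies and save any that the response sets.
        CURLOPT_COOKIEFILE     => $cookiesFile,
        CURLOPT_COOKIEJAR      => $cookiesFile,
    ];
}

/**
 * Submit the form and return the response body (or false on failure).
 */
function postForm(string $url, array $fields, string $cookiesFile)
{
    $handle = curl_init();
    curl_setopt_array($handle, buildPostOptions($url, $fields, $cookiesFile));

    $response = curl_exec($handle);
    curl_close($handle); // the cookie jar is written when the handle is closed

    return $response;
}
```

For example, `postForm('https://example.com/login', ['user' => 'mike', 'pass' => 'secret'], 'cookies.txt')` would submit two made-up fields to a placeholder URL and store any session cookie in cookies.txt for later requests.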

If you need to scrape an entire site, things may not be so trivial.

Finally, take into account that cookies may be set by JavaScript code.

If you need to visit an interactive, JavaScript-rich site, simulate user interaction, and then inspect the cookies, php-curl is not suitable. You would need to script a headless browser.

Paolo