I’m trying to retrieve HTML content from a PHP script using Python's requests library. The script resides on my local Apache server, and I access it directly at: http://localhost/aaa/index.php

The script's content is:

<?php
    $headers = json_encode(apache_request_headers());
?>

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>The Title</title>
  <meta name="description" content="The Title">
</head>

<body>
  <?php echo json_encode($headers); ?>
</body>
</html>

The direct access of the above script produces the following response:

<head>
  <meta charset="utf-8">
  <title>The Title</title>
  <meta name="description" content="The Title">
</head>

<body>
"{\"Host\":\"localhost\",\"User-Agent\":\"Mozilla\\\/5.0 (Windows NT 6.3; WOW64; rv:42.0) Gecko\\\
/20100101 Firefox\\\/42.0\",\"Accept\":\"text\\\/html,application\\\/xhtml+xml,application\\\/xml;q=0
.9,*\\\/*;q=0.8\",\"Accept-Language\":\"en-US,en;q=0.5\",\"Accept-Encoding\":\"gzip, deflate\",\"Cookie
\":\"menu=users%3Bconfiguration; fieldset=; PHPSESSID=tn82odn5hdtr45mw0bkd6rhf56; nr
=5c3ab462abb1d3364b8ba59fa4d8b7f6; ru=popopo; rp=64864wb5630986rgn5860f52vy0614909b8a8736
\",\"Connection\":\"keep-alive\",\"Cache-Control\":\"max-age=0\"}"
</body>
</html>

When I access the above URL [http://localhost/aaa/index.php] using Python, I get a different response.

The Python code:

import requests

url = "http://localhost/aaa/index.php"

headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip, deflate',
           'Accept-Language': 'en-US,en;q=0.5',
           'Connection': 'Keep-Alive',
           'Content-Type': 'text/html; charset=UTF-8'}

req = requests.get(url, headers=headers)

print("Body :::", req.content)

And the response:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>The Title</title>
  <meta name="description" content="The Title">
</head>

<body>
  "{\\"Host\\":\\"localhost\\",\\"Accept-Encoding\\":\\"gzip, 
  deflate\\",\\"Accept-Language\\":\\"en-US,en;q=0.5\\",
  \\"Accept-Charset\\":\\"ISO-8859-1,utf-8;q=0.7,*;q=0.3\\",
  \\"User-Agent\\":\\"Mozilla\\\\\\/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident
  \\\\\\/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)\\",\\"Accept\\":\\"text\\\\\\/html,application
  \\\\\\/xhtml+xml,application\\\\\\/xml;q=0.9,*
  \\\\\\/*;q=0.8\\",\\"Connection\\":\\"Keep-Alive
  \\",\\"Content-Type\\":\\"text\\\\\\/html; charset=UTF-8\\"}"
</body>
</html>

Notice that the "Cookie" header is missing when I request the resource with Python. The cookie is what I actually want to retrieve: I need it in order to read the content of other PHP pages.

I also tried the following, with no success:

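import requests

url = "http://localhost/aaa/index.php"

# `headers` is the same dict as in the first snippet.
session = requests.Session()
session.cookies.get_dict()  # empty at this point: no request has been made yet

response = session.get(url, headers=headers)
print("Cookies :::", session.cookies.get_dict())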

Is there any way to accomplish that?

  • Have you tried looking at `req.cookies`? – Morgan Thrapp Dec 10 '15 at 20:33
  • @Morgan Thrapp Yes, and the cookiejar is empty –  Dec 10 '15 at 20:34
  • Possible duplicate of [python requests get cookies](http://stackoverflow.com/questions/25091976/python-requests-get-cookies) – Darth Vader Dec 10 '15 at 20:34
  • @Darth Vader I had tried that, and I get empty cookiejar –  Dec 10 '15 at 20:36
  • _"The direct access of the above script produces the following response:"_ ... not quite. You did something along the line to have the server send you that cookie and now your browser is sending it back with each request. If you cleared all cookies from your browser for localhost, you wouldn't see it any more. – tdelaney Dec 10 '15 at 20:40
  • @tdelaney Yes, it's true. But I'm still doing it ["You did something..."] while I'm calling the Python script. –  Dec 10 '15 at 20:53
  • What is the "it"? Is there a form login? The trick is to create a `mysession = requests.Session()` and then do the interaction (including posting the login form) with `mysession` methods instead of `requests` methods. Requests will fill in the cookie for you. – tdelaney Dec 10 '15 at 20:57
  • @tdelaney No, there is not any form. It's a PHP script that checks if a user is logged in. In order to read the content from that script I need the PHPSESSID –  Dec 10 '15 at 21:07
  • The php code is just sending back the cookies it gets from the client. As far as your python script is concerned, the user isn't logged in even if he "is logged in" in the browser. Your python script is a different client. If you are trying to get the browser's session id from the python script, it won't work. – tdelaney Dec 10 '15 at 21:11
  • @tdelaney Is there any workaround? Any suggestions? –  Dec 10 '15 at 21:16
  • PHP stores that data somewhere, and you may be able to grab it somehow on the server side. But that would be a huge security hole. The web server tries hard to keep your session cookie secret so that your session can't be hijacked. Other programs shouldn't be able to get it. Here's hoping there isn't a workaround! – tdelaney Dec 10 '15 at 21:25
  • @tdelaney The funny thing is that I have done this with ColdFusion. What I'm doing now is converting ColdFusion code to Python. –  Dec 10 '15 at 21:29

2 Answers

Your browser adds a "Cookie" HTTP header to the request it sends to your PHP code, so your PHP code returns it (as per your code). That's what browsers do: they accept cookies set by the server and send them back with later requests. A bare `requests.get` call doesn't do that: it knows nothing about the cookies stored in your browser.

Your Python program is not sending a "cookie" HTTP header, so your PHP code is not returning it. Your Python only sends 'User-Agent', 'Accept', 'Accept-Charset', 'Accept-Encoding', 'Accept-Language', 'Connection', and 'Content-Type'. But no 'Cookie'.

No 'Cookie' sent means no 'Cookie' for you :)
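
To see this without a PHP server, here is a minimal sketch that uses only Python's standard library. The `EchoHeaders` handler is a hypothetical stand-in for `index.php` (it replies with the request headers as JSON), `fetch_echoed_headers` is an illustrative helper, and the session id value is made up. With `requests`, the equivalent of the second call would be passing `cookies={...}` to `requests.get`.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Hypothetical stand-in for index.php: echoes the request headers as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items())).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# An opener that ignores proxy settings, so the request really goes to 127.0.0.1.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_echoed_headers(url, extra_headers=None):
    """GET the URL and return the headers the server reports having received."""
    request = urllib.request.Request(url, headers=extra_headers or {})
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


server = HTTPServer(("127.0.0.1", 0), EchoHeaders)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

try:
    # Client 1: sends no Cookie header, so the server has nothing to echo.
    without_cookie = fetch_echoed_headers(url)
    # Client 2: sends a Cookie header explicitly (made-up session id).
    with_cookie = fetch_echoed_headers(url, {"Cookie": "PHPSESSID=example-session-id"})
finally:
    server.shutdown()
    server.server_close()

print("Cookie header echoed for client 1:", "Cookie" in without_cookie)
print("Cookie header echoed for client 2:", with_cookie.get("Cookie"))
```

The server only ever reports what the client put in the request, which is the whole difference between the browser output and the Python output in the question.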

BareNakedCoder
  1. How PHP sessions work.

PHP tracks your users with cookies, but only if you're using sessions. Whenever you start a session, PHP checks the user's request for a cookie carrying a session id (named PHPSESSID by default). If there is no such cookie in the request, PHP generates a new session id. The session id travels to the client as a cookie in the response, so the next time the user accesses this or another page, the cookie with that unique session id is present in the request.
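
The logic described here can be modelled in a few lines of Python. This is a simplified sketch of the idea, not PHP's actual implementation: `start_session`, the `sessions` dict and the cookie attributes are all assumptions made for illustration, and the model only issues a `Set-Cookie` value when it creates a new id.

```python
import secrets
from http.cookies import SimpleCookie

SESSION_COOKIE = "PHPSESSID"  # PHP's default session cookie name
sessions = {}  # session id -> per-user data (stand-in for the server's session storage)


def start_session(cookie_header):
    """Model one request: return (session_id, set_cookie_value_or_None).

    cookie_header is the raw value of the request's Cookie header, or None
    when the client sent no cookies at all.
    """
    cookie = SimpleCookie(cookie_header or "")
    morsel = cookie.get(SESSION_COOKIE)
    if morsel is not None:
        # The client already has a session id: reuse it.
        session_id = morsel.value
        set_cookie = None
    else:
        # No session cookie in the request: generate a new id and hand it out.
        session_id = secrets.token_hex(16)
        set_cookie = "%s=%s; Path=/" % (SESSION_COOKIE, session_id)
    sessions.setdefault(session_id, {})
    return session_id, set_cookie


# First request: no cookies, so a new session id is issued.
first_id, first_set_cookie = start_session(None)
# Second request: the client sends the id back, so the same session is reused.
second_id, second_set_cookie = start_session("PHPSESSID=" + first_id)

print(first_set_cookie)
print(second_id == first_id)
```

A client that never sends the cookie back gets a fresh id on every request, which is why such a client is always treated as a new visitor.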

  2. What happens when you don't use sessions.

Nothing. PHP doesn't automatically check the request for cookies. However, any cookies that were set earlier remain valid until they expire. So if your browser received a PHPSESSID cookie in the past and it hasn't expired yet, the browser keeps it and continues sending it to the server with each request. That is why your code is able to retrieve and print it in the output.

  3. What happens when you're sending a request from a Python script.

Nothing, unless you ask for it. If you don't tell Python to send cookies to the server, it won't. Since it doesn't send any cookies, the PHP script doesn't receive any. And since the PHP script doesn't start a session anywhere in its code, it doesn't create one automatically, either.

  4. How to solve it.

You can start a session in your PHP script. It will then generate a session cookie and send it with the response. Note, however, that this will not let your Python script join a session that you started in your browser, since a new session id will be generated. To join an existing session, you would need to retrieve the PHPSESSID cookie from your browser's data, and this data is usually encrypted to protect your cookies from malicious programs (and even Python scripts).

  5. Conclusion.

In your PHP code, call this at the very beginning, before any output:

session_start();

Well, at least that was a solution a few years ago. I don't know the latest PHP fashions in session handling.
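
Once the server does set a session cookie, a client that keeps a cookie jar picks it up and sends it back on later requests. The sketch below shows that round trip with the standard library only. `SessionHandler` is a hypothetical stand-in for a PHP page that starts a session, and the names and cookie attributes are assumptions. With `requests`, a `requests.Session()` object plays the role of the cookie jar here.

```python
import json
import secrets
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer


class SessionHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a PHP page that starts a session."""

    def do_GET(self):
        received = self.headers.get("Cookie")
        body = json.dumps({"received_cookie": received}).encode("utf-8")
        self.send_response(200)
        if "PHPSESSID" not in SimpleCookie(received or ""):
            # No session cookie in the request: hand out a fresh session id.
            self.send_header("Set-Cookie", "PHPSESSID=%s; Path=/" % secrets.token_hex(16))
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), SessionHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),          # ignore any proxy settings
    urllib.request.HTTPCookieProcessor(jar),  # store cookies and resend them
)

try:
    with opener.open(url, timeout=5) as response:
        first = json.loads(response.read().decode("utf-8"))
    with opener.open(url, timeout=5) as response:
        second = json.loads(response.read().decode("utf-8"))
finally:
    server.shutdown()
    server.server_close()

session_ids = [cookie.value for cookie in jar if cookie.name == "PHPSESSID"]
print("First request sent cookie: ", first["received_cookie"])
print("Second request sent cookie:", second["received_cookie"])
```

The session id in the jar belongs to the Python client only. It is a different session from the one your browser holds, as explained in step 4.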

Lav