1

When I request a website with the Python module requests, I don't get an up-to-date webpage but a cached one.

As far as I know, requests does no caching of its own, or am I wrong?

import requests

finanzennet_request = requests.get('http://finanzen.net/aktien/Tesla-Aktie')
print(finanzennet_request.text)

This yields the following result:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- CacheEngine generated: 87039 chars in 0,0313 seconds on 26.08.2015 21:39:07 from NT -->

As you can see, it says "CacheEngine generated ...". Can it really be that the web server recognizes that my script is not a real user and therefore only serves me a cached version? If so, how can I avoid it?

KoKlA

2 Answers

1

When troubleshooting what looks like script-specific behavior when requesting webpages, check the page in a browser first, before assuming that something like the User-Agent or other headers leads to a different response from the remote web server.

The URL that you've specified returns that 'CacheEngine' line for me in Chrome, Safari, and Firefox.

When you do come across a page that actually responds with different content for requests, I'd suggest first setting your User-Agent header. While you can ask the remote server not to serve cached content by specifying:

{'cache-control': 'private, max-age=0, no-cache'}

in the headers, keep in mind that this is only a request to the remote web server; it is not obliged to honor it.

A full request, pretending to be a browser and asking for non-cached content, would look like this:

import requests

url = 'http://finanzen.net/aktien/Tesla-Aktie'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
    'cache-control': 'private, max-age=0, no-cache'
}
response = requests.get(url, headers=headers)
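If you're unsure whether a response actually came from a cache, the response headers can offer hints. As a sketch (header names vary by server, and none are guaranteed to be present), you could filter the headers for common caching-related entries such as `Age` or `X-Cache`; this works on any mapping of headers, including a requests `response.headers`:

```python
# Sketch: look for common (but not guaranteed) caching hints in a
# response's headers, e.g. 'Age' or 'X-Cache'.
def cache_indicators(headers):
    """Return the subset of headers that hint at a cached response."""
    interesting = ('age', 'x-cache', 'date', 'expires', 'last-modified')
    return {k: v for k, v in headers.items() if k.lower() in interesting}

# Hypothetical header set for illustration:
hdrs = {'Content-Type': 'text/html',
        'Age': '3600',
        'Date': 'Wed, 26 Aug 2015 21:39:07 GMT'}
print(cache_indicators(hdrs))
# → {'Age': '3600', 'Date': 'Wed, 26 Aug 2015 21:39:07 GMT'}
```

A large `Age` value, for example, suggests the response sat in an intermediate cache for that many seconds.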
Community
  • @KoKlA Then it seems as if you got a response that wasn't touched by the remote server's cache. Keep in mind that you cannot control the remote server, but you can make your `requests.get` ask as if it was a browser (as per my answer). –  Aug 26 '15 at 20:27
  • 2
    I did check before. See this link [Chrome](http://postimg.org/image/do84les6n/); it doesn't contain this line. Anyway, this wasn't how I recognized that it's a cached website. I simply get a share price from a few days ago, not from today. Take a look at this comparison: [Via Requests](http://s21.postimg.org/fbs9733w7/Request2.png) [Via Chrome](http://s12.postimg.org/4hi986rwt/Request3.png). I also tried the headers you supplied, but it didn't work. Good idea, though. – KoKlA Aug 26 '15 at 20:34
  • @KoKlA if the two are different and that site is giving you different responses based on `requests` v. Chrome, see my above answer for steps on how to make `requests` appear to be the Chrome browser. –  Aug 26 '15 at 20:37
  • @KoKlA I would suggest getting the headers you're sending from Chrome and replicating those entries in the headers you set in requests [in response to your edit]. –  Aug 26 '15 at 20:45
  • 1
    OK, I added all headers from Chrome and it still doesn't work. This is how the two requests look (I intercepted them with Burp): [Comparison Requests vs Chrome](http://s16.postimg.org/ioqaostbo/Requests.jpg). Apart from the order of the headers, they are the same. I even added the Cookie. I personally don't think it's related to the headers; rather, the website uses a different mechanism to determine whether it's a real user or not. But that's just a guess. – KoKlA Aug 26 '15 at 21:34
1

Nearly two years later, going through my Stack Overflow questions, I found the issue.

I didn't see it at the time, but if you open the "Comparison Requests vs Chrome" screenshot, which I previously posted as a comment to user559633, you can see that these are actually two different domains. With Chrome I was accessing finanzen.at, and with requests finanzen.net.

So long story short: it was a mistake on my side, and not requests caching the website or the web server altering the response based on user detection.
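In hindsight, a quick sanity check of which host each URL actually points at would have caught the mix-up immediately. Here's a small sketch using only the standard library (with requests, you could also print `response.url` after the call to see the final URL, including any redirects):

```python
from urllib.parse import urlparse

# Sketch: check whether two URLs point at the same host.
# This would have caught the finanzen.at vs finanzen.net mix-up.
def same_host(url_a, url_b):
    return urlparse(url_a).netloc.lower() == urlparse(url_b).netloc.lower()

print(same_host('http://finanzen.net/aktien/Tesla-Aktie',
                'http://finanzen.at/aktien/Tesla-Aktie'))   # → False
```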

KoKlA
  • That doesn't solve the issue for me. I'm visiting the same URL; I even save the requests web page so I can compare them, and requests is getting a version that's about an hour old. I used a random user agent (from about 10 choices). – Human programmer Dec 20 '21 at 21:21