I have a very simple HTML page (a directory listing) that I am trying to read with urllib like this:

page = urllib.urlopen(coreRepositoryUrl).read()

The problem is that the HTML I get this way is an older version of the page, not the current one. info() returns these headers:

Date: Fri, 19 Apr 2013 18:48:09 GMT
Server: Apache/2.0.52 (Fedora)
Content-Type: text/html; charset=UTF-8
Connection: close
Age: 481084

The page was last updated today (2013-04-25), though. Which component might be doing the caching?

Charles
zeller
  • Could you add your link? `urlopen().info()` works as expected for me with _google.com_ ([PasteBin](http://pastebin.com/su6WuMJY)) – awesoon Apr 25 '13 at 08:55
  • @soon It's a local build server. (I can't reach pastebin behind the corporate proxy unfortunately...) But I just found a similar question with a disappointing answer... http://stackoverflow.com/questions/3586295/does-urllib2-urlopen-cache-stuff – zeller Apr 25 '13 at 08:59
  • `urllib` might use its own cache (under certain conditions, see [`tempcache`, `ftpcache` in `URLopener`](http://hg.python.org/cpython/file/09811ecd5df1/Lib/urllib.py#l139)) that is unrelated to the HTTP cache. `urllib.urlcleanup()` clears that cache. `urllib2` doesn't cache anything. – jfs Apr 25 '13 at 12:37

1 Answer

Add a `Cache-Control` header with the value `max-age=0` to your request:

import urllib2
req = urllib2.Request(url)
req.add_header('Cache-Control', 'max-age=0')
resp = urllib2.urlopen(req)
content = resp.read()

With that header, each cache along the way has to revalidate its stored entry with the origin server before serving it.
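For anyone on Python 3, where `urllib2` was merged into `urllib.request`, the same idea might look roughly like the sketch below. The helper names `build_fresh_request` and `cached_age_seconds` and the `example.com` URL are placeholders made up for illustration, and the extra `Pragma: no-cache` header is an assumption on my part that some older HTTP/1.0 proxies may only look at that one. Checking for an `Age` header in the response is a rough way to see whether a cache still answered, since that header is added by caches rather than by the origin server.

```python
import urllib.request  # Python 3 home of what used to be urllib2


def build_fresh_request(url):
    """Build a request that asks caches to revalidate (hypothetical helper)."""
    req = urllib.request.Request(url)
    # Same header as in the answer: caches should revalidate their entry.
    req.add_header("Cache-Control", "max-age=0")
    # Assumption: older HTTP/1.0 proxies may only honor Pragma.
    req.add_header("Pragma", "no-cache")
    return req


def cached_age_seconds(headers):
    """Return the Age header as an int, or None if it is absent."""
    age = headers.get("Age")
    return int(age) if age is not None else None


if __name__ == "__main__":
    url = "http://example.com/"  # placeholder, not the asker's build server
    with urllib.request.urlopen(build_fresh_request(url)) as resp:
        content = resp.read()
        # A large value here suggests a cache still served a stored copy.
        print(cached_age_seconds(resp.headers))
```

If the printed age is still large after sending these headers, the cache in between is probably ignoring the request directives, and it would have to be fixed on the proxy or server side.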

acj