urllib.urlopen returns an old page?

Question

I have a very simple HTML page (a directory listing) that I read with urllib like this:

import urllib
page = urllib.urlopen(coreRepositoryUrl).read()

The problem is that the HTML I get this way is older than the current version of the page. info() returns this:

Date: Fri, 19 Apr 2013 18:48:09 GMT
Server: Apache/2.0.52 (Fedora)
Content-Type: text/html; charset=UTF-8
Connection: close
Age: 481084

And the page was last updated today (2013-04-25). Which component might be the one that caches?

Could you add your link? `urlopen().info()` works as expected for me with _google.com_ ([PasteBin](http://pastebin.com/su6WuMJY)) — awesoon, Apr 25 '13 at 08:55
@soon It's a local build server. (I can't reach pastebin behind the corporate proxy unfortunately...) But I just found a similar question with a disappointing answer... http://stackoverflow.com/questions/3586295/does-urllib2-urlopen-cache-stuff — zeller, Apr 25 '13 at 08:59
`urllib` might use its own cache (under certain conditions, see [`tempcache`, `ftpcache` in `URLopener`](http://hg.python.org/cpython/file/09811ecd5df1/Lib/urllib.py#l139)) that is unrelated to http cache. `urllib.urlcleanup()` clears the cache. `urllib2` doesn't cache anything. — jfs, Apr 25 '13 at 12:37

score 3 · Accepted Answer · answered Apr 25 '13 at 13:08

Add the header "Cache-Control" with the value "max-age=0" to your request:

import urllib2

# Ask every cache between client and server to revalidate before answering.
req = urllib2.Request(url)
req.add_header('Cache-Control', 'max-age=0')
resp = urllib2.urlopen(req)
content = resp.read()

With that header, each cache along the way will revalidate its cache entry with the origin server before serving it.
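The `Age` header in the question is normally added by an HTTP cache rather than by the origin server, so its presence suggests an intermediate cache (possibly the corporate proxy mentioned in the comments) served the stale copy. The following is only a minimal sketch of the same idea that also runs on Python 3, where `urllib2.Request` lives in `urllib.request`. The helper names `build_fresh_request` and `cached_age_in_days` and the example URL are hypothetical and not part of the original answer.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2, as in the answer above


def build_fresh_request(url):
    """Return a Request that asks caches on the path to revalidate."""
    req = Request(url)
    req.add_header('Cache-Control', 'max-age=0')
    return req


def cached_age_in_days(headers):
    """Convert the Age header (seconds) to days; None if the header is absent."""
    age = headers.get('Age')
    if age is None:
        return None
    return int(age) / 86400.0


# Hypothetical URL; substitute the real build server address.
req = build_fresh_request('http://buildserver.example/repo/')
# Request stores header names capitalized, hence 'Cache-control' here.
print(req.get_header('Cache-control'))
# Age value taken from the question: a bit over five and a half days.
print(cached_age_in_days({'Age': '481084'}))
```

Pass the request to `urlopen` as in the answer, then call `cached_age_in_days(resp.info())` on the response. If the `Age` header is gone or close to zero, the copy was probably revalidated.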
