In my Python application I have to read many web pages to collect data. To reduce the number of HTTP calls I would like to fetch only the pages that have changed. My problem is that my code always tells me that the pages have changed (code 200) when in reality they have not.

This is my code:

from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime

def url_change():
    urls = mytab.objects.all()
    # these are some of the URLs:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...

    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date is None:
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()

        request.add_header("If-Modified-Since", url.last_date)

        try:
            response = urllib2.urlopen(request) # Make the request
            # some actions
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code

I do not understand what has gone wrong. Can anyone help me?

Patricio Molina
RoverDar
  • Did you consider the fact that a web page may lie about dates? –  Mar 04 '13 at 17:25
  • @princess-of-the-universe No, I have not considered this. So what can be done to check whether a page has changed? I also tried hashing the page, but it changes each time I load it. – RoverDar Mar 04 '13 at 17:35

2 Answers

Web servers aren't required to respond with a 304 status when you send an 'If-Modified-Since' header. They're free to send an HTTP 200 and send the entire page again.

Sending an 'If-Modified-Since' or 'If-None-Match' header tells the server that you'd accept a "not modified" response if your cached copy is still current. It's like sending an 'Accept-Encoding: gzip, deflate' header -- you're just telling the server you'll accept something, not requiring it.
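
A minimal sketch of handling both outcomes in Python 3 (the question uses Python 2's urllib2; urllib.request is its Python 3 counterpart). The helper name `fetch_if_changed` and its `opener` parameter are made up for illustration. It sends back the validators the server itself supplied earlier (ETag, Last-Modified) rather than the client's clock, and treats a 200 as "the page was sent", not as proof that it changed:

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, etag=None, last_modified=None,
                     opener=urllib.request.urlopen):
    """Conditional GET. Returns (changed, body, etag, last_modified).

    `fetch_if_changed` and `opener` are hypothetical names for this sketch;
    `opener` exists only so the network call can be swapped out.
    """
    request = urllib.request.Request(url)
    # Echo back the validators from the previous response, if we have any.
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)

    try:
        response = opener(request)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # The server confirmed that our copy is still current.
            return False, None, etag, last_modified
        raise

    # A 200 only means the server sent the page. It may or may not have
    # changed, because honouring the conditional headers is optional.
    body = response.read()
    return (True, body,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"))
```

Store the returned ETag and Last-Modified values next to each URL and pass them in on the next run. For servers that always answer 200, you still need a content comparison of your own.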

Jonathan Vanasco
  • Thanks. What can I use to check if a page has changed? – RoverDar Mar 04 '13 at 17:36
  • The easiest would be to fingerprint each one with an MD5 hash, and store that locally to compare. BUT the problem with that is that while the "main" content is unchanged, the "ancillary" content may have changed -- different ad tags, 'promoted stories', 'recommended links', 'partner links' etc. Even a timestamp on the page will throw off the MD5. – Jonathan Vanasco Mar 04 '13 at 17:48
  • Would it help to take only part of the page, for example? – RoverDar Mar 04 '13 at 17:56
  • In my case I don't need to consider the whole page, only the part I want to collect data from (e.g. the review section). I calculate the hash on that part and store it locally. Is that right? – RoverDar Mar 04 '13 at 18:03
  • Yeah. Create a database with "url|timestamp_accessed|hash" and then query for the hash of the latest timestamp_accessed. If it's different, you've got new content. If you're only using those 5 sites, you can use BeautifulSoup to figure out how to isolate only the sections you want. – Jonathan Vanasco Mar 04 '13 at 18:21
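
The scheme from the comments above could be sketched roughly like this in Python 3, using only the standard library. The table name `page_hashes`, the helper names, and the use of SHA-256 in place of the MD5 mentioned above are assumptions for illustration. `section_html` stands for whatever fragment you have already isolated, for example the reviews block extracted with BeautifulSoup:

```python
import hashlib
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the url|timestamp_accessed|hash table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hashes ("
        "url TEXT, timestamp_accessed REAL, hash TEXT)"
    )
    return conn


def section_changed(conn, url, section_html, now=None):
    """Record the fingerprint of `section_html`.

    Returns True if the URL is new or its latest stored hash differs.
    """
    digest = hashlib.sha256(section_html.encode("utf-8")).hexdigest()
    # Look up the hash stored at the most recent access for this URL.
    row = conn.execute(
        "SELECT hash FROM page_hashes WHERE url = ? "
        "ORDER BY timestamp_accessed DESC, rowid DESC LIMIT 1",
        (url,),
    ).fetchone()
    conn.execute(
        "INSERT INTO page_hashes VALUES (?, ?, ?)",
        (url, time.time() if now is None else now, digest),
    )
    conn.commit()
    return row is None or row[0] != digest
```

Pass a file path to `open_store` to keep the hashes between runs. Hashing only the isolated section is what keeps ads, timestamps and other ancillary markup from triggering false positives.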

A good way to check whether a site returns 304 is to use Google Chrome's dev tools. E.g. below is an annotated example of using Chrome on the BLS website. Keep refreshing and you will see that the server keeps returning 304. If you force a refresh with Ctrl+F5 (Windows), you will see that it returns status code 200 instead.

You can use this technique on your example to find out whether the server never returns 304 or whether you have formatted your request headers incorrectly. Sometimes a web page imports a resource that does not respect the If- headers, so it returns 200 whatever you do (if any resource on the page does not return 304, the whole page will return 200). But sometimes you are only interested in a specific part of a website, and you can cheat by loading that resource directly and bypassing the whole document.
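
To rule out a malformed header on your side, you can inspect the date string before sending it. A small sketch in Python 3, assuming the standard library's email.utils; the helper name `http_date` is made up. HTTP dates are expressed in GMT in the form printed here:

```python
from email.utils import formatdate, parsedate_to_datetime


def http_date(epoch_seconds):
    """Format a Unix timestamp as an HTTP date, e.g. for If-Modified-Since."""
    return formatdate(epoch_seconds, usegmt=True)


stamp = http_date(0)
print(stamp)  # Thu, 01 Jan 1970 00:00:00 GMT

# Round-trip to confirm the string parses back to the same instant.
assert parsedate_to_datetime(stamp).timestamp() == 0
```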

phil_20686