Best way to download all artifacts listed in an HTML5 cache.manifest file?

Question

I am trying to see how an HTML5 app works, but saving the page from a WebKit browser (Chrome, Safari) captures some, but not all, of the resources listed in cache.manifest. Is there a library or piece of code that will parse the cache.manifest file and download all of its resources (images, scripts, CSS)?

(original code moved to answer... noob mistake >.<)

There is a lack of "questioness" in your question, but the code looks like working Python, though some parts could be simplified. Also, there are libraries called urlgrabber and requests which could make the file saving process easier. - Mikko Ohtamaa, Sep 13 '11 at 00:39
Thank you for the feedback, Mikko. I will check the libraries you mentioned for further development. So basically, you don't know of a library to download the list of resources inside a cache.manifest file. Unfortunately, the keyword that matters most in this post is "cache.manifest", which evidently hasn't been added as a tag yet, and without a score of 1500 I can't add it. >.< So this question will be viewed by people who are watching the "python" tag instead of people who are interested in HTML5 "cache.manifest". - rockhowse, Sep 13 '11 at 13:48
Because parsing and downloading a cache manifest is only 50 lines of Python code, I don't see why anyone should build a niche library for just that purpose :) - Mikko Ohtamaa, Sep 13 '11 at 18:23
This is true =P Considering "cache.manifest" isn't a tag, I doubt a lot of people are actually using this piece of HTML5 in their websites, or if they are, they don't need to download the content from the site because they are the ones who created the file. It's of more interest to people who are trying to analyze an existing site. I could see it being useful as an HTML5 crawler app or something along those lines. This level of local caching is different from the standard browser or proxy caching you are used to when dealing with HTTP. - rockhowse, Sep 13 '11 at 18:57

score 0 · Accepted Answer · answered Dec 21 '11 at 20:17

I originally posted this as part of the question (no newbie Stack Overflow poster EVER does this ;), and I am moving it here since there was a resounding lack of answers. Here you go:

I was able to come up with the following Python script to do so, but any input would be appreciated =) (This is my first stab at Python code, so there might be a better way.)

import os
import urllib2
import urllib

cmServerURL = 'http://<serverURL>:<port>/<path-to-cache.manifest>'

# download file code taken from stackoverflow
# http://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python
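def loadURL(url, dirToSave):
        # note: dirToSave is the full local path of the file to write, not a directory
        file_name = url.split('/')[-1]
        u = urllib2.urlopen(url)
        f = open(dirToSave, 'wb')
        meta = u.info()
        # Content-Length is optional, so treat a missing header as "size unknown" (0)
        lengths = meta.getheaders("Content-Length")
        file_size = int(lengths[0]) if lengths else 0
        print "Downloading: %s Bytes: %s" % (file_name, file_size or 'unknown')

        file_size_dl = 0
        block_sz = 8192
        while True:
                buffer = u.read(block_sz)
                if not buffer:
                        break

                file_size_dl += len(buffer)
                f.write(buffer)
                # only show a percentage when the total size is known
                if file_size:
                        status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
                else:
                        status = r"%10d" % file_size_dl
                status = status + chr(8)*(len(status)+1)
                print status,

        f.close()
        u.close()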

# download the cache.manifest file
# since this response doesn't include the Content-Length header we will use a different api =P
# (cmServerURL must end with '/' for the concatenations below to form valid URLs)
urllib.urlretrieve(cmServerURL + 'cache.manifest', './cache.manifest')

# open the cache.manifest and go through line-by-line checking for the existence of files
f = open('cache.manifest', 'r')
for line in f:
        fileName = line.strip()
        # only lines containing a '/' are treated as relative resource paths;
        # comment lines (starting with '#') are skipped
        if '/' in fileName and not fileName.startswith('#'):
                # if the file doesn't exist, let's download it
                if not os.path.exists(fileName):
                        print 'NOT FOUND: ' + fileName
                        dirName = os.path.dirname(fileName)
                        print 'checking directory: ' + dirName
                        if not os.path.exists(dirName):
                                os.makedirs(dirName)
                        else:
                                print 'directory exists'
                        print 'downloading file: ' + cmServerURL + fileName
                        loadURL(cmServerURL + fileName, fileName)
f.close()
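
For anyone on Python 3, here is a minimal sketch of the same idea using only the standard library. It is not a line-for-line port of the script above. It assumes the usual manifest layout (a CACHE MANIFEST first line, # comments, and CACHE:, NETWORK:, FALLBACK: and SETTINGS: section headers), resolves every entry against the manifest URL with urljoin, and skips NETWORK: entries. The names parse_manifest, local_path and download_all are made up for illustration, and mirroring the URL path under a destination directory is just one possible layout.

```python
import os
import urllib.request
from urllib.parse import urljoin, urlparse

# Section headers used by the HTML5 application cache manifest format.
SECTION_HEADERS = {"CACHE:", "NETWORK:", "FALLBACK:", "SETTINGS:"}


def parse_manifest(text, manifest_url):
    """Return absolute URLs of the resources the manifest asks to be cached."""
    urls = []
    section = "CACHE:"  # entries before any header are explicit cache entries
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "CACHE MANIFEST":
            continue
        if line in SECTION_HEADERS:
            section = line
            continue
        if section == "CACHE:":
            candidate = line
        elif section == "FALLBACK:":
            # "<namespace> <fallback resource>": only the second token is cached
            parts = line.split()
            if len(parts) != 2:
                continue
            candidate = parts[1]
        else:
            continue  # NETWORK: and SETTINGS: lines are not downloadable resources
        absolute = urljoin(manifest_url, candidate)
        if absolute not in urls:
            urls.append(absolute)
    return urls


def local_path(url):
    """Map a URL to a relative file path (assumed layout: mirror the URL path)."""
    path = urlparse(url).path.lstrip("/")
    return path or "index.html"


def download_all(manifest_url, dest_dir="."):
    """Fetch the manifest, then save every cached resource under dest_dir."""
    with urllib.request.urlopen(manifest_url) as resp:
        text = resp.read().decode("utf-8")
    for url in parse_manifest(text, manifest_url):
        target = os.path.join(dest_dir, local_path(url))
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, target)
```

Call download_all with the full URL of the manifest and a destination directory. The sketch does no retrying and no hardening of the generated file paths, so treat it as a starting point for inspecting a site you trust rather than a finished tool.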
