
I'd like to save a webpage for offline viewing, exactly like Firefox's File > Save Page As ... menu option, which saves the complete webpage with all static content in a subfolder.

I found the code below in one of the Stack Overflow answers. When I run it, it downloads a portion of the page but doesn't finish completely. I've compared the result with what Firefox downloads, and Firefox does a better job of saving an almost complete webpage.

So what is the best way to download a complete webpage (just the main level, not following links), along with the necessary static content (CSS, JS, other HTML)?

I've also seen some people suggest using Selenium in Python, but that doesn't work for me, as I intend to do this for a large number of pages.

Thanks in advance!

import os
import re
import sys
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def saveFullHtmlPage(url, pagepath='page', session=None, html=None):
    """Save a web page's HTML and its supported contents.
        * pagepath : path-to-page
        It creates a file `'path-to-page'.html` and a folder `'path-to-page'_files`.
    """
    def savenRename(soup, pagefolder, session, url, tag, inner):
        if not os.path.exists(pagefolder): # create the folder only once
            os.mkdir(pagefolder)
        for res in soup.find_all(tag):   # images, css, etc.
            if res.has_attr(inner): # the inner attribute (file reference) must exist
                try:
                    filename, ext = os.path.splitext(os.path.basename(res[inner])) # get name and extension
                    filename = re.sub(r'\W+', '', filename) + ext # strip special chars from the name
                    fileurl = urljoin(url, res.get(inner))
                    filepath = os.path.join(pagefolder, filename)
                    # rewrite the html reference so the html file and its folder can be moved anywhere
                    res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                    if not os.path.isfile(filepath): # not downloaded yet
                        with open(filepath, 'wb') as file:
                            filebin = session.get(fileurl)
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)
    if session is None: # avoid a mutable default argument; create one session per call
        session = requests.Session()
    if not html:
        html = session.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    path, _ = os.path.splitext(pagepath)
    pagefolder = path + '_files' # folder for the page's static contents
    tags_inner = {'img': 'src', 'link': 'href', 'script': 'src'} # tags and the attributes to grab
    for tag, inner in tags_inner.items(): # save resource files and rewrite their refs
        savenRename(soup, pagefolder, session, url, tag, inner)
    with open(path + '.html', 'wb') as file: # save the modified html doc
        file.write(soup.prettify('utf-8'))
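
For reference, a minimal call looks like this (the URL is just a placeholder):

saveFullHtmlPage('https://example.com', 'example')

which should produce `example.html` plus an `example_files` folder next to it.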
Zee Kay
  • Interesting. So does using Selenium solve that problem? – Zee Kay Apr 15 '23 at 20:16
  • Look into using `wget`. That's what it was designed for. It's a command-line program (see the sketch below). – MattDMo Apr 15 '23 at 20:17
  • `wget` should help: https://pypi.org/project/wget/ – Dmitriy Neledva Apr 15 '23 at 20:19
  • For Selenium you just need to have the browser wait for the page to fully load; there's a command for that, and it should solve your incomplete-page problem (see the sketch below). – Ahmed AEK Apr 15 '23 at 20:21
  • @DmitriyNeledva that project was last updated in 2015. I suggest using the [GNU utility](https://www.gnu.org/software/wget/) instead. If you google, there's also a Windows version if you don't have MSYS or Cygwin or whatever. It's installed on many Linux distributions by default; if not, it's in the package management system. – MattDMo Apr 15 '23 at 20:23
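
The `wget` suggestion above corresponds to the single-page example in the GNU wget manual; a sketch (the URL is a placeholder):

# fetch one page plus everything needed to display it offline:
# -p grabs page requisites (images/CSS/JS), -k rewrites links for local
# viewing, -E adds .html extensions, -H allows requisites hosted on other
# hosts (CDNs), -K keeps a backup of each file before link conversion
wget -E -H -k -K -p https://example.com/page.html

And for the Selenium comment, a minimal sketch of waiting until the page has fully loaded before grabbing the HTML (assumes Firefox with geckodriver available; the URL is again a placeholder):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()  # assumes geckodriver is on PATH
driver.get("https://example.com")  # placeholder URL
# wait (up to 30 s) until the browser reports the document fully loaded
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
html = driver.page_source  # rendered HTML, including JS-injected content
driver.quit()

The resulting `html` string could then be passed to `saveFullHtmlPage` via its `html=` parameter, so the page itself isn't fetched twice.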

0 Answers