I'd like to be able to download an HTML page (let's say this actual question!):

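    import urllib2

    # Fetch the raw HTML of the page
    f = urllib2.urlopen('https://stackoverflow.com/questions/33914277')
    content = f.read()       # soup = BeautifulSoup(content) could be useful?
    f.close()

    # Write the bytes unchanged to a local file
    with open("mypage.html", 'wb') as g:
        g.write(content)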

so that it is displayed the same way locally as online. Currently, this is the (bad) result:


(screenshot not preserved; source: gget.it)

Thus, one needs to download the CSS and modify the HTML itself so that it points to this local CSS file... and the same for images, etc.

How can this be done? (I think there should be something simpler than this answer, which doesn't handle CSS. Is there a library for it?)

Basj
  • Possible duplicate of [Download image file from the HTML page source using python?](http://stackoverflow.com/questions/257409/download-image-file-from-the-html-page-source-using-python) – K DawG Nov 25 '15 at 10:42
  • @KDawG : I have linked this question in my own question, haven't you seen? The difficult CSS part is not handled. – Basj Nov 25 '15 at 10:43
  • With today's use of javascript it's unreasonable to download everything locally as you cannot know what resources the site has. – simonzack Nov 25 '15 at 10:50
  • @simonzack that's why I would like to limit the "scraping" to CSS and images. Such that, for example, this [precise page](http://stackoverflow.com/questions/33914277) could be saved locally. – Basj Nov 25 '15 at 10:53
  • @Basj Javascript is able to load any css or image it likes. – simonzack Nov 25 '15 at 10:54
  • I can't seem to understand your logic here, don't reinvent the wheel, just use the solution that has already been provided. – K DawG Nov 25 '15 at 12:26
  • Take a look at [this answer](http://stackoverflow.com/a/4200547/408556) – reubano Nov 26 '15 at 08:39
  • But my personal favorite is [httrack](https://www.httrack.com/) – reubano Nov 26 '15 at 08:39

1 Answer

CSS and image files are not blocked by the browser's cross-origin policy, so your local HTML can still refer to them while they stay on the remote server. The problem is unresolved URIs. In the HTML head section you have something like this:

    <head>
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

        <link rel="stylesheet" type="text/css" href="/assets/8943fcf6/select.css" />
        <link href="/css/media.css" rel="stylesheet" type="text/css">
        <script type="text/javascript" src="/assets/jquery.yii.js"></script>
        <script type="text/javascript" src="/assets/select.js"></script>
    </head>

A root-relative path such as /css/media.css is resolved against the site's base address, e.g. http://example.com. Opened from a local file, it no longer resolves, so the href value in your local copy of the HTML must become http://example.com/css/media.css. Parse the HTML and prepend the base address to each such URL in the local code:

    <head>
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

        <link rel="stylesheet" type="text/css" href="http://example.com/assets/8943fcf6/select.css" />
        <link href="http://example.com/css/media.css" rel="stylesheet" type="text/css">
        <script type="text/javascript" src="http://example.com/assets/jquery.yii.js"></script>
        <script type="text/javascript" src="http://example.com/assets/select.js"></script>
    </head>

Use whatever tool you like for the rewriting (JavaScript, PHP, ...).
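Since the question is in Python, here is a minimal sketch of that rewriting step using only the standard library (Python 3, `urllib.parse.urljoin`). The helper name `absolutize` and the regular expression are illustrative assumptions, not part of any library. A regex is a rough tool for HTML, so a real parser would be more robust on messy markup.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with single or double quotes.
# Rough by design: it also touches attributes such as data-src.
ATTR_RE = re.compile(r'''\b(href|src)\s*=\s*(["'])(.*?)\2''', re.IGNORECASE)


def absolutize(html, base_url):
    """Return html with every href/src value resolved against base_url.

    `absolutize` is a hypothetical helper written for this example.
    """
    def repl(match):
        attr, quote, value = match.group(1), match.group(2), match.group(3)
        # urljoin leaves already-absolute URLs unchanged.
        return "{}={}{}{}".format(attr, quote, urljoin(base_url, value), quote)

    return ATTR_RE.sub(repl, html)


sample = '<link href="/css/media.css" rel="stylesheet" type="text/css">'
print(absolutize(sample, "http://example.com/"))
```

Applied to the content of the downloaded page before writing it to disk, this should leave the local copy pointing at the remote stylesheets and scripts.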

Update

The local file also contains image references throughout the body section, so you'll need to resolve those as well.
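To see which image references the body contains and what they resolve to, a small sketch with the standard library's `html.parser` may help (Python 3). The class name `ImageCollector`, the page URL, and the sample markup are assumptions made up for illustration. Note that paths without a leading slash resolve against the page's own URL, not the site root.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the absolute URL of every <img src=...> in a document.

    Hypothetical helper for this example, not a library class.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve root-relative and page-relative paths alike.
                self.images.append(urljoin(self.base_url, src))


collector = ImageCollector("http://example.com/questions/1")
collector.feed('<body><img src="/img/logo.png"><img src="pics/a.gif" alt=""></body>')
print(collector.images)
```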

Igor Savinkin