
I can bring up a web page, no problem. I can save the web page as HTML, no problem. What I need is to save it as MHT, so that I get all the HTML that stays hidden when the page is saved as plain HTML. In my research I have found absolutely nothing on how to save as MHT using Python. I did try writing an .mht file with the standard code for saving as HTML, but that simply doesn't work. I'm not surprised, but it was worth a shot.

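import urllib.request

url = 'https://www.thewebsite.com'
html = urllib.request.urlopen(url).read()  # raw response body, as bytes

# Write the bytes unchanged; str(html) would write the "b'...'" repr instead.
# Note: this only stores plain HTML under an .mht name, not a real MHT archive.
with open('websitetest.mht', 'wb') as m:
    m.write(html)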

The site I'm trying to save has 'hidden code' that comes through when the page is saved as MHT, but not when it is saved as HTML. That is why I'm trying to save as MHT: I want all of the code so I can go through it and pull out what I need to compile a database.

confused

2 Answers


There is a very handy GitHub project written in Python 2.7 (it needs a few simple modifications to run on Python 3.4). The project has code for packing and unpacking MHT files. I think this is what you are looking for:

Un/packs an MHT (MHTML) archive into/from separate files, writing/reading them in directories to match their Content-Location.
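
To illustrate what packing and unpacking means here: an MHT file is a MIME multipart/related message in which each part carries a Content-Location header. The following is only a minimal sketch using Python 3's standard email package, not the code from the GitHub project above. The function names pack_mht and unpack_mht are hypothetical, and the sketch handles HTML text parts only; it does not download images or other resources, and a browser may expect additional headers before it opens the result.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack_mht(pages):
    """Pack {content_location: html_text} into MHTML bytes (hypothetical helper)."""
    archive = MIMEMultipart('related')
    for location, html in pages.items():
        part = MIMEText(html, 'html', 'utf-8')
        # Content-Location records where each part originally came from.
        part['Content-Location'] = location
        archive.attach(part)
    return archive.as_bytes()


def unpack_mht(data):
    """Return {content_location: decoded_bytes} for every non-container part."""
    archive = message_from_bytes(data)
    parts = {}
    for part in archive.walk():
        if part.is_multipart():
            continue  # skip the multipart container itself
        # decode=True undoes the base64 or quoted-printable transfer encoding.
        parts[part['Content-Location']] = part.get_payload(decode=True)
    return parts


# Round trip with the placeholder URL from the question.
example = {'https://www.thewebsite.com/': '<html><body>hello</body></html>'}
mht_bytes = pack_mht(example)
unpacked = unpack_mht(mht_bytes)
```

For a real archive, read the file in binary mode and pass its bytes to unpack_mht; the keys of the returned dictionary can then be mapped to file paths, which is roughly what the project description above refers to.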

kajarigd

I recently came across the same issue: I wanted to convert an HTML page to MHT format.

I followed Tim Golden's Python notes and was able to do it using win32com. http://timgolden.me.uk/python/win32_how_do_i/create-an-mhtml-archive.html

import win32com.client as win32

# Issues found:
#   1. For local files, pass the path in URL form, e.g.
#      file://directory01/directory02/index.html, with %20 encoding for special characters.
#   2. Do the same for files referenced inside the HTML file, e.g.
#      src="file://reference/directory01/smiley.png".
#   3. Rare: if an alt attribute appears alongside src, images are not embedded into the
#      MHT correctly; try removing the alt attribute from the page before calling CreateMHTMLBody.
URL = r'C:\WorkSpace\chetan_index.html'

message = win32.gencache.EnsureDispatch('CDO.Message')
message.CreateMHTMLBody(URL, 0)  # 0: suppress nothing; download all images and other resources
stream = win32.gencache.EnsureDispatch(message.GetStream())
stream.SaveToFile(r'C:\temp\saved_mht.mht', 2)  # 2: overwrite an existing file; 1: do not overwrite
stream.Close()
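
For the first issue noted in the comments, a local Windows path can be turned into URL form with the standard library. This is a small sketch rather than part of the tested solution: the helper name windows_path_to_file_url and the example path with a space are hypothetical, and it produces the common file:/// form, which I am assuming CreateMHTMLBody accepts.

```python
from urllib.parse import quote


def windows_path_to_file_url(path):
    """Turn a Windows path such as C:\\dir\\file.html into a file:/// URL."""
    # Use forward slashes, then percent-encode everything except '/' and ':'.
    return 'file:///' + quote(path.replace('\\', '/'), safe='/:')


# Hypothetical path containing a space, to show the %20 encoding.
url = windows_path_to_file_url(r'C:\Work Space\chetan_index.html')
print(url)  # file:///C:/Work%20Space/chetan_index.html
```

The resulting string would be passed as the first argument to CreateMHTMLBody in place of the raw path.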
Chetan