Many tools can export a .MHT file. I want a way to convert that single file into a collection of files (an HTML file plus the relevant images and CSS files) that I could then upload to a web host and that all browsers could consume. Does anybody know of any tools, libraries, or algorithms to do this?
-
What programming language were you intending to use? – IgorGanapolsky Nov 03 '17 at 14:24
8 Answers
Well, you can open the .MHT file in IE and then save it as a web page. I tested this with this page, and even though it looked odd in IE (it's IE after all), it saved and then opened fine in Chrome (as in, it looked like it should).
Barring that method, looking at the file itself, text blocks are saved in the file as-is, and all other content is saved in Base64. Each item of content is preceded by:
[Boundary]
Content-Type: [Mime Type]
Content-Transfer-Encoding: [Encoding Type]
Content-Location: [Full path of content]
Where [Mime Type], [Encoding Type], and [Full path of content] are variable. [Encoding Type] appears to be either base64 or quoted-printable. [Boundary] is defined in the beginning of the .MHT file like so:
From: <Saved by WebKit>
Subject: converter - How can you programmatically (or with a tool) convert .MHT mhtml files to regular HTML and CSS files? - Stack Overflow
Date: Fri, 9 May 2013 13:53:36 -0400
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----=_NextPart_000_0C08_58653ABB.B67612B7"
Using that, you could make your own file parser if needed.
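Since the layout described above is ordinary MIME, a hand-written boundary parser may not be necessary. Below is a minimal sketch using Python's standard `email` package; the function name `split_mht` and the tiny `SAMPLE` document are made up for illustration, and real .MHT files may need extra handling (charsets, missing Content-Location headers).

```python
from email import message_from_bytes, policy


def split_mht(raw: bytes):
    """Return one (content_location, content_type, decoded_bytes) tuple per part."""
    message = message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # the multipart container carries no payload of its own
        parts.append((
            str(part.get("Content-Location", "")),
            part.get_content_type(),
            part.get_payload(decode=True),  # undoes base64 / quoted-printable
        ))
    return parts


# Hypothetical two-part document: one quoted-printable HTML part, one base64 image.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----=_Demo"\r\n'
    b"\r\n"
    b"------=_Demo\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: http://example.com/index.html\r\n"
    b"\r\n"
    b"<p class=3D\"x\">hi</p>\r\n"
    b"------=_Demo\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/a.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"------=_Demo--\r\n"
)

if __name__ == "__main__":
    for location, content_type, data in split_mht(SAMPLE):
        print(location, content_type, len(data))
```

Each tuple can then be written to disk or inspected; for a real file, pass the bytes read from the .MHT instead of `SAMPLE`.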

-
So IE will then create a folder and save the images separately, etc.? I wonder if you can automate IE to do this with its COM object? – klumsy May 09 '13 at 22:06
-
Yep, IE creates a folder with all the images and whatnot. The COM object shows a `Navigate2` function and event handlers (for completion and such), but I couldn't find a save function in its reference. Doesn't mean it's not there, just that I couldn't find it. – XNargaHuntress May 10 '13 at 13:06
-
I played with this more. I can load the file and save it by automating keypresses in the Save As dialog, which is hacky and fragile. However, IE wants to save it as MHT rather than as complete HTML (saving a live site as a complete page works fine), and I can't find a way to specify which Save As option to use via ExecWB. So the best thing to do is probably to process the MHT with code, or to try some other sort of automation with Selenium, Firefox or Chrome automation, extensions, or something similar. – klumsy May 15 '13 at 06:17
-
Using part of this as a start, I did make a basic markup-only parser using a short LINQ statement at http://www.poconosystems.com/software-development/converting-mhtml-to-html/ – Yuck Sep 19 '13 at 12:53
Besides IE and MS Word, there's an open-source cross-platform program called 'mht2html', first written in 2007 and last updated in 2016. It has both a GUI and a terminal interface.
I haven't tested it yet but it seems to have received good reviews.

An MHT file is essentially MIME, so it's possible to use Chilkat.Mime or the completely free System.Net.Mime components to access its internal structure. If, for example, the MHT contains images, they can be replaced with base64 data URIs in the output HTML.
Imports System.IO
Imports System.Linq
Imports System.Collections.Specialized
Imports HtmlAgilityPack
Imports Fizzler.Systems.HtmlAgilityPack
Imports Chilkat

Public Function ConvertMhtToHtml(ByVal mhtFile As String) As String
    Dim chilkatWholeMime As New Chilkat.Mime
    'Load the MIME document
    chilkatWholeMime.LoadMimeFile(mhtFile)
    'Get the HTML string, assumed to be the first MIME part
    Dim html As String = chilkatWholeMime.GetPart(0).GetBodyDecoded
    'Collection mapping each image URL to its data URI
    Dim allImages As New NameValueCollection
    'Iterate through the remaining MIME parts
    For i = 1 To chilkatWholeMime.NumParts - 1
        Dim m As Chilkat.Mime = chilkatWholeMime.GetPart(i)
        'Keep only base64-encoded images
        If m.IsImage AndAlso m.Encoding = "base64" Then
            allImages.Add(m.GetHeaderField("Content-Location"), "data:" + m.ContentType + ";base64," + m.GetBodyEncoded)
        End If
        m.Dispose()
    Next
    chilkatWholeMime.Dispose()
    'Replace the image URLs in the HTML with data URIs
    Dim htmlDoc As New HtmlDocument
    htmlDoc.LoadHtml(html)
    Dim docNode As HtmlNode = htmlDoc.DocumentNode
    For i = 0 To allImages.Count - 1
        Dim keyURL As String = allImages.GetKey(i) 'Saved URL from the MHT
        Dim imgsrc As String = allImages.GetValues(i)(0) 'Data URI as a base64 string
        'Select all images whose src attribute equals the saved URL
        For Each img As HtmlNode In docNode.QuerySelectorAll("img[src='" + keyURL + "']").ToArray
            img.SetAttributeValue("src", imgsrc)
        Next
        'Select all elements whose style attribute contains the saved URL
        '(*= is a substring match; ~= would only match whole space-separated words)
        For Each styled As HtmlNode In docNode.QuerySelectorAll("[style*='" + keyURL + "']").ToArray
            Dim modStyle As String = Strings.Replace(styled.GetAttributeValue("style", String.Empty), keyURL, imgsrc, 1, 1, CompareMethod.Text)
            styled.SetAttributeValue("style", modStyle)
        Next
    Next
    'Get the final HTML
    Using tw As New StringWriter()
        htmlDoc.Save(tw)
        html = tw.ToString
    End Using
    Return html
End Function
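For readers without Chilkat, the data URI substitution itself is easy to sketch in Python. This is a simplified take rather than a port of the function above: it does plain string replacement instead of DOM queries, and the name `inline_images` and the shape of the `images` dictionary are assumptions made for the example.

```python
import base64


def inline_images(html: str, images: dict) -> str:
    """Replace each image URL found in `html` with an equivalent data: URI.

    `images` maps a Content-Location URL to a (mime_type, raw_bytes) pair.
    """
    # Longest URLs first, so a URL that is a prefix of another is not hit early.
    for location in sorted(images, key=len, reverse=True):
        mime_type, data = images[location]
        encoded = base64.b64encode(data).decode("ascii")
        html = html.replace(location, f"data:{mime_type};base64,{encoded}")
    return html


if __name__ == "__main__":
    page = '<img src="http://example.com/a.png">'
    print(inline_images(page, {"http://example.com/a.png": ("image/png", b"\x89PNG\r\n\x1a\n")}))
```

Because it replaces every occurrence of the URL, this also covers URLs inside `style` attributes, at the cost of possibly touching matching text outside of attributes.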
-
It's VB.NET. It uses the open-source package "Fizzler.Systems.HtmlAgilityPack" and the commercial package "Chilkat.Mime". But Chilkat can be replaced by the "System.Net.Mime" classes. – Zagavarr Nov 13 '17 at 12:37
I think that @XGundam05 is correct. Here is what I did to make it work.
I started with a Windows Forms project in Visual Studio, added a WebBrowser control to the form, and then added two buttons. Then I added this code:
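private void button1_Click(object sender, EventArgs e)
{
    webBrowser1.ShowSaveAsDialog();
}

private void button2_Click(object sender, EventArgs e)
{
    // Uri needs an absolute path; a bare relative file name throws UriFormatException.
    webBrowser1.Url = new Uri(System.IO.Path.GetFullPath("localfile.mht"));
}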
You should be able to take this code, add in a list of files, and process each one with a `foreach`. The `webBrowser` contains a method called `ShowSaveAsDialog()`, which allows one to save as .mht, as just the HTML, or as the complete page.
EDIT: You could use the webBrowser's Document and scrape the information at this point, by adding a richTextBox and a public property as per MS here: http://msdn.microsoft.com/en-us/library/ms171713.aspx
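public string Code
{
    get { return richTextBox1.Text ?? ""; }
    set { richTextBox1.Text = value; }
}

private void button2_Click(object sender, EventArgs e)
{
    // Navigation is asynchronous, so the document is scraped in DocumentCompleted.
    webBrowser1.Url = new Uri(System.IO.Path.GetFullPath("localfile.mht"));
}

// Wire this up once, e.g. in the form constructor:
// webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (webBrowser1.Document == null)
    {
        return;
    }
    HtmlElementCollection elems = webBrowser1.Document.GetElementsByTagName("HTML");
    if (elems.Count == 1)
    {
        Code = elems[0].OuterHtml;
        foreach (HtmlElement img in webBrowser1.Document.Images)
        {
            // look for pictures to save, e.g. via img.GetAttribute("src")
        }
    }
}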

-
Per your solutions and this http://stackoverflow.com/questions/872750/saving-a-web-page-from-ie-using-powershell it doesn't seem possible without the Save As dialog popping up. I was hoping to be able to automate this en masse. – klumsy May 14 '13 at 22:17
-
With the edit you may be able to come up with a process to scrape and save the HTML and images. – CaptainBli May 14 '13 at 23:35
So automating IE was difficult and not usable end to end, so I think writing code that does the conversion is the way to go. On GitHub I found this Python project, which may be good:
https://github.com/Modified/MHTifier
http://decodecode.net/elitist/2013/01/mhtifier/
If I have time I'll try to do something similar in PowerShell.
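As a rough idea of what such code has to do once the parts are decoded, here is a sketch in Python that writes them to a folder and points the HTML at the local copies. It is not taken from MHTifier; `local_name`, `write_site`, the flat numbered file names, and the `(content_location, content_type, data)` tuple layout are all assumptions for the example. It only rewrites absolute URLs that appear verbatim in the HTML, so relative links and URLs inside CSS files are left untouched.

```python
import posixpath
import re
from pathlib import Path
from urllib.parse import urlsplit


def local_name(location: str, index: int) -> str:
    """Derive a flat file name from a Content-Location URL."""
    name = posixpath.basename(urlsplit(location).path) or "part"
    return f"{index:03d}_{name}"


def write_site(parts, out_dir):
    """Write decoded parts to out_dir; the first text/html part becomes index.html.

    `parts` is a list of (content_location, content_type, data_bytes) tuples.
    Returns the mapping from original URL to local file name.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    html = None
    mapping = {}
    for i, (location, content_type, data) in enumerate(parts):
        if html is None and content_type == "text/html":
            html = data.decode("utf-8", errors="replace")  # assumes UTF-8 markup
            continue
        name = local_name(location, i)
        (out / name).write_bytes(data)
        if location:
            mapping[location] = name
    if html is None:
        raise ValueError("no text/html part found")
    if mapping:
        # Single pass over the HTML, longest URLs first, so replacements never overlap.
        urls = sorted(mapping, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(url) for url in urls))
        html = pattern.sub(lambda m: mapping[m.group(0)], html)
    (out / "index.html").write_text(html, encoding="utf-8")
    return mapping
```

Calling `write_site(parts, "site")` would leave `site/index.html` next to the numbered resource files, ready to upload as a folder.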

Here's one approach using the `mht2html` Java library:
First, you should add the `mht2html` dependency to your project. If you're using Maven, add this to your `pom.xml`:
<dependency>
    <groupId>com.github.kallestenova</groupId>
    <artifactId>mht2html</artifactId>
    <version>1.0</version>
</dependency>
Then, in your Java or Kotlin code, use the `Mht2Html` class to convert the MHT file to HTML:
import com.github.kallestenova.mht2html.Mht2Html;

public class MhtConverter {
    public static void main(String[] args) {
        Mht2Html mht2Html = new Mht2Html();
        mht2Html.convert("input.mht", "output.html", "imagesDirectory");
    }
}
This will convert the `input.mht` file into an `output.html` file, and it will extract all images into the `imagesDirectory`. Make sure you replace `input.mht`, `output.html`, and `imagesDirectory` with your actual file and directory names.
Remember that the `mht2html` library will only do the MHT to HTML conversion and extraction of images. If there are CSS files or other resources embedded in your MHT file, you'll need to manually extract them or use another library capable of handling these file types.
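For the embedded CSS mentioned above, one possible way to pull the stylesheets out is with Python's standard `email` package, since an MHT file is a MIME document. This is only a sketch: the function name `extract_stylesheets` is made up, and it assumes each stylesheet is stored as its own `text/css` part with a Content-Location header.

```python
from email import message_from_bytes, policy


def extract_stylesheets(raw: bytes) -> dict:
    """Map each text/css part's Content-Location to its decoded text."""
    message = message_from_bytes(raw, policy=policy.default)
    sheets = {}
    for part in message.walk():
        if part.get_content_type() == "text/css":
            # get_content() undoes the transfer encoding and applies the charset.
            sheets[str(part.get("Content-Location", ""))] = part.get_content()
    return sheets
```

Each value can then be written to a `.css` file next to the converted HTML.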
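If you are using .NET, keep in mind that MHT files are structured like MIME email messages, so a MIME parser can extract any CSS or other files embedded in the MHT file. The `System.Net.Mail.MailMessage` class is designed for composing and sending messages and offers no public method for loading one from a file, so a third-party parser such as MimeKit is the more practical choice.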
Please also keep in mind that while this method works for many MHT files, there may be variations or complexities in some MHT files that this method may not handle perfectly. It's always best to thoroughly test any solution with your specific data to ensure it meets your needs.
This method should give you a consumable HTML file along with associated images that you can upload to a web host.

-
Thank you. Actually I'm using Gradle in Android Studio. Is the source code of that library available? I haven't found it on GitHub. – xralf Jul 20 '23 at 13:04
Firefox has an embedded tool. Go to the menu (press Alt if it's hidden) and choose `File->Convert saved pages`.
