Because you mentioned that you want to implement your solution in Java, I wrote a simple C# solution that can be easily converted to Java.
The following code reads the robot.mhtml file and dumps the content of each part to separate files in the out/ directory:
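using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// ISO-8859-1 maps each byte to one character, so the raw bytes survive the round trip.
Encoding encoding = Encoding.GetEncoding("ISO-8859-1");
string mhtml = File.ReadAllText("./robot.mhtml", encoding);

// One match per part: "name" is the last path segment of Content-Location,
// "content" is everything up to the next boundary line.
MatchCollection matches = Regex.Matches(
    mhtml,
    @"Content-Location: .*/(?<name>.*)\n\r\n(?<content>(\n|.)+?)(?=\n------MultipartBoundary--)"
);

Directory.CreateDirectory("out");
foreach (Match match in matches)
{
    // On CRLF input, .* can capture a trailing \r in the name, so strip it.
    string name = match.Groups["name"].Value.TrimEnd('\r');
    File.WriteAllText("out/" + name, match.Groups["content"].Value, encoding);
}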
I tested it, and it works.

Let me provide a complete explanation of the Regex for you:
- The regex extracts each part's name (the last path segment of the Content-Location header) and its content.
- Without the Singleline flag, . matches every character except \n. To match everything, including newlines, the pattern therefore uses (\n|.).
- As in the HTTP protocol, there is a single additional \r\n (an empty line) between the headers and the content.
- (?<group_name>pattern) creates a regex group named group_name that matches pattern, so we can ask each match to return only these specific parts of the complete match.
- +? makes the quantifier lazy (non-greedy). With a plain +, the content would be captured up to the last \n------MultipartBoundary--, so only one file would be extracted. With +?, the capture stops at the first occurrence instead.
- (?=sequence) is a positive lookahead: .+(?=sequence) matches up to the point where sequence begins, without making sequence part of the match.
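Since you are targeting Java, here is a small sketch of the greedy versus lazy difference combined with a lookahead. The markers start: and ;end and the class name LazyMatchDemo are made up for illustration; they only stand in for the header and the boundary line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyMatchDemo {
    // Returns the "body" group of every match of regex in input.
    static List<String> bodies(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group("body"));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical input: two "parts", each closed by ";end".
        String input = "start:one;end start:two;end";

        // Lazy: each body stops before the first ";end" that follows it.
        System.out.println(bodies("start:(?<body>.+?)(?=;end)", input));
        // prints [one, two]

        // Greedy: the body runs on to the last ";end", so there is one match.
        System.out.println(bodies("start:(?<body>.+)(?=;end)", input));
        // prints [one;end start:two]
    }
}
```

In both cases the lookahead keeps ;end out of the captured text; only the quantifier decides which ;end the match stops at.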
Some other notes:
- HTTP messages are encoded with ISO-8859-1, and this encoding maps every byte to exactly one character. So, you should read and write the files using this encoding to keep the bytes intact.
- I tested my solution on a file that I got by visiting the website you mentioned and downloading the page using Chrome on Android.
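- To achieve the same result in Java, you should take into account the default flags and behaviors of Java's regex engine. They are mostly similar to those in C#, but not identical: for example, by default Java's . excludes \r as well as \n, so a pattern that relies on . matching \r needs to be adjusted.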
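As a starting point for the Java version, here is a rough sketch of how the port could look. It is not a line-by-line translation: the class name MhtmlSplitter and the method splitParts are my own, the pattern spells out \r?\n explicitly because Java's . does not match \r by default, and it uses Pattern.DOTALL with .+? instead of (\n|.)+?, since repeated alternation groups are, as far as I know, prone to StackOverflowError on large inputs in Java. It assumes, like the C# code, that Content-Location is the last header of each part and that boundaries start with ------MultipartBoundary--.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MhtmlSplitter {
    // "name" is the last path segment of Content-Location; "content" is
    // everything after the empty line, up to (not including) the line break
    // before the next boundary. DOTALL lets . match line terminators.
    private static final Pattern PART = Pattern.compile(
        "Content-Location: [^\\r\\n]*/(?<name>[^\\r\\n]*)\\r?\\n\\r?\\n"
            + "(?<content>.+?)(?=\\r?\\n------MultipartBoundary--)",
        Pattern.DOTALL);

    // Maps each part name to its content, in document order.
    // Parts with the same name overwrite each other in this sketch.
    static Map<String, String> splitParts(String mhtml) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = PART.matcher(mhtml);
        while (m.find()) {
            parts.put(m.group("name"), m.group("content"));
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Read and write as ISO-8859-1 so every byte round-trips unchanged.
        String mhtml = Files.readString(Path.of("robot.mhtml"), StandardCharsets.ISO_8859_1);
        Path outDir = Files.createDirectories(Path.of("out"));
        for (Map.Entry<String, String> part : splitParts(mhtml).entrySet()) {
            Files.writeString(outDir.resolve(part.getKey()), part.getValue(),
                    StandardCharsets.ISO_8859_1);
        }
    }
}
```

Files.readString and Files.writeString need Java 11 or later. One small difference from the C# version: here the \r before the boundary line is left out of the content rather than captured with it.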
In addition to coding and logging, you can test your customized regex in a .NET-specific regex tester to observe the results and captured groups.
