Coding of images in Blink archive

Question

I have a Blink archive (in MHT format) saved from the Chrome browser. I'm trying to convert this section

Content-Type: image/jpeg
Content-Transfer-Encoding: binary
Content-Location: https://some_url

ÿØÿà^@^PJFIF^@^A^A^A^@`^@`^@^@ÿÛ^@C^@^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^AÿÛ^
^KÿÄ^@µ^P^@^B^A^C^C^B^D^C^E^E^D^D^@^@^A}^A^B^C^@

to an image file as follows:

string s = "\nÿØÿà^@^PJ...";
byte[] result = System.Convert.FromBase64String(s);
File.WriteAllBytes("image.jpg", result);

I get this error message: "The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters."

How can I fix it? There are probably \n characters in the string, but replacing \n with an empty string does not help.

Could you please send the file? I guess that some characters are missing when you copy the file content here. — Alireza Roshanzamir, Aug 02 '23 at 21:12
[Here](https://drive.google.com/file/d/1eWZe-GgFTRqr8BZM_rMAFJM62Us7NUu4/view?usp=drive_link) is the mhtml file called `robot`. I'm on Windows right now, so I don't have the original file from the example, but the problem is the same: I can't display the resulting image. — xralf, Aug 03 '23 at 05:52
But actually you can try any mhtml file saved via Chrome browser on Android OS. — xralf, Aug 03 '23 at 05:56
@AlirezaRoshanzamir [Here](https://drive.google.com/file/d/1CrVYdB-4oatJVCEzbFv0N5ZNeT2jWNHg/view?usp=drive_link) is the file as is in Ubuntu. — xralf, Aug 03 '23 at 19:45
Is [this](https://ibb.co/5k7vNWy) the kind of image you expected to see? — Alireza Roshanzamir, Aug 03 '23 at 19:55
It's probably corrupted. It's only [this](https://vtm.zive.cz/clanky/robot-atlas-predvadi-dokonaly-parkour-tentokrat-boston-dynamics-pridava-i-nepovedene-zabery/sc-870-a-211811/default.aspx) webpage saved in Chrome in Android. — xralf, Aug 03 '23 at 19:57
I don't know why, but the images inside the file you sent are corrupted. I've opened your file [using many applications](https://ibb.co/HxLLL2z) and found out that the images themselves are corrupted. It seems that my extraction and conversion processes are working correctly (I've checked the results with the normal JPG headers), and I could obtain images similar to those inside the file programmatically or using Linux terminal applications. — Alireza Roshanzamir, Aug 03 '23 at 20:44
@AlirezaRoshanzamir Could the problem be that I sent the file from Android to my computer via e-mail? — xralf, Aug 03 '23 at 20:47
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254782/discussion-between-alireza-roshanzamir-and-xralf). — Alireza Roshanzamir, Aug 03 '23 at 21:00

score 1 · Answer 1 · answered Jul 29 '23 at 23:07

From the snippet you've shared, it seems the data you have is not Base64 encoded, but instead directly represents the bytes of a JPEG file (as seen from the magic number ÿØÿà at the start, which corresponds to JPEG).

If this is the case, you don't need to perform a Base64 conversion at all. Instead, you need to convert this string to bytes directly.
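One way to confirm that the data is raw JPEG rather than Base64 is to inspect its first bytes. The sketch below is written in Java purely for illustration, and the class and method names are hypothetical. It checks for the byte signature FF D8 FF, which is what ÿØÿ stands for when the bytes are displayed as ISO-8859-1 text. A Base64-encoded JPEG would instead begin with the ASCII text /9j/.

```java
final class JpegSignature {
    /**
     * Returns true if the data starts with FF D8 FF: the JPEG start-of-image
     * marker (FF D8) followed by the first byte of the next marker (FF).
     */
    static boolean looksLikeJpeg(byte[] data) {
        // Java bytes are signed, so mask with 0xFF before comparing.
        return data.length >= 3
            && (data[0] & 0xFF) == 0xFF
            && (data[1] & 0xFF) == 0xD8
            && (data[2] & 0xFF) == 0xFF;
    }
}
```

If this check passes on the bytes read from the archive, the part can be written to disk as it is, with no decoding step.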

In C#, you can use the Encoding class to convert a string to bytes. If the string represents bytes as UTF-8, you can convert it like so:

string s = "\nÿØÿà^@^PJ...";
byte[] result = Encoding.UTF8.GetBytes(s);
File.WriteAllBytes("image.jpg", result);

The error message disappeared but I don't have the image, even when I replace all `\r\n` with empty string. — xralf, Jul 29 '23 at 23:24

Alireza Roshanzamir · Accepted Answer · 2023-08-05T12:53:10.107

Because you mentioned that you want to implement your solution in Java, I developed a simple solution that can be easily converted to Java.

The following code reads the robot.mhtml file and dumps the content of each part to separate files in the out/ directory:

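using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// ISO-8859-1 maps every byte to exactly one character, so binary parts survive the round trip.
Encoding encoding = Encoding.GetEncoding("ISO-8859-1");

string mhtml = File.ReadAllText("./robot.mhtml", encoding);

// name:    last path segment of the Content-Location header
// content: everything from the blank line up to the next boundary line
// [^\r\n] and the explicit \r\n keep the CR of each CRLF line ending
// out of the captured file name and content.
MatchCollection matches = Regex.Matches(
    mhtml,
    @"Content-Location: .*/(?<name>[^\r\n]*)\r\n\r\n(?<content>(\n|.)+?)(?=\r\n------MultipartBoundary--)"
);

Directory.CreateDirectory("out");

// Write each part back with the same encoding to restore the original bytes.
foreach (Match match in matches)
{
    File.WriteAllText("out/" + match.Groups["name"].Value, match.Groups["content"].Value, encoding);
}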

I tested it, and it works.

Let me provide a complete explanation of the Regex for you:

The regex attempts to extract each part name (using the final part of the Content-Location header) and its content.
Without the Singleline flag, the . matches every character except \n. Therefore, when we intend to match everything, including new lines, we should use (.|\n).
Following the HTTP protocol, there is a single additional \r\n between the headers and content.
The (?<group_name>pattern) creates a regex group with the name group_name and a specified pattern, allowing us to request the matches to return only these specific parts from the complete match.
The +? signifies that it should not extend the match greedily. If you use a simple +, it captures content until the last ------MultipartBoundary-- line (resulting in only one file being extracted). However, we aim to capture content only until the first occurrence.
The .+(?=sequence) construct is a positive lookahead: it matches up to the point where sequence begins, without consuming sequence itself.
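For the Java port, a minimal sketch under a few assumptions might look like the following. MhtmlParts and extract are hypothetical names. The pattern assumes CRLF line endings and that Content-Location is the last header of each part, as in the Chrome-generated sample in the question. It is not a line-for-line translation: by default Java's . skips \r and \u0085 as well as \n, so the sketch uses Pattern.DOTALL for the content group instead of (\n|.), and [^\r\n] for the header line.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MhtmlParts {
    // DOTALL lets "." match every character, including line terminators,
    // which binary content will contain. The "+?" is lazy, so each match
    // stops at the first following boundary line rather than the last one.
    private static final Pattern PART = Pattern.compile(
        "Content-Location: [^\\r\\n]*/(?<name>[^\\r\\n]*)\\r\\n\\r\\n"
            + "(?<content>.+?)(?=\\r\\n------MultipartBoundary--)",
        Pattern.DOTALL);

    /** Maps each part's name (last segment of Content-Location) to its raw bytes. */
    static Map<String, byte[]> extract(byte[] mhtml) {
        // ISO-8859-1 turns each byte into exactly one char and back again.
        String text = new String(mhtml, StandardCharsets.ISO_8859_1);
        Map<String, byte[]> parts = new LinkedHashMap<>();
        Matcher m = PART.matcher(text);
        while (m.find()) {
            parts.put(m.group("name"),
                      m.group("content").getBytes(StandardCharsets.ISO_8859_1));
        }
        return parts;
    }
}
```

A caller could pass in the result of Files.readAllBytes for the archive and write each map value with Files.write. Names taken from URLs may contain characters that are not valid in file names (a query string, for instance), and parts sharing a name overwrite each other in the map, so a real tool would probably need to sanitize or number them.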

Some other notes:

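The binary parts of the archive are raw bytes, and ISO-8859-1 maps each byte value (0 to 255) to exactly one character. So, you should read and write the files using this encoding to keep the bytes unchanged.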
This is a file on which I tested my solution. I visited your mentioned website and downloaded the page using Chrome on Android.
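To achieve the same result in Java, you should take into account the default flags and behaviors of Java's regex engine. They are largely similar to those in C#, with one relevant difference: by default, Java's . excludes every line terminator (\r, \u0085, \u2028 and \u2029 as well as \n), so the content group needs Pattern.DOTALL there.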
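The encoding note can be checked with a small experiment. The sketch below (Java, with hypothetical names) decodes a few bytes to a string and encodes them back, which is what reading and then writing the archive as text amounts to. It is only an illustration of the round-trip property, not part of the extraction code.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class EncodingRoundTrip {
    /** True if decoding the bytes with cs and encoding them again yields the same bytes. */
    static boolean survives(byte[] data, Charset cs) {
        return Arrays.equals(data, new String(data, cs).getBytes(cs));
    }

    /** Number of bytes produced when the text is encoded with cs. */
    static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }
}
```

For the JPEG header bytes FF D8 FF E0, survives returns true with StandardCharsets.ISO_8859_1 and false with StandardCharsets.UTF_8, because those bytes are not valid UTF-8 and are replaced during decoding. In the other direction, the four characters ÿØÿà encode to eight bytes in UTF-8 and to four bytes in ISO-8859-1, which is why writing the pasted text as UTF-8 does not produce a valid image.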

In addition to coding and logging, you can test your customized regex in a dotnet-specific regex tester to observe the results and captured groups.

I have to admit, that the regex quite baffles me. Could you show the regex without capture groups yet, to be better understood? Do you know some tool (e.g. under Ubuntu) where I can play with the regex a little? Thank you. — xralf, Aug 05 '23 at 10:23
I removed redundant parts from the regex and added an explanation to my solution. Is that okay? — Alireza Roshanzamir, Aug 05 '23 at 12:39
Thanks a lot, this makes more sense. I need to practice more regexes :-) — xralf, Aug 05 '23 at 18:19
