
I have an .mbox mailbox file containing thousands of email messages in different languages, variously encoded in ASCII, ISO-8859-1 and UTF-8. I want to "flatten" the file into UTF-8.

My first effort was to loop through the file, doing a file -b --mime-encoding on each character, and an iconv -f ISO-8859-1 -t UTF-8 on any character detected as ISO-8859-1. I understand that UTF-8 is a superset of ASCII, so only ISO-8859-1 needs conversion.
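In spirit, the loop looked something like this (a per-line reconstruction for illustration, since I no longer have the exact script; mailbox.mbox and flattened.mbox are placeholder names):

while IFS= read -r line; do
  # detect the encoding of this chunk of input
  enc=$(printf '%s\n' "$line" | file -b --mime-encoding -)
  if [ "$enc" = "iso-8859-1" ]; then
    # convert only the ISO-8859-1 parts; ASCII/UTF-8 pass through unchanged
    printf '%s\n' "$line" | iconv -f ISO-8859-1 -t UTF-8
  else
    printf '%s\n' "$line"
  fi
done < mailbox.mbox > flattened.mbox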

This took forever and for some reason did not work as expected. Problem characters remained.

Is there an obvious way of doing this in a one-liner, or will it be necessary to resort to formail to convert the file message by message?

Jortstek
  • If you want your messages to appear right in your mail program, the actual encoding must match the "Content-Type" and "Content-Transfer-Encoding" headers, and you cannot achieve this by such a crude conversion of the whole file. – dr_agon Dec 31 '14 at 04:59

3 Answers


As far as I know, MIME mails and their container .mbox files are always encoded in ASCII, with non-ASCII source characters represented in Quoted-Printable (QP) form.

This means that any original non-ASCII character you encounter won't be a raw ISO-8859-1 byte (or whatever), but will already have been converted into something that fits the following regex: =[0-9A-F]{2}. For example, é (byte 0xE9 in ISO-8859-1) appears as =E9, so "café" is stored as "caf=E9".

You can convert the QP encoding simply, using sed and echo -e, in this way:

sed -re 's/=([0-9A-F]{2})/\\u00\1/g' | while read -r L ; do echo -e "$L" ; done

Explanation:

  • sed will substitute all two-hex-digit QP forms like "=E9" with Unicode codepoint escapes like "\u00E9" (this mapping works because ISO-8859-1 bytes correspond one-to-one to the codepoints U+0000–U+00FF)
  • echo -e can convert the latter into their character form (supported since bash 4.2)
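For example, on a made-up sample string (bash ≥ 4.2; output shown as a comment):

printf 'caf=E9\n' | sed -re 's/=([0-9A-F]{2})/\\u00\1/g' | while read -r L ; do echo -e "$L" ; done
# café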
rloth
  • 100
  • 1
  • 6
  • Interesting tip about quoted-printable, thanks! Unfortunately, after passing through various editors over the years, the text which now makes up my .mbox is definitely not just ASCII. And saving it in Thunderbird doesn't flatten it into QP ASCII either, I just checked. :( – Jortstek Dec 28 '14 at 19:14
  • Hmm, ok, sorry... then iconv may be the way, but only on chunks of text and **not** "character by character" as you tried: until you know the encoding, the only thing you have is bytes, and it's meaningless to talk about characters yet. – rloth Dec 29 '14 at 11:03
  • Hmm, perhaps split the mbox and then group the non-ASCII ones for additional treatment? If they are truly encoded then Pandoc may be of use. i.e. something like: awk '/^From /{close("file"f);f++}{print $0 > "file"f}' input.mbox to split (mbox messages begin with "From " lines) and additional REGEX to further build your edit list; when all done then 're-cat' to a new mbox file? :) – Dale_Reagan Dec 30 '14 at 02:52
  • @Dale_Reagan Thanks for the suggestion, I didn't know about Pandoc. – Jortstek Jan 03 '15 at 00:51
  • QP does not mean that the character is UTF-8 encoded. For example `=?iso-8859-1?Q?N=E4he?=` contains the German umlaut `ä`, but with `echo -e '\u00E4'` it would become `▒`. – mgutt Feb 08 '23 at 15:43

recode has support for decoding surfaces, i.e. Quoted-Printable or Base64, as well as charsets. So you would do:

recode CP1252/QP..UTF-8 < filein > fileout
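A quick smoke test (the sample string is made up; =E9 is é in CP1252; output shown as a comment):

printf 'caf=E9\n' | recode CP1252/QP..UTF-8
# café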

One "real" problem now lies here (emphasis mine):

thousands of email messages in different languages, **variously encoded in ASCII, ISO-8859-1 and UTF-8**

The recode request differs between those files. Trivially, ASCII and UTF-8 files do not require recoding. You need to examine all those files and single out, say, the ISO-8859-1 ones:

find . -name "*.mbox" -exec file -i "{}" ";" \
   | grep -v "\(us-ascii\|utf-8\)$" \
   | sed -e 's/^\([^:]*\): .*; charset=\([^=]*\)$/recode \2\/QP..utf-8 < "\1" > "\1.tmp" && mv "\1.tmp" "\1"/g' \
   > recode-script.sh
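After inspecting the generated commands, the script can be run in one go:

sh recode-script.sh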

Another problem is that, at least in my limited experience, a good fraction of the files might not be encoded with a Quoted-Printable surface at all; file reporting ISO-8859-1 is itself the giveaway, since an actually Quoted-Printable file would come out as pure 7-bit ASCII. You would need to recognize those files, which requires parsing the mbox format (also because, while unlikely, you could even have different multipart sections with different charsets and/or surfaces in the same message, and straight decoding the whole file with a single recode request would decode some sections and damage others).

So, for best results, unless you're sure you only have ISO-8859-1(5) files, formail is your friend; see the sketch below. You can pre-filter the files with a variation of the above script to focus on the files actually in need of conversion (files reported as ascii or utf-8 require no modification). If you discover that the files requiring recoding all use the same surface, then recode will probably give the best performance.
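A message-by-message split could look like this (a sketch along the lines of the example in the formail man page; mailbox.mbox and the msg. prefix are placeholders), after which each msg.* file can be examined and recoded on its own before re-concatenating:

formail -ds sh -c 'cat > msg.$FILENO' < mailbox.mbox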

Note: I remember seeing a utility that would take a list of text files as input and output those files in a single stream, separated by ">>>filename<<<" markers. It was called stitch (my google-fu is not up to the task of finding it again just now). The same utility would take such a stream and split it back into the original separate files, in such a way that ls *.txt | stitch | stitch -u would not damage the files themselves. One could use this approach to run a single recode process efficiently over many small files.

LSerni

I was not able to come up with a fast one-liner, but this should cover all the relevant cases, like different charsets and even Base64-encoded strings:

# $mail_file must be set beforehand to the path of the mbox to fix up.
# Finds every RFC 2047 encoded-word (=?charset?encoding?content?=) and
# replaces it in place with its UTF-8 decoding.
while read -r encoded; do
  # split on "?" to obtain charset, encoding and content
  IFS="?" read -r tmp charset encoding content tmp <<< "$encoded"
  echo "$encoding" | grep -iqF "b" && encoding="BASE64" || encoding="QP"
  # decode content ("_" stands for a space in Q-encoding)
  decoded=$(echo "${content//_/ }" | recode "$charset/$encoding..utf8")
  # escape both strings for use in the sed expression below
  encoded=$(echo "$encoded" | sed -e 's/[]\/$*.^[]/\\&/g')
  decoded=$(echo "$decoded" | sed -e 's/[]\/$*.^[]/\\&/g')
  # replace encoded string with decoded string
  sed -i "s/$encoded/$decoded/g" "$mail_file"
done < <(grep -o "=?.*?=" "$mail_file")
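To see the inner decode step in isolation, here it is applied to the encoded-word from the comment above (output shown as a comment):

echo 'N=E4he' | recode iso-8859-1/QP..utf8
# Nähe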

Maybe someone can adapt this with awk to make it faster.

Escaping borrowed from here.

mgutt