recode supports decoding from surfaces (such as Quoted-Printable or Base64) as well as charsets, so you would do:
recode CP1252/QP..UTF-8 < filein > fileout
One "real" problem now lies here (emphasis mine):
thousands of email messages in different languages, variously encoded in ASCII, ISO-8859-1 and UTF-8
The recode request differs between those files. Trivially, ASCII and UTF-8 files do not require recoding, so you need to examine all the files and pick out, say, the ISO-8859-1 ones:
find . -name "*.mbox" -exec file -i "{}" ";" \
| grep -v "\(us-ascii\|utf-8\)$" \
| sed -e 's/^\([^:]*\): .*; charset=\([^=]*\)$/recode \2\/QP..utf-8 < "\1" > "\1.tmp" && mv "\1.tmp" "\1"/g' \
> recode-script.sh
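Before generating the script, it can help to see which charsets file actually reports across the whole collection. This is a minimal sketch, assuming the "name: type; charset=xxx" output format of file -i used above; tally_charsets is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: read `file -i` output on stdin and print how many
# files were reported for each charset, most frequent first.
tally_charsets() {
    # Keep only the charset value from lines like
    #   ./a.mbox: text/plain; charset=iso-8859-1
    sed -n 's/^.*; charset=\([^;[:space:]]*\).*$/\1/p' \
        | sort | uniq -c | sort -rn
}
```

It would be used as find . -name "*.mbox" -exec file -i "{}" + | tally_charsets. Anything other than us-ascii, utf-8 and the charset you expect is worth inspecting by hand before running a bulk conversion.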
Another problem is that, at least in my limited experience, a good fraction of the files might not be in a Quoted-Printable surface at all. (You'll have noticed that file reports ISO-8859-1 for them, even though a Quoted-Printable body is pure 7-bit ASCII; such files must contain raw 8-bit bytes.) You'd need to recognize them, which requires parsing the mbox format. Parsing is also needed because, while unlikely, a single message could contain multipart sections with different charsets and/or surfaces, and decoding the whole file with a single request would decode some sections and damage others.
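As a rough first pass at recognizing the surface, one can look at the Content-Transfer-Encoding headers instead of guessing. This is only a sketch: declares_qp and list_surfaces are hypothetical helper names, and a plain grep does not parse the mbox, so a body line that happens to start with that header name would be counted too:

```shell
# Hypothetical helper: succeed if the file declares a quoted-printable
# transfer encoding anywhere (top-level header or MIME part header).
declares_qp() {
    grep -qi '^Content-Transfer-Encoding:[[:space:]]*quoted-printable' "$1"
}

# Hypothetical helper: list the distinct transfer encodings declared in a
# file, with counts, to spot mailboxes that mix several surfaces.
list_surfaces() {
    grep -i '^Content-Transfer-Encoding:' "$1" \
        | tr 'A-Z' 'a-z' \
        | sed -e 's/^content-transfer-encoding:[[:space:]]*//' \
              -e 's/[[:space:]]*$//' \
        | sort | uniq -c
}
```

A file for which list_surfaces prints more than one line mixes surfaces, and is a candidate for per-message handling rather than a single whole-file recode request.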
So, for best results, unless you're sure you only have ISO-8859-1 (or ISO-8859-15) files, formail is your friend. You can pre-filter the files with a variation of the above script to focus on the files actually in need of conversion (files reported as ASCII or UTF-8 require no modification). If you discover that the files requiring recoding all use the same surface, then recode will probably give the best performance.
Note: I remember seeing a utility that took a list of text files as input and output them as a single stream, separated by ">>>filename<<<" markers. It was called stitch (my google-fu is not up to the task of finding it again just now). The same utility could take such a stream and split it back into the original separate files, so that ls *.txt | stitch | stitch -u would leave the files themselves undamaged. One could use this approach to run a single recode process efficiently on many small files.
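Since I can't point to the original tool, here is a minimal sketch of the same idea using sh and awk. It is not the stitch utility itself: the function names stitch and unstitch are my own, and it assumes file names without newlines, files ending in a newline, and no content line that itself looks like a ">>>name<<<" marker:

```shell
# stitch: read file names (one per line) on stdin and emit a single stream
# in which each file is preceded by a ">>>name<<<" marker line.
stitch() {
    while IFS= read -r name; do
        printf '>>>%s<<<\n' "$name"
        cat -- "$name"
    done
}

# unstitch: read such a stream on stdin and write each file back.
# Everything is buffered in memory and written only at end of input, so the
# files are never truncated while stitch is still reading them.
unstitch() {
    LC_ALL=C awk '
        /^>>>.*<<<$/ {
            n++
            name[n] = substr($0, 4, length($0) - 6)   # strip >>> and <<<
            body[n] = ""
            next
        }
        n > 0 { body[n] = body[n] $0 "\n" }
        END {
            for (i = 1; i <= n; i++) {
                printf "%s", body[i] > name[i]
                close(name[i])
            }
        }
    '
}
```

With these, ls *.txt | stitch | unstitch is a round trip, and putting a filter such as recode ISO-8859-1..UTF-8 between the two would convert all the files with one process. Note that the marker lines pass through the filter as well, so non-ASCII file names would be recoded along with the contents.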