Decode the utf8 to ISO-8859-1 mail subject to text in .procmailrc file

Question

I set out to write a simple procmail recipe that would forward a message if it found the text "ABC Store: New Order" in the subject.

:0
* ^(To|From).*abc@cdefgh.com
* ^Subject:.*ABC Store: New Order*
{

Unfortunately the subject field in the mail message coming from the mail server was in MIME encoded-word syntax.

Subject: =?UTF-8?B?QUJDIFN0b3JlOiBOZXcgT3JkZXI=?=

The above subject is in the utf-8 ISO-8859-1 charset, so I was wondering whether there are any mechanisms, scripts, or utilities to parse this and convert it to a plain string so that I can apply my procmail filter.

What you are looking at is a RFC2047-encoded header. Like it says in the charset part, it is in UTF-8, base64-encoded. There is no ISO-8859-1 here (that's a different encoding; it can't be in ISO-8859-1 aka Latin-1 if it's in UTF-8). — tripleee, Apr 20 '15 at 08:00
In the general case, the repertoire of UTF-8 is much larger than the repertoire of ISO-8859-1, so you will not always be able to translate UTF-8 to ISO-8859-1. If you only care about unwrapping the RFC2047 encoding and recovering the UTF-8 text, that's always possible (and perhaps a better thing to do). — tripleee, Apr 20 '15 at 08:03
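To make the structure described in the comments above concrete, here is a minimal shell sketch that takes the encoded word from the question apart by hand. It assumes a `base64` utility that accepts `-d` (GNU coreutils and BusyBox do; some older systems use a different flag), and the variable names are only illustrative. It handles a single B-encoded word and is not a general RFC 2047 decoder.

```shell
# Hand-decode the encoded word from the question.
# Layout per RFC 2047: =?charset?encoding?payload?=
encoded='=?UTF-8?B?QUJDIFN0b3JlOiBOZXcgT3JkZXI=?='

# Split on "?": field 2 is the charset, 3 the encoding, 4 the payload.
charset=$(printf '%s' "$encoded" | cut -d'?' -f2)
encoding=$(printf '%s' "$encoded" | cut -d'?' -f3)
payload=$(printf '%s' "$encoded" | cut -d'?' -f4)

# "B" means base64; the decoded bytes are text in the named charset.
# Assumption: base64 accepts -d on this system.
decoded=$(printf '%s' "$payload" | base64 -d)

printf 'charset=%s encoding=%s text=%s\n' "$charset" "$encoding" "$decoded"
```

The charset field reads UTF-8 and the encoding field reads B, which matches the comment above: the header is base64-wrapped UTF-8, and ISO-8859-1 does not appear anywhere in it.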

AnFi · Accepted Answer · 2020-03-07T19:29:36.883

20

You can use a Perl one-liner to decode the Subject: header before assigning it to a procmail variable.

# Store a possibly encoded Subject: in $SUBJECT after conversion to ISO-8859-1
:0 h
* ^Subject:.*=\?
SUBJECT=| formail -cXSubject: | perl -MEncode=from_to -pe 'from_to $_, "MIME-Header", "iso-8859-1"'

# Store all remaining cases of Subject: into $SUBJECT
:0 hE
SUBJECT=| formail -cXSubject:

# Trigger the recipe based on the content of $SUBJECT as well
:0
* ^(To|From).*abc@cdefgh.com
* SUBJECT ?? ^Subject:.*ABC Store: New Order
{
....
}

Comment (2020-03-07): It may be better to convert to UTF-8 charset instead of ISO-8859-*.

edited Mar 07 '20 at 19:29

answered Apr 18 '15 at 10:25

AnFi


Nice. I had no idea that `MIME-Header` was an available encoding – Borodin Apr 18 '15 at 16:12
Though the `r*` in the regex `New Order*` is kind of silly, and arguably wrong. – tripleee Apr 20 '15 at 04:52
Why is the command for the "remaining cases" like this: `SUBJECT=| formail -cXSubject` **without a colon**, unlike the command for the first case: `SUBJECT=| formail -cXSubject: |`? – imz -- Ivan Zakharyaschev Mar 26 '17 at 16:15
I have fixed the example to match the syntax used in the `man formail` examples. A basic test of `formail -cXSubject` seems to produce correct results too. – AnFi Mar 26 '17 at 23:52
The argument to `formail -x` is just a string prefix; without the colon you will extract every header which *starts* with `Subject`; of course, in practice, unless you are running a fuzz tester or something, only `Subject:` will actually match. – tripleee Sep 03 '20 at 09:35

score 1 · Answer 2 · answered Apr 18 '15 at 15:04

You should use MIME::EncWords.

Like this:

use strict;
use warnings;
use 5.010;

use MIME::EncWords 'decode_mimewords';

my $subject = '=?UTF-8?B?QUJDIFN0b3JlOiBOZXcgT3JkZXI=?=';
my $decoded = decode_mimewords($subject);
say $decoded;

Output:

ABC Store: New Order

answered Apr 18 '15 at 15:04

Borodin


This only unwraps the RFC2047 encoding; the result is still in UTF-8. Because the OP's regex doesn't contain any characters where the encoding differs between ISO-8859-1 and UTF-8, it doesn't seem to matter; but if you want to match text which is not pure ASCII, the encoding does matter, and you should know which encoding you are using. (Like I argue in another comment, I would actually suggest to keep everything in UTF-8; but that is perhaps not what the OP is requesting. Though the question is unclear on this part.) – tripleee Apr 20 '15 at 08:05
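The point in the comment above about non-ASCII text can be illustrated with a short shell sketch. It prints the raw bytes of "é" (U+00E9) in each encoding, and of the ASCII-only string from the question. The bytes are written as octal escapes so the script itself stays pure ASCII, and the variable names are only illustrative.

```shell
# "é" (U+00E9) as raw bytes in each encoding, given as octal escapes.
# od prints the bytes in hex; tr strips the spaces and newline od adds.
utf8_hex=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')
latin1_hex=$(printf '\351' | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_hex"
echo "ISO-8859-1: $latin1_hex"

# Pure ASCII text has the same bytes in both encodings.
ascii_hex=$(printf 'New Order' | od -An -tx1 | tr -d ' \n')
echo "ASCII:      $ascii_hex"
```

Because the two encodings give different byte sequences for the same character, a procmail condition containing non-ASCII characters only matches if the recipe file and the decoded subject use the same encoding. The ASCII-only pattern in the question is unaffected.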
