How to extract email headers spanning multiple lines from a file

Question

I am trying to extract the To header from an email file using sed on Linux.

The problem is that the To header could be on multiple lines.

e.g:

To: name1@mydomain.org, name2@mydomain.org,
    name3@mydomain.org, name4@mydomain.org, 
    name5@mydomain.org
Message-ID: <46608700.369886.1549009227948@domain.org>

I tried the following:

sed -n -e '/^[Tt]o: / { N; p; }' _message_file_ |
    awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'

The sed command extracts the line starting with To and the next line. I pipe the output to awk to put everything on a single line.

The full command outputs in one line:

To: name1@mydomain.org, name2@mydomain.org, name3@mydomain.org, name4@mydomain.org

I don't know how to keep going: test whether each following line starts with whitespace and, if so, append it to the result.

What I want is all the addresses on one line:

To: name1@mydomain.org, name2@mydomain.org, name3@mydomain.org, name4@mydomain.org, name5@mydomain.org

Any help will be appreciated.

Try this https://stackoverflow.com/questions/4857424/extract-lines-between-2-tokens-in-a-text-file-using-bash — newbie, Feb 01 '19 at 15:06
You really need to use procmail/formail for this. See http://www.tutorialspoint.com/unix_commands/procmail.htm and http://www.tutorialspoint.com/unix_commands/formail.htm — Ed Morton, Feb 01 '19 at 15:10
@EdMorton: yes thank you. I did it like this: cat _message_2 | formail -X To: | awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}' — Anis Bedhiafi, Feb 01 '19 at 15:21

jhnc · Answer 1 · 2019-02-02T19:51:24.130

4

formail is a good solution but here's how to do it with sed:

sed -e '/^$/q;/^To:/!d;n;:c;/^\s/!d;n;bc' message_file

/^$/q; - (optional) quit if we run out of headers
/^To:/!d; - if not a To: header, stop processing this line
n; - otherwise, implicitly print it, and load next line
:c; - c is a label we can branch to
/^\s/!d; - if not a continuation, stop processing this line
n; - otherwise, implicitly print it, and load next line
bc - branch back to label c (ie. loop)
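
To see the loop in action, here is a minimal sketch run against a made-up message (the file contents and the to_line variable are illustrative, not from the question). It spells the same commands with [[:space:]] and one -e per command, on the assumption that this is easier to carry over to non-GNU seds. Because q auto-prints the blank separator line before exiting, the awk stage skips empty lines and joins the rest into one line.

```shell
#!/bin/sh
# Hypothetical sample message; addresses and body text are illustrative only.
msg=$(mktemp)
printf '%s\n' \
  'From: sender@mydomain.org' \
  'To: name1@mydomain.org, name2@mydomain.org,' \
  '    name3@mydomain.org, name4@mydomain.org,' \
  '    name5@mydomain.org' \
  'Message-ID: <46608700.369886.1549009227948@domain.org>' \
  '' \
  'To: this body line must not be printed' > "$msg"

# Same loop as the one-liner above, one -e per command.
# awk: NF skips the blank line printed by q, $1 = $1 trims and squeezes
# whitespace, and sep puts a single space between the joined lines.
to_line=$(sed -e '/^$/q' -e '/^To:/!d' -e 'n' -e ':c' \
              -e '/^[[:space:]]/!d' -e 'n' -e 'bc' "$msg" |
          awk 'NF { $1 = $1; printf "%s%s", sep, $0; sep = " " } END { print "" }')

printf '%s\n' "$to_line"
rm -f "$msg"
```

Stopping at the first empty line keeps the body line that happens to start with To: out of the result.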

edited Feb 02 '19 at 19:51

answered Feb 01 '19 at 18:01

jhnc


Yes, it works if we just add a small ^ in front of the To otherwise it will print the first line containing To in the file. The command is now: sed -e '/^$/q;/^To:/!d;n;:c;/^\s/!d;n;bc' _message_file – Anis Bedhiafi Feb 02 '19 at 19:45
Sorry, that was a typo. Of course there should have been a `^` (it was in the comment!). – jhnc Feb 02 '19 at 19:51
Just one little thing. The command adds an extra space in the end. But the simplest solution I could find is : formail -X To: < _message_2 – Anis Bedhiafi Feb 02 '19 at 19:52
The command proposed by @potong sed '/^\S/h;G;/^To:/MP;d' file just does the job in few actions – Anis Bedhiafi Feb 02 '19 at 19:53

score 3 · Answer 2 · answered Oct 06 '20 at 20:29

Both formail and reformail have a -c option to do exactly that.

From man reformail:

-c   Concatenate multi-line headers. Headers split on multiple lines
     are combined into a single line.

So you don't need to pipe the output to awk, and can just do

reformail -c -X To: < $your_message_file

However, emails normally use CRLF line endings, and the output on screen may be garbled because of the CR characters. To remove them, you can use Perl's generic \R line ending in a regex on the output:

reformail -c -X To: < $your_message_file | perl -pe 's/\R/\n/g'

or do it on the input if you prefer:

perl -pe 's/\R/\n/g' $your_message_file | reformail -c -X To:
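
If Perl is not at hand, tr -d '\r' is another common way to drop the carriage returns. The sketch below is only an illustration under stated assumptions: the CRLF sample message and the to_lines variable are made up, and a plain sed loop stands in for reformail so that the example runs without maildrop installed. It also shows why the cleanup matters for hand-written filters: in a CRLF file the "blank" line after the headers still contains a CR, so a test such as /^$/ only matches once the CRs are gone.

```shell
#!/bin/sh
# Made-up CRLF message; contents are illustrative only.
msg=$(mktemp)
printf 'To: name1@mydomain.org,\r\n    name2@mydomain.org\r\nSubject: test\r\n\r\nTo: body text\r\n' > "$msg"

# tr deletes every carriage return before sed sees the text.
# sed -n prints the To: line and its continuation lines, then quits
# at the first empty line, which marks the end of the headers.
to_lines=$(tr -d '\r' < "$msg" |
  sed -n -e '/^To:/{' -e ':a' -e 'p' -e 'n' -e '/^[[:space:]]/ba' -e '}' -e '/^$/q')

printf '%s\n' "$to_lines"
rm -f "$msg"
```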

On Debian and derived systems like Ubuntu, you can install them with

apt install maildrop for reformail, which is part of Courier's maildrop
or apt install procmail for formail (but procmail seems to be abandoned now).

Anis Bedhiafi · Answer 3 · 2019-02-02T19:49:56.057

2

I did it like this:

cat _message_file | formail -X To: | awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'

Or:

formail -X To: < _message_file | awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'

edited Feb 02 '19 at 19:49

answered Feb 01 '19 at 15:22

Anis Bedhiafi

That's a [useless use of `cat`](/questions/11710552/useless-use-of-cat) – tripleee Feb 01 '19 at 18:07
Yes you are right. We can do like this: formail -X To: < message_file | awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}' – Anis Bedhiafi Feb 02 '19 at 19:49

score 1 · Answer 4 · answered Feb 02 '19 at 09:52

This might work for you (GNU sed):

sed -n '/^To:/{:a;N;/^ /Ms/\s*\n\s*/ /;ta;P}' file

Turn off implicit printing with the -n option. Starting from the line that begins with To:, append each following line; if the appended line starts with a space, replace the newline and the white space on either side of it with a single space and repeat. When the match fails, print the first line in the pattern space.

To print addresses as is, use:

sed '/^\S/h;G;/^To:/MP;d' file
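
For comparison, the gather-and-join idea of the first command can be sketched in awk, which does not depend on GNU sed extensions such as \s or the M flag. This is a sketch under assumptions rather than a drop-in replacement: the sample message and variable names are made up, and matching To: case-insensitively, accepting tab-indented continuations, and stopping at the first blank line are guesses about what is wanted.

```shell
#!/bin/sh
# Made-up message with one tab-indented and one space-indented continuation.
msg=$(mktemp)
printf 'Subject: test\nTo: name1@mydomain.org,\n\tname2@mydomain.org,\n    name3@mydomain.org\nMessage-ID: <id@domain.org>\n\nbody\n' > "$msg"

to_line=$(awk '
  /^$/              { exit }                                # blank line ends the headers
  found && /^[ \t]/ { $1 = $1; line = line " " $0; next }   # continuation: trim and append
  found             { exit }                                # next header: To: is complete
  /^[Tt][Oo]:/      { found = 1; $1 = $1; line = $0 }       # start of the To: header
  END               { if (found) print line }               # END still runs after exit
' "$msg")

printf '%s\n' "$to_line"
rm -f "$msg"
```

The rule order matters: the two found rules come before the To: rule, so the To: line itself is not mistaken for the header that ends the block.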

answered Feb 02 '19 at 09:52

potong

The second command worked! The first just prints the first line. Thank you – Anis Bedhiafi Feb 02 '19 at 19:41
Okay, that second one is a bit mindbending :) I guess it's mostly portable (though not as short!) if written as: `sed '/^[^ ^T]/h;G;/\nTo:/P;d' file` where `^T` is a literal tab character – jhnc Feb 02 '19 at 20:27

score 0 · Answer 5 · answered Aug 26 '22 at 01:24

It could be as straightforward as this:

sed -n '/^To:/{
    :a
    p
    n
    /^[[:space:]]/ba
}'

Be silent, but starting from the To: header, print the text line by line while it is still part of that header.

answered Aug 26 '22 at 01:24

Mike Volokhov
