Is there a way to use bash to get specific text content of a .eml?

Question

Total noob here with both bash and working with .eml files, so bear with me...

I have a folder with many saved .eml files, and I want a bash script (if this is not possible with bash, I'm willing to use python, or zsh, or maybe perl--never used perl before, but it may be good to learn) that will print the email content after a line containing a specific textual phrase, and before the next empty line.

I also want this script to combine consecutive lines ending in "=". (Lines that do not end with an "=" sign should continue printing on a new line.)

All of my testing with .txt files that I create manually works fine, but when I use an actual .eml file, things stop working.

Here is a portion of a sample .eml file:

(.eml file continues above)
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

testing
StartLine (This is where stuff begins)
This is a line that should be printed.
This is a long line that should be printed. Soooooooooooooooooooooooooooooo=
 Loooooooooooooooooooooooonnnnnnnnnggggg. Soooooooooooooooooooooooooooooo L=
oooooooooooooooooooooooonnnnnnnnnggggg. Soooooooooooooooooooooooooooooo Loo=
oooooooooooooooooooooonnnnnnnnnggggg.

This is where things should stop (no more printing)
Don=92t print me please!
Don=92t print me please!
Don=92t print me please!




[This message is from an external sender.]

(.eml file continues below)

I want the script to output:

This is a line that should be printed.
This is a long line that should be printed. Soooooooooooooooooooooooooooooo Loooooooooooooooooooooooonnnnnnnnnggggg. Soooooooooooooooooooooooooooooo Loooooooooooooooooooooooonnnnnnnnnggggg. Soooooooooooooooooooooooooooooo Loooooooooooooooooooooooonnnnnnnnnggggg.

Here is my script so far:

#!/bin/bash
files="/Users/username/Desktop/emails/*"
specifictext="StartLine"

for f in $files
do
     begin=false
     previous=""
     while read -r line
     do
          if [[ -z "$line" ]] #this doesn't seem to be working right
          then
               begin=false
          fi

          if [[ "$begin" = true ]]
          then
               if [[ "${line:0-1}" = "=" ]] #this also doesn't appear to be working
               then
                    previous=$previous"${line::${#line}-1}"
               else
                    echo $previous$line
               fi
          fi

          if [[ $line = "$specifictext"* ]]
          then
               begin=true
          fi

     done < "$f"
done

This will successfully skip everything up to and including the line containing $specifictext, but then it will print off the entire remainder of each email instead of stopping at the next empty line. Like this:

$ ./printeml.sh 
This is a line that should be printed.
This is a long line that should be printed. Soooooooooooooooooooooooooooooo=
Loooooooooooooooooooooooonnnnnnnnnggggg. Soooooooooooooooooooooooooooooo L=
oooooooooooooooooooooooonnnnnnnnnggggg. Soooooooooooooooooooooooooooooo Loo=
oooooooooooooooooooooonnnnnnnnnggggg.

This is where things should stop (no more printing)
Don=92t print me please!
Don=92t print me please!
Don=92t print me please!




[This message is from an external sender.]

(continues printing remainder of .eml)

As you can see above, the other issue I'm having is that I wanted to combine lines with "=" signs at the end, but that is not working. All the testing I do with test files works fine; it only fails when I use an actual .eml file. I think this is an issue with hidden characters in .eml files, but I'm not really sure how that works.

I'm using bash version 3.2.57(1) on MacOS 12.4.

I suspect the "empty" line is not empty but contains a carriage return. Maybe try `if [[ -z "$line" || $line = $'\r' ]]` — Mark Reed, Jun 15 '22 at 15:48
BTW, debugging your script with `bash -x yourscript` will show the above. — Charles Duffy, Jun 15 '22 at 16:16
"Content-Transfer-Encoding: quoted-printable" -- what you really want to do, I think, is to decode that quoted-printable message part. That's something bash isn't really suited for. Pick a general purpose language that has email processing libraries . — glenn jackman, Jun 15 '22 at 16:19
Thanks @MarkReed! That did the trick to resolve the main issue where it would print the entire remainder of the .eml file. Do you have any thoughts to combine lines that end in "="? — orchardl, Jun 15 '22 at 16:44
That's a good tip @CharlesDuffy. I'll definitely utilize that more in debugging. Thanks:) — orchardl, Jun 15 '22 at 16:48

Mark Reed · Answer 1 · 2022-06-15T17:31:33.100

Both of your problems stem from the fact that the .eml file is using Windows line endings (really, MIME line endings; the specification is designed for transmission over the TELNET protocol and thus dictates the use of CRLF instead of bare LF). Bash doesn't understand those, and sees the carriage return as an ordinary character that happens to be the last character of every line. So the blank lines are really single-character lines containing a carriage return, and the lines ending in an = really end in = followed by a carriage return ($'=\r'). When you check the last character, you're getting the carriage return, which of course is never =.
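To make the effect concrete, here is a small Python sketch. The sample bytes are made up to mirror the excerpt in the question, and splitting on a bare LF stands in for what bash's `read` does; it is an illustration, not a fix.

```python
# Hypothetical sample mirroring the question's excerpt, with CRLF line endings.
raw = b"StartLine\r\nSooo=\r\n Looong.\r\n\r\nstop\r\n"

# Splitting on bare LF leaves the carriage return attached to every line.
lines = raw.decode("ascii").split("\n")

continued = lines[1]  # the line that visually ends in "="
blank = lines[3]      # the line that looks empty

print(repr(blank))                        # a lone carriage return, not ""
print(repr(continued[-1]))                # the last character is the CR
print(repr(continued.rstrip("\r")[-1]))   # "=" only appears once the CR is stripped
```

So a test for an empty string, or for a last character of "=", only succeeds after the trailing carriage return has been removed.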

But that's just part of the problem. You could convert the file to UNIX line endings (though it wouldn't be a valid .eml file at that point) or account for the CRs in your code. However, the trailing equal sign for continued lines is just one part of the "quoted-printable" encoding scheme that the Content-Transfer-Encoding header tells you the message body is using. Another thing you may run into is that Q-P messages cannot legally contain any characters outside the ASCII range, but must use =xx with two hex digits to represent such characters. Any Windows-1252 character whose code point is > 127 will be replaced by =xx with the code in hexadecimal, as will any literal equal sign, which becomes =3D. That is why the apostrophe in your sample shows up as =92.
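As a rough illustration of what a quoted-printable decoder does with those pieces, here is a minimal sketch using Python's standard `quopri` module. The `encoded` fragment is invented for the example, and the charset is assumed to be the Windows-1252 named in the question's Content-Type header.

```python
import quopri

# Hypothetical body fragment, encoded the way it would appear inside the .eml.
encoded = (
    b"Sooo=\r\n"
    b" Looong.\r\n"
    b"Don=92t print me please!\r\n"
    b"1+1=3D2\r\n"
)

# Decoding removes soft line breaks ("=" at end of line) and
# turns each =XX escape back into the single byte it stands for.
decoded_bytes = quopri.decodestring(encoded)

# The charset from the Content-Type header says how to read those bytes as text.
text = decoded_bytes.decode("windows-1252")

for line in text.splitlines():
    print(line)
```

The soft line break joins the first two lines, =92 becomes a typographic apostrophe, and =3D becomes a literal equal sign.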

So you should ideally be using some library that understands MIME messages rather than trying to roll your own code to do bits and pieces of the decoding. Perhaps a Perl script using the MIME::Parser module would be appropriate? Or you could use the Python answers given to this question.
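If you go the Python route, a minimal sketch with the standard `email` package might look like the following. The function name `extract_after` and the `sample` message are made up for illustration, and the sketch assumes the text you want lives in the message's text/plain part.

```python
from email import policy
from email.parser import BytesParser


def extract_after(raw_message, marker):
    """Return the decoded lines between the first line starting with
    `marker` and the next empty line of the text/plain part."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_message)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return []
    result, started = [], False
    # get_content() undoes the quoted-printable encoding and the charset.
    for line in part.get_content().splitlines():
        if started:
            if not line.strip():
                break  # stop at the first empty line after the marker
            result.append(line)
        elif line.startswith(marker):
            started = True
    return result


# Hypothetical message modeled on the excerpt in the question.
sample = (
    b'Content-Type: text/plain; charset="Windows-1252"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"testing\r\n"
    b"StartLine (This is where stuff begins)\r\n"
    b"This is a line that should be printed.\r\n"
    b"A long line. Sooo=\r\n"
    b" Looong. Sooo L=\r\n"
    b"ooong.\r\n"
    b"\r\n"
    b"Don=92t print me please!\r\n"
)

for line in extract_after(sample, "StartLine"):
    print(line)
```

To run it over a folder, you could read each file's bytes (for example with `pathlib.Path.read_bytes()`) and pass them to the function. Because the library handles the line endings and the decoding, the script itself never has to look at carriage returns or trailing equal signs.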
