48

I'm currently working on a system that allows users to reply to notification emails that are sent out (sigh).

I need to strip out the replies and signatures, so that I'm left with the actual content of the reply, without all the noise.

Does anyone have any suggestions about the best way to do this?

Jim Neath

9 Answers

46

If your system is in-house and/or you have a limited number of reply formats, it's possible to do a pretty good job. Here are the filters we have set up for email responses to trac tickets:

Drop all text after and including:

  1. Lines that equal '-- \n' (standard email sig delimiter)
  2. Lines that equal '--\n' (people often forget the space in sig delimiter; and this is not that common outside sigs)
  3. Lines that begin with '-----Original Message-----' (MS Outlook default)
  4. Lines that begin with '________________________________' (32 underscores, Outlook again)
  5. Lines that begin with 'On ' and end with ' wrote:\n' (OS X Mail.app default)
  6. Lines that begin with 'From: ' (failsafe for Outlook and some other reply formats)
  7. Lines that begin with 'Sent from my iPhone'
  8. Lines that begin with 'Sent from my BlackBerry'

Numbers 3 and 4 are 'begin with' instead of 'equals' because users sometimes squash lines together by accident.

We try to be liberal about stripping out replies, since it's much more of an annoyance (to us) to have reply garbage than it is to correct missing text.
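For anyone who wants to try these rules out, here is a minimal Python sketch of the list above. It is not the actual trac filter code; the function name `strip_reply` and the constant names are made up for illustration, and it assumes the body is already decoded plain text.

```python
# Rules 1-2: the whole line must equal the delimiter (newline already removed).
EXACT_MARKERS = {"-- ", "--"}

# Rules 3, 4, 6, 7, 8: the line only has to begin with the marker.
PREFIX_MARKERS = (
    "-----Original Message-----",
    "_" * 32,
    "From: ",
    "Sent from my iPhone",
    "Sent from my BlackBerry",
)


def strip_reply(body: str) -> str:
    """Return the text that comes before the first line matching any rule."""
    kept = []
    for line in body.splitlines():
        is_on_wrote = line.startswith("On ") and line.endswith(" wrote:")  # rule 5
        if line in EXACT_MARKERS or line.startswith(PREFIX_MARKERS) or is_on_wrote:
            break  # drop this line and everything after it
        kept.append(line)
    return "\n".join(kept).rstrip()


if __name__ == "__main__":
    sample = "Thanks, fixed.\n\nOn Mon, Jan 1, 2024, Bob wrote:\n> old text\n"
    print(strip_reply(sample))
```

Because the loop stops at the first match, a stray 'From: ' line inside the real reply will truncate it, which is the liberal trade-off described above.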

Anybody have other formats from the wild that they want to share?

Jerome Ansia
onecreativenerd
  • 6
    gmail uses `...`, and yahoo: `...` for quoted part (including "On .. wrote:") – Milovan Zogovic Feb 18 '14 at 09:37
  • It may also be necessary to allow for one or more space characters at the start of the line (and at the end of the line, for the rules that match a line ending in \n). Rule 5, for example, could be `\n\s*On .* wrote:\s*\n*` – kehers Feb 05 '19 at 07:00
10

Check out the email_reply_parser gem - https://github.com/github/email_reply_parser . It does a nice job handling this problem.

DrewB
  • Last I checked that project was not very well maintained (6 months for pull requests to get comments, etc). I suggest https://github.com/lawrencepit/email_reply_parser instead. – DrewB Aug 25 '14 at 15:33
  • Python library for mail parsing: https://github.com/alfonsrv/mailparser-reply/ – schlumpfpirat Jan 04 '23 at 21:56
8

I don't believe you can do this reliably (signatures used to begin with '--' but I don't see that anymore). Perhaps you're better off asking people to reply in between text headers and then simply extracting the reply from there? It's not elegant, but perhaps more reliable.

e.g.

REPLY BETWEEN HERE -->

AND HERE -->

so you'd simply look for the required headers above and take what's in between.
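A rough Python sketch of that idea, assuming the two marker strings from the example appear verbatim in the reply (the function name `extract_between_markers` is hypothetical):

```python
from typing import Optional

START_MARKER = "REPLY BETWEEN HERE -->"
END_MARKER = "AND HERE -->"


def extract_between_markers(body: str) -> Optional[str]:
    """Return the text between the two markers, or None if either is missing."""
    start = body.find(START_MARKER)
    if start == -1:
        return None
    start += len(START_MARKER)  # skip past the opening marker itself
    end = body.find(END_MARKER, start)
    if end == -1:
        return None
    return body[start:end].strip()


if __name__ == "__main__":
    mail = "REPLY BETWEEN HERE -->\nSounds good to me.\nAND HERE -->\n> quoted text"
    print(extract_between_markers(mail))
```

Returning None when a marker is missing lets the caller fall back to some other heuristic, since users will sometimes delete the markers.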

Brian Agnew
4

If you want something powerful & robust, and don't mind reading academic publications, you might check out this:

Here's the homepage for one of the authors, with more info & some downloads:

mkopala
1

An approach that can be used for signatures only (in addition to detecting __ or --) is to test whether the sender's first name and/or family name appears on a short line (roughly 3 to 4 words at most).

The sender name is in the raw email header, most of the time next to the email address, as in: From: John Doe <jdoe@provider.com>

This is based on the assumption that you rarely write your own name in an email, and if you do, it is probably in a long sentence.

Of course there will be some false positives, but that may not be a big problem depending on what you do (we use it to fold quoted text and signatures behind a ... gmail-style button, so overdetection does not lose any content; it is just misplaced).
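As an illustration only, the heuristic could look something like this in Python. The function name `find_signature_start` and the 4-word threshold are assumptions for the sketch; `parseaddr` from the standard library is used to pull the display name out of the From header value.

```python
from email.utils import parseaddr
from typing import Optional

MAX_WORDS = 4  # assumed cutoff for what counts as a "short line"


def find_signature_start(from_value: str, body: str) -> Optional[int]:
    """Return the index of the first short line containing part of the
    sender's name, or None if no such line exists."""
    display_name, _address = parseaddr(from_value)
    name_parts = {part.lower() for part in display_name.split()}
    if not name_parts:
        return None
    for index, line in enumerate(body.splitlines()):
        words = line.split()
        if not words or len(words) > MAX_WORDS:
            continue  # blank or long lines are treated as normal content
        cleaned = {word.strip(".,;:!-").lower() for word in words}
        if cleaned & name_parts:
            return index
    return None


if __name__ == "__main__":
    text = "I will look at it tomorrow.\n\nCheers,\nJohn\nAcme Corp"
    print(find_signature_start("John Doe <jdoe@provider.com>", text))
```

Everything from the returned line index onward can then be folded rather than deleted, which keeps false positives harmless.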

qnilab
0

If you can assume that these emails are in plain text, just strip lines that begin with ">" as quoted replies, and treat a "-- " line as the start of the signature. But those assumptions might not hold, as not everyone on the internet uses software that follows the rules.
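In Python that would be roughly the following (a sketch under the plain-text assumption; the function name `strip_quotes_and_signature` is made up):

```python
def strip_quotes_and_signature(body: str) -> str:
    """Drop '>'-quoted lines and everything from the '-- ' delimiter onward."""
    kept = []
    for line in body.splitlines():
        if line == "-- ":
            break  # signature delimiter: ignore the rest of the message
        if line.startswith(">"):
            continue  # quoted line from the original message
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    mail = "Yes, agreed.\n> earlier message\nSee you then.\n-- \nSam\n"
    print(strip_quotes_and_signature(mail))
```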

samuil
  • This is the problem. I don't believe you can automate this reliably. – Brian Agnew Sep 03 '09 at 11:47
  • 1
    Not sure why this has been downvoted, since it's technically the same as the upvoted accepted answer, only a bit less comprehensive. Certainly not a wrong answer. –  Apr 15 '14 at 13:23
0

There's a really nice PHP library dedicated to parsing email replies:

http://williamdurand.fr/EmailReplyParser/

https://github.com/willdurand/EmailReplyParser

kachar
0

I made one for Go: https://github.com/web-ridge/email-reply-parser. It detects signatures like:

Karen The Green
Graphic Designer
Office
Tel: +44423423423423
Fax: +44234234234234
karen@webby.com
Street 2, City, Zeeland, 4694EG, NL
www.thing.com

The content of this email is confidential and intended for the recipient specified in message only. It is strictly forbidden to share any part of this message with any third party, without a written consent of the sender. If you received this message by mistake, please reply to this message and follow with its deletion, so that we can ensure such a mistake does not occur in the future.

Met vriendelijke groeten,
Richard Lindhout
Richard Lindhout
-3

The recommended signature delimiter is "-- \n". If people follow this recommendation, stripping signatures should be easy.

Community