12

I'm creating a basic system that allows users to reply to a thread on the website via email. However, most email clients include the text of the previous emails in their reply emails. This text is unwanted on the website.

Is there a reliable way in which I can extract only the new message, without prior knowledge about the earlier emails? I'm using the email class of Python.


Example message:

Content-Type: text/plain; charset=ISO-8859-1

test message! This is the part I want.

On Thu, Mar 24, 2011 at 3:51 PM, <test@test.com> wrote:

> Hi!
>
> Herman just posted a comment on the website:
>
>
> From: Herman
> "Hi there! I might be interested"
>
>
> Regards,
> The Website Team
> http://www.test.com
>

This is a reply message from gmail, I'm sure other clients might do it differently. A good start would probably be to ignore the lines that start with >, but there could also be lines like that in between the new message, and then they probably should be kept. I'll also still have the content-type line and the date line.

Herman Schaaf
  • 46,821
  • 21
  • 100
  • 139

5 Answers5

4

The formatting of email replies depend on the clients. There is no realiable way to extract the newest message without the risk of removing too much or not enough.

However, a common way to mark quotes is by prefixing them with > so lines starting with that character - especially if there are multiple at the very end or beginning of the email - are likely to be quotes.

But the On Thu, Mar 24, 2011 at 3:51 PM, <test@test.com> wrote: from your example is hard to extract. A line ending with a : right before a quote might indicate that it belongs to the quote, you cannot know that for sure - it could also be part of the new message and the colon is just a typo'd . (on german keyboards : is SHIFT+.).

ThiefMaster
  • 310,957
  • 84
  • 592
  • 636
  • It's a shame that there's no real reliable way, but the ideas you gave proved helpful, especially the point on German keyboards - thanks. – Herman Schaaf Mar 24 '11 at 14:27
  • I'd agree there's no reliable way and there really should be but each client differs. We have been experimenting at CloudMailin with a way to include this parsing by default but we don't have anything that catches everything we want yet. – Steve Smith Mar 25 '11 at 11:46
1

I think this should work

import re
string_list = re.findall(r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*", strings) # regex for On Thu, Mar 24, 2011 at 3:51 PM
res = strings.split(string_list[0]) # split on that match
print(res[0]) # get before string of the regex
LAMRIN TAWSRAS
  • 302
  • 1
  • 8
1

The answer @LAMRIN TAWSRAS gave will work for parsing the text before the Gmail date expression only if a match is found, otherwise an error will be thrown. Also, there isn't a need to search the entire message for multiple date expressions, you just need the first one found. Therefore, I would refine his solution to use re.search():

def get_body_before_gmail_reply_date(msg):
  body_before_gmail_reply = msg
  # regex for date format like "On Thu, Mar 24, 2011 at 3:51 PM"
  matching_string_obj = re.search(r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*", msg)
  if matching_string_obj:
    # split on that match, group() returns full matched string
    body_before_gmail_reply_list = msg.split(matching_string_obj.group())
    # string before the regex match, so the body of the email
    body_before_gmail_reply = body_before_gmail_reply_list[0]
  return body_before_gmail_reply
lweibel
  • 11
  • 2
0

Try this:

import re
def deleteForwardedMessagesFromMessage(message):
    nextMessage = re.split(r"\n.*[\,].*\<\s*.*>", message)[0]
    print(nextMessage)
    return nextMessage
0

Try this it works for french/english emails :

On Thu, Mar 24, 2011 at 3:51 PM, test@test.com wrote:

Le mer. 28 avr. 2021 à 10:03, test.test@orange.com a écrit :

regex=r'\w+\s+\w+[,.]\s+(\w+\s+\d+|\d+\s+\w+)[,.]\s+\d+\s+(\w+|\à)\s+\d+[:]\d+(\s+\w+)?,?\s+(\s*<?\s*[a-zA-Z][._-]?[a-zA-Z][@]\w+[.]?\w{2,3}\s*>?;?\s*)(a écrit|wrote)\s*:'