
I'm trying to parse out the text of an email reply and drop the quoted text (and anything that follows it, including the signature).

This code is returning: message tests On Tue, Jun 25, 2013 at 10:01 PM, Catie Brand <

I want it to return simply: message tests

What regex am I missing?

def format_mail_plain(value, from_address):
    res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'\s+wrote:', re.IGNORECASE  | re.MULTILINE),
           re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
           re.compile(r'from:\s*$', re.IGNORECASE),
           re.compile(r'^>.*$', re.IGNORECASE | re.MULTILINE)]

    whitespace_re = re.compile(r'\s+')

    lines = list(line.rstrip() for line in value.split('\n'))

    result = ''
    for line_number, line in zip(range(len(lines)), lines):
        for reg_ex in res:
            if reg_ex.search(line):
                return result

        if not whitespace_re.match(line):
            if result == '':
                result += line
            else:
                result += '\n' + line

    return result




************************ Sample Text *****************************
message tests 
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX < 
conversations+yB1oupeCJzMOBj@xxxx.com> wrote: 
> ** 
>    [image: Krow] <http://www.krow.com/>


************************ Result **********************************
message tests
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <

I'd rather the result be:

************************ Result **********************************
message tests
fansonly
  • Can you show the sample text, the current output, and the desired output? Ideally with examples of how it's failing. – Ro Yo Mi Jun 26 '13 at 03:15
  • Why are you generating the line numbers, which you aren't using? Also, if you actually need the line numbers, have you considered using the Python builtin `enumerate`? [PEP 279](http://www.python.org/dev/peps/pep-0279/) and it's also the answer to this question: http://stackoverflow.com/questions/126524/iterate-a-list-with-indexes-in-python – pcurry Jun 26 '13 at 03:28

2 Answers


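In your sample input, the line that starts with On never matches On.*?wrote:, because the On ... wrote: header spans two lines and your loop tests one line at a time. That line is appended to the result before the next line, which contains wrote:, stops the loop.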
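A minimal demonstration of the difference, using the sample text from the question (the variable names are illustrative and not taken from the original code):

```python
import re

# Sample text from the question; the quote header wraps over two lines.
sample = (
    "message tests \n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX < \n"
    "conversations+yB1oupeCJzMOBj@xxxx.com> wrote: \n"
    "> ** \n"
)

# The header pattern from the question.
header_re = re.compile(r'On.*?wrote:', re.IGNORECASE | re.MULTILINE | re.DOTALL)

lines = [line.rstrip() for line in sample.split('\n')]

# The line that starts the header has no "wrote:" on it, so it slips through.
print(header_re.search(lines[1]))  # None

# Against the unsplit text, DOTALL lets '.' cross the line break.
match = header_re.search(sample)
print(match.start() == sample.index('On Tue'))  # True
```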

I changed your code to replace everything matched by ^On.*?wrote:\s* with an empty string before the text is split into lines.

import re


def format_mail_plain(value, from_address):
    value = re.compile(r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL).sub('', value)
    res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
           re.compile(r'^from:', re.IGNORECASE),
           re.compile(r'^>')]

    lines = filter(None, [line.rstrip() for line in value.split('\n')])

    result = []
    for line in lines:
        result.append(line)
        for reg_ex in res:
            if reg_ex.search(line):
                result.pop()
                break

    return '\n'.join(filter(None, result))
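Note that the function above removes matching lines but keeps scanning, so unquoted text after the quoted block (a signature, for example) would still be returned. If everything from the quote header onward should be dropped, as the question asks, one possible variation is to cut the text at the first match. This is only a sketch under that assumption: strip_quoted_reply is a hypothetical helper, not part of the original answer, and a body line that happens to start with "On" would also trigger the cut if "wrote:" appears anywhere later in the text.

```python
import re

# Matches an "On ... wrote:" header even when it wraps over several lines.
QUOTE_HEADER_RE = re.compile(
    r'^On\b.*?wrote:', re.IGNORECASE | re.MULTILINE | re.DOTALL)
# Matches the start of the first line that begins with the '>' quote marker.
QUOTED_LINE_RE = re.compile(r'^>', re.MULTILINE)


def strip_quoted_reply(text):
    """Return the text before the first quote header or quoted line."""
    cut = len(text)
    for regex in (QUOTE_HEADER_RE, QUOTED_LINE_RE):
        match = regex.search(text)
        if match:
            cut = min(cut, match.start())
    # Drop trailing whitespace on each kept line and skip blank lines.
    kept = (line.rstrip() for line in text[:cut].split('\n'))
    return '\n'.join(line for line in kept if line)


sample = (
    "message tests \n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX < \n"
    "conversations+yB1oupeCJzMOBj@xxxx.com> wrote: \n"
    "> ** \n"
    ">    [image: Krow] <http://www.krow.com/>\n"
)
print(strip_quoted_reply(sample))  # message tests
```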
falsetru

The regex that you are expecting to catch 'On Tue, Jun 25 ...' is

re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL)

That won't match the line that starts with 'On', because the 'wrote:' in your sample text has already been split onto another line by the time the regex sees the string. Since you want to stop processing the message once you have seen that header, replace it, before you split the string, with something that will otherwise cause your processing loop to exit. I would suggest the leading quote character '>'. falsetru caught this first; I incorporated the replacement idea into my answer.

Your regular expressions seem to be written to avoid alternation entirely. Was that an attempt at improving performance?

I would reduce the number of regular expressions, drop whitespace-only lines when the list of lines is generated, and use plain string tests in place of one- and two-character regular expressions. Try this:

import re


def format_mail_plain(value, from_address):
    # Replace a (possibly multi-line) "On ... wrote:" header with '>' so the
    # loop below treats it like quoted text and stops there.
    on_wrote_regex = re.compile(
        r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL)
    value = on_wrote_regex.sub('>', value)
    res = [re.compile(r'from:\s*(?:' + re.escape(from_address) + r'|$)',
                      re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'\s+wrote:', re.IGNORECASE),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE)]

    result = ''
    # Whitespace-only lines are skipped up front, so every line is non-empty.
    for line in (text_line.rstrip()
                 for text_line in value.split('\n')
                 if text_line.strip()):
        if line[0] == '>':
            return result

        for reg_ex in res:
            if reg_ex.search(line):
                return result

        if result == '':
            result += line
        else:
            result += '\n' + line

    return result
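To illustrate the alternation mentioned above: the two separate From: patterns in the question can be merged into a single expression. A small sketch, with a made-up address used only for the demonstration:

```python
import re

from_address = 'someone@example.com'  # hypothetical address for illustration

# One pattern instead of two: "from:" followed by the address,
# or "from:" followed only by optional whitespace up to the end of the line.
from_re = re.compile(
    r'from:\s*(?:' + re.escape(from_address) + r'|$)', re.IGNORECASE)

print(bool(from_re.search('From: someone@example.com')))  # True
print(bool(from_re.search('From:')))                      # True
print(bool(from_re.search('From: other@example.com')))    # False
```

Whether this is faster than two separate patterns is not something the sketch measures; it only shows that the merged pattern accepts the same two cases.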
pcurry