This is not a good use for regex if you're trying to do it all in one pattern. It's possible to do, but I suspect the universe will cool before you work all the bugs out.
To understand the scope of what you are trying to do, read Wikipedia's article on "Posting Style". There are a lot of different ways replies are embedded into an email message, partly controlled by the MUA (mail user agent) and partly by the person doing the reply. There isn't a set method of doing the attribution, and no rule saying that the reply is in one block on the page, or that it is at the top of the page. This means that any code you write will have to be very sophisticated in order to have a chance of working consistently.
Have you looked at Mail
? It's already written, it's well tested, it's got all sorts of cool bells and whistles, and it's already written. (I said it again because reinventing wheels that work well can be really painful.)
Parsing plain text email is one task. Then there is MIME-encoded email, with different content types. Then there is "HTML" email that doesn't have MIME blocks, but instead some moron just figured everyone liked HTML formatting and blinking text. Then there's various weirdly broken types of message bodies with four reply quoting types and the full content of all the previous messages appended one below the next, and the signatures of the horribly frustrated wanna-be writers who include the whole text of my favorite book "Girl to Grab", AKA Vol. 5 of Encyclopedia Britannica. Mail
can help break out all the garbage for you, giving you a good shot at the content you need.
To grab a range of text in a body, look at Ruby's ..
(AKA "flip-flop") operator. It's designed to return a Boolean true/false when two different tests occur. See "When would a Ruby flip-flop be useful?"
Typically you'd build it like:
if ((string =~ /pattern1/) .. (string =~ /pattern2/))
...
end
As processing occurs, if the first test matches something then subsequent loops will fall into the if
block. When the ending test is found the block will be turned off for subsequent loops. In this case you'd want to use either a string literal, or a small regex to locate your starting and ending lines. If you have a chance of seeing the starting pattern in later text then you'll have to figure out how to trap that.
For instance, here's a way to grab some content that appears to meet your stated requirements if someone does a top-reply:
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
li.chomp!
body << li
break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
puts '=' * 40
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad@sitename.com> wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
li.chomp!
body << li
break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
And here is the output:
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
# >> ========================================
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
The pattern could be simpler, but if it was it would increase the chance of returning false-positives.