Looking for ideas on how to match a pattern, Possible or not?

Question

I'm looking for assistance creating a pattern match to ingest emails. The end goal is to receive an incoming message and extract just the reply message, not all the trailing junk (previous threads, signature, datestamp header, etc.).

Here are three sample formats:

Format 1:

The Message is here, etc etc can span a random # of lines

On Nov 17, 2010, at 4:18 PM, Person Name wrote:

lots of junk down here which we don't want

Format 2:

The Message is here, etc etc can span a random # of lines

On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad@sitename.com> wrote:

lots of junk down here which we don't want

Format 3:

The Message is here, etc etc can span a random # of lines

On Fri, Nov 19, 2010 at 1:57 AM, <customerserviceonline@pge.com> wrote:

lots of junk down here which we don't want

For all the examples above, I'd like to create a pattern match that finds the first instance of the delimiter line (the "On ... wrote:" line) and then returns only what's above that line. I don't want the delimiter line itself.

I can't match on the date stamp, but I can match on everything after the comma as that's in my control.

So the idea is to look for either of these two static items:

, Site <yadaaaa+adad@sitename.com> wrote:
, Person Name wrote:

And then take everything above that position. What do you think? Is this possible?

Probably should have mentioned: I'm using Rails 3, so a Ruby method is ideal. – AnApprentice Nov 18 '10 at 17:44
@meagar, that's awesome. I wasn't even sure! I'm a newbie, maybe you can provide a few tips, so I have a starting direction? I've never done this before. Just started learning Ruby on Rails a month ago. – AnApprentice Nov 18 '10 at 17:45
A month is enough to find at least one tutorial about regex... – Nakilon Nov 18 '10 at 17:48
@Nakilon, thanks. I've found that: real simple regex-type finds in Ruby, doing something like split().first. But how do I tell Ruby to ignore that line and take all the lines above? – AnApprentice Nov 18 '10 at 17:50
Are there any tokens (sequences of characters) you can use to distinguish or delimit the message in question? The reason I'm asking is because `On Nov 17, 2010, at 4:18 PM, Site wrote:` could even be in the message itself, which is *no good*. – John Nov 18 '10 at 17:58
@John isn't ", Site wrote:" the token / sequence of characters? – AnApprentice Nov 18 '10 at 18:09
Sorry, what I meant is that it isn't unique, and that the actual token you're thinking of using as a delimiter could potentially be part of the message. In other words, you could end up with multiple matches. – John Nov 18 '10 at 18:17
@john, anything's possible but I think that's very rare. Goal is to find the first instance. Ideas? – AnApprentice Nov 18 '10 at 18:18
If that is the rule, it would make parsing much easier. I'm a bit biased. I love using regex for things like this. Unfortunately I don't have time to conjure something up for you at the moment. I just wanted to help steer the conversation in the right direction. I'll try to get to this tonight if no one else answers by then. – John Nov 18 '10 at 18:25
I should mention though, that if the rule is that the first instance of this token marks the end of the message, then VP's answer is not a bad option either. Again, some caution should be taken when parsing large files. – John Nov 18 '10 at 18:27

score 2 · Answer 1 · answered Nov 18 '10 at 17:52

I would suggest a different approach: why don't you read everything line by line and break when you match the line that you use as the stop marker?
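
A minimal sketch of that idea, assuming the two static line endings listed in the question are the only ones that need to be recognised. The constant `DELIMITER_ENDINGS` and the method `reply_above_delimiter` are hypothetical names used only for illustration:

```ruby
# Static endings of the delimiter line, copied from the question.
# Hypothetical constant; extend the list for other senders you control.
DELIMITER_ENDINGS = [
  ', Site <yadaaaa+adad@sitename.com> wrote:',
  ', Person Name wrote:'
].freeze

# Keeps lines until the first one that starts with "On " and ends with a
# known ending. Returns the whole text (stripped) if no such line exists.
def reply_above_delimiter(text)
  kept = text.lines.take_while do |line|
    stripped = line.strip
    is_delimiter = stripped.start_with?('On ') &&
                   DELIMITER_ENDINGS.any? { |ending| stripped.end_with?(ending) }
    !is_delimiter
  end
  kept.join.strip
end

email = <<EOT
The Message is here
and spans two lines

On Nov 17, 2010, at 4:18 PM, Person Name wrote:

lots of junk down here which we don't want
EOT

puts reply_above_delimiter(email)
```

Because the random datestamp is never inspected, only the "On " prefix and the static ending, this avoids having to describe the date in a pattern at all.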

VP.

@Vp, thanks I like the push back! The issue is the first part of that line, the datestamp, isn't something I'll know to match against, it's random. It's only the comma and what comes after it, " Site wrote:", which is static and which I can match against. Right? – AnApprentice Nov 18 '10 at 17:53
This could work, but I always express caution processing text files this way, especially if they have the potential to be large. If you will be processing text this way, it is best to do it as a background process. – John Nov 18 '10 at 18:04
Thanks I'm open for suggestions? – AnApprentice Nov 18 '10 at 18:08
can you force html email? it would be easier to parse – VP. Nov 18 '10 at 18:29
that's an interesting idea. I can't force HTML email, I was thinking plain text would actually simplify things? How would I read everything and break when matched? – AnApprentice Nov 18 '10 at 18:33
I'm new at Regex, here's what I'm trying so far, is this headed in the right direction? gsub(/^On/+/,Person Name wrote:$/, '').first.strip – AnApprentice Nov 18 '10 at 18:34
If, on the line where you found the break, you do something like split(/Site/)[0], you will get what comes before "Site" in that same line. – VP. Nov 18 '10 at 18:34
Hehe, I think you mean VP. =) – John Nov 18 '10 at 19:25
woops! sorry... What about something like this, it doesn't work yet but is it close? "sub(/\A.*^\SOn \w+ \d+, \d+ at.* wrote:.*/m, '').first" – AnApprentice Nov 18 '10 at 19:26

score 1 · Answer 2 · edited May 23 '17 at 10:26

This is not a good use for regex if you're trying to do it all in one pattern. It's possible to do, but I suspect the universe will cool before you work all the bugs out.

To understand the scope of what you are trying to do, read Wikipedia's article on "Posting Style". There are a lot of different ways replies are embedded into an email message, partly controlled by the MUA (mail user agent) and partly by the person doing the reply. There isn't a set method of doing the attribution, and no rule saying that the reply is in one block on the page, or that it is at the top of the page. This means that any code you write will have to be very sophisticated in order to have a chance of working consistently.

Have you looked at Mail? It's already written, it's well tested, it's got all sorts of cool bells and whistles, and it's already written. (I said it again because reinventing wheels that work well can be really painful.)

Parsing plain text email is one task. Then there is MIME-encoded email, with different content types. Then there is "HTML" email that doesn't have MIME blocks, but instead some moron just figured everyone liked HTML formatting and blinking text. Then there's various weirdly broken types of message bodies with four reply quoting types and the full content of all the previous messages appended one below the next, and the signatures of the horribly frustrated wanna-be writers who include the whole text of my favorite book "Girl to Grab", AKA Vol. 5 of Encyclopedia Britannica. Mail can help break out all the garbage for you, giving you a good shot at the content you need.

To grab a range of text in a body, look at Ruby's .. (AKA "flip-flop") operator. Used as a condition, it evaluates to true from the point where the first test succeeds until the second test succeeds, and to false otherwise. See "When would a Ruby flip-flop be useful?"

Typically you'd build it like:

if ((string =~ /pattern1/) .. (string =~ /pattern2/))
    ...
end

As processing occurs, if the first test matches something then subsequent loops will fall into the if block. When the ending test is found the block will be turned off for subsequent loops. In this case you'd want to use either a string literal, or a small regex to locate your starting and ending lines. If you have a chance of seeing the starting pattern in later text then you'll have to figure out how to trap that.
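
As a small illustration of the flip-flop on its own, here is a sketch that collects the lines between two markers. The `BEGIN` and `END` marker lines and the variable names are made up for the example; they are not part of the email formats in the question:

```ruby
text = <<EOT
preamble
BEGIN
wanted line 1
wanted line 2
END
trailer
EOT

selected = []
text.each_line do |line|
  line = line.chomp
  # The flip-flop switches on at the line matching the first test and stays
  # on through the line matching the second test, inclusive.
  if (line =~ /^BEGIN$/)..(line =~ /^END$/)
    selected << line
  end
end

puts selected
```

Note that both marker lines are included in the result, so they have to be dropped afterwards if only the content in between is wanted.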

For instance, here's a way to grab some content that appears to meet your stated requirements if someone does a top-reply:

msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod

On Nov 17, 2010, at 4:18 PM, Person Name wrote:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT

body = []
msg.lines.each do |li|
  li.chomp!
  body << li
  break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]

puts '=' * 40

msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod

On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad@sitename.com> wrote:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT

body = []
msg.lines.each do |li|
  li.chomp!
  body << li
  break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]

And here is the output:

# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >> 
# >> ========================================
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>

The pattern could be simpler, but if it was it would increase the chance of returning false-positives.
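
To make the false-positive risk concrete, here is a rough comparison, assuming a message whose own text happens to contain "wrote:". `STRICT` is the pattern used above; `LOOSE` is a deliberately oversimplified pattern invented for this comparison:

```ruby
STRICT = /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i
LOOSE  = /wrote:/i # hypothetical simplification, for comparison only

msg = <<EOT
Thanks for the update.
As Bob wrote: we should ship on Friday.

On Nov 17, 2010, at 4:18 PM, Person Name wrote:

older quoted text
EOT

lines = msg.lines.map(&:chomp)

# Index of the first line each pattern treats as the delimiter.
loose_index  = lines.index { |li| li =~ LOOSE }
strict_index = lines.index { |li| li =~ STRICT }

puts "loose pattern stops at line #{loose_index}: #{lines[loose_index]}"
puts "strict pattern stops at line #{strict_index}: #{lines[strict_index]}"
```

The loose pattern cuts the reply off at the sentence quoting Bob, while the stricter one only stops at the attribution line.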

Me either. Obviously someone was afraid to say why too. – the Tin Man Nov 18 '10 at 22:32

Nicolas Guillaume · Accepted Answer · 2010-11-20T09:27:43.227

Well, this would be a regexp solution:

/(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/

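You just provided one example so this might not be perfect but it should do the job quite well.

Then, you have to get the first captured group with $1, or with [0] if you are using match (the group wraps the whole pattern, so the full match [0] and the first group [1] are identical here) :)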

regex =  /(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/

if str =~ regex
  puts "S1 : #{$1}"
end

if res = str.match(regex)
  puts "S2 : #{res[0]}"
end

Btw, you can use the option /i on the regex.
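
Since the goal in the question is the text above the delimiter rather than the delimiter itself, one possible follow-up is to take `MatchData#pre_match`. This is only a sketch: it reuses the pattern from this answer unchanged, and the constant `DELIMITER` and the helper `reply_only` are hypothetical names:

```ruby
# The pattern from the answer above, unchanged.
DELIMITER = /(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/

# Hypothetical helper: returns everything before the first delimiter line,
# or the whole (stripped) string when no delimiter is found.
def reply_only(str)
  match = DELIMITER.match(str)
  match ? match.pre_match.strip : str.strip
end

sample = <<EOT
The Message is here, etc etc can span a random # of lines

On Fri, Nov 19, 2010 at 1:57 AM, <customerserviceonline@pge.com> wrote:

lots of junk down here which we don't want
EOT

puts reply_only(sample)
```

Falling back to the whole string when nothing matches is a design choice for the sketch; returning nil instead would make unrecognised formats easier to spot.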

edited Nov 20 '10 at 09:27

answered Nov 19 '10 at 11:12

Nicolas Guillaume

What if the months aren't abbreviated or the day or abbreviated day precedes the month? – the Tin Man Nov 19 '10 at 15:41
@Greg, as long as the regex matches the two use cases above it's good to go for now... What do you think? – AnApprentice Nov 19 '10 at 17:49
What are the if statements all about? – AnApprentice Nov 19 '10 at 17:50
Ahh figured out why this is breaking, it needs to also support "On Fri, Nov 19, 2010 at 1:57 AM, wrote:" I'm updating the question. sorry about that. – AnApprentice Nov 19 '10 at 17:58
The regex appears to be erroring: "body.sub(/(On (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}, at \d{1,2}:\d{1,2} (?:AM|PM), (?:Site <[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/).strip" – AnApprentice Nov 19 '10 at 18:02
"Person" is listed literally in the regex; that part should allow for anything, since the name won't be static. – AnApprentice Nov 19 '10 at 18:05
Updated to match your 3rd format. Btw, you can replace the days and months abbreviations with "\w{3,5}". It will be more flexible but less filtering :) – Nicolas Guillaume Nov 20 '10 at 09:32
