67

I'm trying to match the From line all the way to the end of the Subject line in the following:

....
From: XXXXXX 
Date: Tue, 8 Mar 2011 10:52:42 -0800 
To: XXXXXXX
Subject: XXXXXXX
....

So far I have:

/From:.*Date:.*To:.*Subject/m

But that doesn't match to the end of the subject line. I tried adding $ but that had no effect.

user664833
AnApprentice
  • You seem to know about multiple lines, but your data sample doesn't show any. I don't see multiple lines at all, just one long string stretching from From: ... to Subject XXXXX. That is too simple to need a regex. Why aren't you providing an exact sample? –  Mar 09 '11 at 00:47
  • Using "Hello" and "thxs" is discouraged in Stack Overflow. http://meta.stackexchange.com/questions/2950/should-hi-thanks-and-taglines-and-salutations-be-removed-from-posts – Andrew Grimm Mar 09 '11 at 00:56
  • @sln: The data sample does have multiple lines, but the OP didn't bother checking that it formatted as he intended. – Andrew Grimm Mar 09 '11 at 00:58
  • @Andrew Grimm - I surmised that before I posted. Yet 4 answers appeared before you fixed the OP's formatting. Kinda strange –  Mar 09 '11 at 01:24
  • @Andrew Grimm - I know it's in the interest of learning here on SO, but it appears voting is corrupting that principle. –  Mar 09 '11 at 01:28

5 Answers

91

You can use the /m modifier to enable multiline mode (in Ruby, this allows . to match newlines), and you can append ? to a quantifier (as in .*?) to make it non-greedy:

message = <<-MSG
Random Line 1
Random Line 2
From: person@example.com
Date: 01-01-2011
To: friend@example.com
Subject: This is the subject line
Random Line 3
Random Line 4
MSG

message.match(/(From:.*Subject.*?)\n/m)[1]
=> "From: person@example.com\nDate: 01-01-2011\nTo: friend@example.com\nSubject: This is the subject line"

See http://ruby-doc.org/core/Regexp.html and search for "multiline mode" and "greedy by default".

user664833
Pan Thomakos
  • That works great. Is there anything wrong with using this? The other two answers seem a little against this method. – AnApprentice Mar 09 '11 at 00:38
  • I don't see anything wrong with this approach. The specifics really depend on what you want to capture with the regexp. I think the main things to keep in mind are the ? operator and the /m switch. These two techniques will really let you home in on multi-line data with a regexp. – Pan Thomakos Mar 09 '11 at 00:45
  • 1
    @AnApprentice- The main drawback is that using a regular expression to do this places very strict requirements on the formatting of the input. This technique works for this specific example, but it may not work if there is any variation at all in the input (which order the fields are listed in, etc). I've had a number of bad experiences with single regular expressions that cover multiple input lines, and I usually encourage a more general, non-regex solution. If your input is strictly controlled and will always adhere to this exact format, then you should be able to use something like this. – bta Mar 09 '11 at 00:47
  • 1
    You shouldn't rule out regular expressions just because the format of your input changes, you might just need to modify your expressions or extract data in a different way. For instance, if you are worried about the ordering of the fields, then instead of using a single regular expression for all the data (from, to, date, subject) you can use 4 different regular expressions to capture each piece of data individually. – Pan Thomakos Mar 09 '11 at 01:00
  • If the input text has more than one Subject, how can I select the text up to the first occurrence of Subject? – MIZ Jun 30 '14 at 11:19
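
Two points raised in the comments above can be sketched briefly: stopping at the first Subject line when several are present (by making the first quantifier lazy as well), and pulling each field out with its own small pattern so that field order stops mattering. The sample text and the variable names here are made up for illustration, so treat this as a sketch rather than a definitive solution.

```ruby
# Hypothetical sample containing two Subject lines.
text = <<~MSG
  From: person@example.com
  Date: 01-01-2011
  To: friend@example.com
  Subject: First subject
  Some body text
  Subject: Second subject
  More text
MSG

# With /m, a greedy .* runs on to the LAST "Subject";
# a lazy .*? stops at the FIRST one.
greedy = text[/From:.*Subject[^\n]*/m]
lazy   = text[/From:.*?Subject[^\n]*/m]

# One small pattern per field. In Ruby, ^ and $ match at line boundaries,
# and without /m the dot does not cross a newline.
fields = %w[From Date To Subject].map do |name|
  [name, text[/^#{name}:\s*(.*)$/, 1]]
end.to_h
```

Here `lazy` ends at the first Subject line and `greedy` at the second, while `fields` is a hash keyed by header name holding the first value found for each.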
12

If you are using Ruby, you can try:

Regexp.new("some reg", Regexp::MULTILINE)
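
As a rough illustration of the constructor form applied to the question, the sketch below builds the pattern from a string. The pattern is the question's own, extended with `[^\n]*` to reach the end of the Subject line, and the sample string is an assumption.

```ruby
# Assumed sample input.
sample = "From: a@example.com\nDate: Tue, 8 Mar 2011\nTo: b@example.com\nSubject: Hello\nBody\n"

# Regexp::MULTILINE is the same option as the /m flag on a literal:
# it lets . match newline characters.
# Note the doubled backslash, needed because the pattern is a Ruby string.
re = Regexp.new("From:.*Subject:[^\\n]*", Regexp::MULTILINE)

header = sample[re]
```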

If you are not using Ruby, I suggest this workaround:

  1. replace all the "\n" with SOME_SPECIAL_TOKEN
  2. search the regexp, and do other operations...
  3. restore: replace SOME_SPECIAL_TOKEN with "\n"
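
The three steps above can be sketched as follows (written in Ruby only to stay consistent with the rest of the thread). The pipe token and the sample text are assumptions; the token must be something that never occurs in the real input.

```ruby
TOKEN = "|"  # assumed never to appear in the input
text = "x\nFrom: a\nDate: d\nTo: b\nSubject: s\nrest\n"

# 1. Replace every newline with the token.
flat = text.gsub("\n", TOKEN)

# 2. Search with an ordinary single-line pattern; the header now ends
#    at the next token instead of at a newline.
hit = flat[/From:.*?Subject:[^#{Regexp.escape(TOKEN)}]*/]

# 3. Restore the newlines in the matched text.
result = hit && hit.gsub(TOKEN, "\n")
```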
Siwei
5

If you want to match across linebreaks, one possibility is to first replace all newline characters with some other character (or character sequence) that wouldn't otherwise appear in the text. For example, if you have all of the text in one string variable you can do something like aString.split("\n").join("|") to replace all newlines in the string with pipe characters.

Also, look at Alan Moore's answer to your previous question regarding how to match the newline character in a regular expression.

bta
  • Thanks bta, I'd rather not replace newline characters. Is there no regex way to do this? – AnApprentice Mar 09 '11 at 00:35
  • Regardless of the method, this seems like an ugly use of a regular expression. It would probably be cleaner and more robust to create a class that parses out the individual fields and stores them in member variables. Since it looks like you're parsing an email message, there's probably already a class out there that will do that for you. – bta Mar 09 '11 at 00:36
4

Try:

/...^Subject:[^\n]*/m
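
Reading the `...` as standing for the front part of the question's pattern (an assumption, since the answer leaves it out), a complete version might look like the sketch below. The sample message is borrowed from the accepted answer's style.

```ruby
message = "Random Line 1\nFrom: person@example.com\nDate: 01-01-2011\n" \
          "To: friend@example.com\nSubject: This is the subject line\nRandom Line 3\n"

# /m lets .* cross newlines; ^ anchors "Subject:" to the start of a line
# (in Ruby ^ always matches at line starts); [^\n]* then takes everything
# up to, but not including, the next newline.
headers = message[/^From:.*^Subject:[^\n]*/m]
```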

DigitalRoss
1

Using the following data:

From: XXXXXX
Date: Tue, 8 Mar 2011 10:52:42 -0800
To: XXXXXXX
Subject: XXXXXXX

The following regex will do the magic:

From:([^\r\n]+)[\r\n]+Date:([^\r\n]+)[\r\n]+To:([^\r\n]+)[\r\n]+Subject:([^\r\n]+)[\r\n]+
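
A quick way to try the pattern in Ruby is sketched below. The constant name and the CRLF line endings in the sample are assumptions. Note that the trailing `[\r\n]+` means the Subject line must be followed by at least one line break for the match to succeed.

```ruby
HEADER_RE = /From:([^\r\n]+)[\r\n]+Date:([^\r\n]+)[\r\n]+To:([^\r\n]+)[\r\n]+Subject:([^\r\n]+)[\r\n]+/

data = "From: XXXXXX\r\nDate: Tue, 8 Mar 2011 10:52:42 -0800\r\n" \
       "To: XXXXXXX\r\nSubject: XXXXXXX\r\n"

m = HEADER_RE.match(data)

# Each capture keeps the space after the colon, so strip it.
from, date, to, subject = m.captures.map(&:strip) if m
```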

But I would recommend that you don't try to do this in one regex. Instead, match each line against a pattern such as "^(\w+):(.+)$", unless you are sure that the order of the From/Date/To/Subject lines is not going to change ;)

chkdsk
  • It will change. Sometimes there could be an additional line, ON BEHALF OF for example, so I guess this won't work? – AnApprentice Mar 09 '11 at 00:34
  • Best is to use something which breaks each line. – chkdsk Mar 09 '11 at 00:35
  • (sorry new to stackoverflow didn't know enter submits the comment :P ) – chkdsk Mar 09 '11 at 00:35
  • What do you mean by something that breaks each line? – AnApprentice Mar 09 '11 at 00:36
  • Best is to use something which breaks each line, then separate by ":" as in "^\s*(\w+)\s*:\s*(.+)\s*$", then take match 1 as the keyword and match 2 as the value. Push them into a hash, then check whether you got what you wanted at the end of the parsing. Hope this helps – chkdsk Mar 09 '11 at 00:37
  • breaks each line as in: "From: XXXX" is read as 1 line, "Date: XXX" as 2nd line, and so on. – chkdsk Mar 09 '11 at 00:38
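
The line-by-line approach described in these comments could look roughly like this. The pattern is the one quoted in the comment; the sample text, including the extra `Sender` line, and the variable names are hypothetical. Because the key is matched with `\w+`, a key containing spaces (such as "On Behalf Of") would need a looser pattern.

```ruby
LINE_RE = /^\s*(\w+)\s*:\s*(.+)\s*$/  # pattern from the comment above

text = "From: XXXXXX\nSender: YYYY\nDate: Tue, 8 Mar 2011 10:52:42 -0800\n" \
       "To: XXXXXXX\nSubject: XXXXXXX\n"

headers = {}
text.each_line do |line|
  m = LINE_RE.match(line.chomp)
  # Match 1 is the keyword, match 2 the value; non-matching lines are skipped.
  headers[m[1]] = m[2].strip if m
end

# At the end of parsing, check that everything wanted was found.
wanted  = %w[From Date To Subject]
missing = wanted - headers.keys
```

An extra or reordered line does not break this version; it just adds another key to the hash.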