
This is a variant of this question and this other question (also asked by me).

I have a string that I need to parse using regex. The string is something like:

The XXX is blue.
The XXX is blue,
and the YYY is green.
The XXX is blue,
and the YYY is green.
The XXX is blue.
The XXX is blue.
The XXX is blue.
The XXX is blue.
The XXX is blue,
and the YYY is green.

The code above represents one single string, including line feeds. Note how some sentences are followed by an optional subclause after a comma. In those two-part sentences, the YYY "belongs to" the preceding XXX.

I need to match all the XXX and their corresponding YYY, so the result should look something like:

[1][1] XXX
[1][2]
[2][1] XXX
[2][2] YYY
[3][1] XXX
[3][2] YYY
[4][1] XXX
[4][2]
[5][1] XXX
[5][2]
etc.

XXX and YYY could be any sequence of characters (`.*`).

How can I write a regex that will match both XXX and YYY? (Remember, YYY could be optional. I use PHP.)


2 Answers


The answer to this is very similar to the first question you linked:

The (.*?) is blue(?:\.|,\nand the (.*?) is green\.)

See it working: http://www.rubular.com/r/MONXq83J80
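For PHP, the pattern above could be applied roughly as follows. This is only a sketch: the sample words (sky, sea, grass, car) stand in for XXX and YYY, the variable names are made up for illustration, and the input is assumed to use plain `\n` line feeds (Windows `\r\n` line endings would need `\r?\n` in the pattern).

```php
<?php
// Hypothetical sample input: one string containing real line feeds.
$text = "The sky is blue.\nThe sea is blue,\nand the grass is green.\nThe car is blue.";

// Single quotes keep the backslashes intact, so PCRE itself interprets \n and \.
$pattern = '/The (.*?) is blue(?:\.|,\nand the (.*?) is green\.)/';

// PREG_SET_ORDER yields one array per sentence: [full match, XXX, YYY].
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // When the optional subclause is absent, group 2 is missing from $m,
    // so fall back to an empty string.
    $pairs[] = [$m[1], $m[2] ?? ''];
}

print_r($pairs);
```

Each entry of `$pairs` then holds the XXX at index 0 and its YYY (or an empty string) at index 1, which mirrors the layout asked for in the question.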

Andrew Clark
  • won't the `\n` only match actual line feed characters? don't you need to escape them again? – Code Jockey Apr 10 '12 at 19:16
  • @CodeJockey - Considering the OP states "please note the newlines!" in his post, I think these are actually line-feed characters, as opposed to a `\ ` followed by an `n`. – Andrew Clark Apr 10 '12 at 19:21
  • @what - I just edited my answer, it should now be doing what you want, if you still have problems I can try to provide a PHP code example. – Andrew Clark Apr 10 '12 at 19:50
  • Sorry, F.J, I can't vote up, because I don't have 15 reputation, but your answer works perfectly. Thank you! :-) –  Apr 10 '12 at 20:01
  • Glad it worked, you can [accept my answer](http://meta.stackexchange.com/a/5235/155356) by clicking on the outline of the check mark next to the answer, which is even better than an up vote :) – Andrew Clark Apr 10 '12 at 20:04

Since every sentence seems to follow the same structure, the triggers could simply be
The/the. There is no need to spell out the rest of each sentence.

/^The (.*) is.*(?:\n.*the (.*) is)?/m

Use global and multiline mode only (not dot-all mode, so the dot does not match newlines).
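PHP has no global flag; `preg_match_all` plays that role, and the `m` modifier supplies multiline mode. A minimal sketch under the same assumptions as the question (plain `\n` line feeds; the sample words and variable names are invented for illustration) might look like this:

```php
<?php
// Hypothetical sample input: one string containing real line feeds.
$text = "The sky is blue.\nThe sea is blue,\nand the grass is green.\nThe car is blue.";

// "m" lets ^ match at the start of every line; "s" is deliberately left out
// so that the dot stops at line feeds.
$pattern = '/^The (.*) is.*(?:\n.*the (.*) is)?/m';

// preg_match_all is the PHP counterpart of a "global" search.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // Group 2 is missing from $m when no subclause follows the sentence.
    $pairs[] = [$m[1], $m[2] ?? ''];
}

print_r($pairs);
```

Because `(.*)` is greedy here, it runs up to the last " is" on the line, so this shorter pattern is only safe while each clause contains " is" exactly once.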

  • I knew that before I posted it. –  Apr 11 '12 at 22:09
  • lol just wanted to thank you with words, because I don't have the reputation to vote up and already accepted another answer –  Apr 12 '12 at 04:44