8

With PCRE regular expressions in PHP, multi-line mode (/m) enables ^ and $ to match the start and end of lines (separated by newlines) in the source text, as well as the start and end of the source text.

This appears to work great on Linux with \n (LF) being the newline separator, but fails on Windows with \r\n (CRLF).

Is there any way to change what PCRE thinks are newlines? Or to perhaps allow it to match either CRLF or LF in the same way that $ matches the end of line/string?

EXAMPLE:

$EOL = "\n";    // Linux LF
$SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
if (preg_match('/^two$/m',$SOURCE_TEXT)) {
    echo 'Found match.';    // <<< RESULT
} else {
    echo 'Did not find match!';
}

RESULT: Success

$EOL = "\r\n";    // Windows CR+LF
$SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
if (preg_match('/^two$/m',$SOURCE_TEXT)) {
    echo 'Found match.';
} else {
    echo 'Did not find match!';    // <<< RESULT
}

RESULT: Fail

MrWhite

4 Answers

9

Did you try the (*CRLF) and related modifiers? They are detailed on Wikipedia (under "Newline/linebreak options") and seem to do the right thing in my testing: '/(*CRLF)^two$/m' should match the Windows \r\n newlines. (*ANYCRLF) should match both Linux and Windows newlines, but I haven't tested this.
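
For anyone who wants to try this quickly, here is a minimal sketch of both modifiers. It assumes PCRE 7.3 or later and a PCRE build whose default newline is LF; the variable names are just for illustration.

```php
<?php
// Sample text with Windows (CRLF) line breaks.
$source = "one\r\ntwo\r\nthree\r\nfour";

// Assumption: the default newline convention of the PCRE build is LF only,
// in which case "$" cannot match in front of the "\r" and this finds nothing.
$default = preg_match('/^two$/m', $source);

// (*CRLF) must come first in the pattern; "\r\n" is then the line break.
$crlf = preg_match('/(*CRLF)^two$/m', $source);

// (*ANYCRLF) treats "\r", "\n" and "\r\n" all as line breaks.
$anyCrlf = preg_match('/(*ANYCRLF)^two$/m', $source);
$anyCrlfOnLf = preg_match('/(*ANYCRLF)^two$/m', "one\ntwo\nthree\nfour");

var_dump($default, $crlf, $anyCrlf, $anyCrlfOnLf);
```

Note that these settings apply to the whole pattern, so they also change where ^ matches, not only $.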

Ben Holland
    Yes, this works for me too (including `(*ANYCRLF)`) when specified at the start of the pattern. Note that these modifiers are available since PCRE 7.3, which [corresponds to PHP 5.2.5](http://www.php.net/manual/en/pcre.installation.php). – MrWhite Jun 25 '12 at 11:08
5

Note: This answer applies only to older PHP versions. When I wrote it, I was not aware of the sequences and modifiers that are available: \R, (*BSR_ANYCRLF) and (*BSR_UNICODE). See also the answer to: How to replace different newline styles in PHP the smartest way?

In PHP it's not possible to specify the newline character sequence(s) for PCRE regex patterns. The m modifier looks for \n only; that's documented. There is also no runtime setting available to change this, as there would be in Perl, so that's not an option with PHP.

I normally just modify the string before using it with preg_match and the like:

$subject = str_replace("\r\n", "\n", $subject);

This might not be exactly what you're looking for, but it will probably help.
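
If the input might also contain lone \r line breaks, the same idea can be extended a little. This is only a sketch: normalize_newlines is a hypothetical helper name, not something PHP provides.

```php
<?php
// Hypothetical helper (not a PHP built-in): convert CRLF and lone CR to LF.
function normalize_newlines($text)
{
    // str_replace handles the search array in order, so "\r\n" is replaced
    // before the lone "\r" pass and no line break is doubled.
    return str_replace(array("\r\n", "\r"), "\n", $text);
}

$subject = normalize_newlines("one\r\ntwo\rthree\nfour");
$found = preg_match('/^two$/m', $subject);

var_dump($subject === "one\ntwo\nthree\nfour", $found);
```

After normalizing, the unmodified /m pattern from the question can be used as-is.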

Edit: Regarding the windows EOL example you've added to your question:

$EOL = "\r\n";    // Windows CR+LF
$SOURCE_TEXT = "one{$EOL}two{$EOL}three{$EOL}four";
if (preg_match('/^two$/m',$SOURCE_TEXT)) {
    echo 'Found match.';
} else {
    echo 'Did not find match!';    // <<< RESULT
}

This fails because in the text there is a \r after two. So two is not at the end of a line: there is an additional character, \r, before the end of the line ($).

The PHP manual clearly explains that only \n is considered a line-ending character. $ considers \n only, so if you're looking for two\r at the end of a line, you need to change your pattern. That's the other option (instead of converting the text as suggested above).
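
One way to change the pattern, as a rough sketch: allow an optional \r in front of $ and capture the word separately, so the carriage return does not end up in the part you care about. The capture group is my addition for illustration.

```php
<?php
$windows = "one\r\ntwo\r\nthree\r\nfour";
$linux   = "one\ntwo\nthree\nfour";

// "\r?" permits an optional carriage return right before the end of the line.
$pattern = '/^(two)\r?$/m';

$foundWindows = preg_match($pattern, $windows, $mWin);
$foundLinux   = preg_match($pattern, $linux, $mLin);

// $mWin[0] still contains the trailing "\r"; group 1 holds just the word.
var_dump($foundWindows, $foundLinux, $mWin[1], $mLin[1]);
```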

hakre
  • I don't think the documentation is particularly clear on this; all it states is: "If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect." – MrWhite Jul 25 '11 at 11:17
  • It clearly states that `\n` is the (only) newline-character-sequence reflected by the `m` modifier. – hakre Jul 25 '11 at 11:57
  • I added some explanation to the answer as well for the code you've added. – hakre Jul 25 '11 at 12:04
  • Yes, this certainly seems to be the case; thanks for the clarification. However, I don't believe this is clearly stated in the documentation, unless you have another source? The link you gave to the [Pattern Modifiers](http://php.net/manual/en/reference.pcre.pattern.modifiers.php), from which I quoted above does not _clearly_ state this, although it could perhaps be construed as loosely inferring this. – MrWhite Jul 25 '11 at 14:00
  • @w3d: Well I don't understand what to argue about. It specifies which character is treated as line ending for the `m` modifier. You even quoted that. As `\n` is the only character that makes a difference, how can you think another character than `\n` would make a difference as well? Why should the documentation only list a subset, not the superset? It *could* be, but that would be speculation. From what I read there I would not expect `\r\n` to be treated as line-ending in multiline mode. Especially as it's written that `\n` is treated as line-ending. – hakre Jul 25 '11 at 14:33
3

That's strange, I didn't think that $ (with the m modifier) cares whether the newline is \n or \r\n.

An idea to test this: add \s* before the $. \s also matches newline characters, so it should match the \r before the \n if that really is the problem.
As long as it's no problem if there is additional whitespace at the end of the line, it shouldn't hurt.
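
A small sketch of what \s* does in practice, next to a stricter variant. The \h*\r? alternative is my own suggestion, not part of the idea above, and it assumes a PCRE version that supports \h (7.2 or later).

```php
<?php
// "two" is followed by two spaces, a CRLF, and then an empty line.
$source = "one\r\ntwo  \r\n\r\nthree";

// \s* takes the spaces and the "\r", but since \s also matches "\n"
// the match can extend past the end of the line into the empty one.
preg_match('/^two\s*$/m', $source, $loose);

// \h* (horizontal whitespace only) plus an optional "\r" stays on one line.
preg_match('/^two\h*\r?$/m', $source, $strict);

var_dump($loose[0], $strict[0]);
```

So \s* is fine for a yes/no test, but if you use the matched text afterwards you may want the stricter form.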

stema
  • Well I wouldn't have thought it should matter either, but my regex appears to fail matching the eol `\r\n`. The start of the line matches OK. I had assumed this was because the end character was `\r` and not `\n`? I've added an example to my question which appears to show this. Thanks for the `\s*` suggestion - that does indeed appear to resolve the issue. – MrWhite Jul 25 '11 at 10:35
0

It all depends on where your data comes from: external and uncontrolled sources might provide quite messy data. Here is a hint for those of you who are trying to fix (or at least work around) the problem of matching a pattern correctly at the end ($) of any line in multiline mode (/m).

<?php 
// Different operating systems use different end-of-line (a.k.a. line break) characters:
// - Windows uses CR+LF (\r\n);
// - Linux and OS X use LF (\n);
// - classic Mac OS (before OS X) used CR (\r).
// And that's why the single dollar assertion ($) sometimes fails in multiline (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
//          C          3                   p          0                   _
$pat1='/\w$/mi';    // This works excellent in JavaScript (Firefox 7.0.1+)
$pat2='/\w\r?$/mi'; // Slightly better
$pat3='/\w\R?$/mi'; // Somehow disappointing according to php.net and pcre.org when used improperly
$pat4='/\w(?=\R)/i';    // Much better with allowed lookahead assertion (just to detect without capture) without multiline (/m) mode; note that with alternative for end of string ((?=\R|$)) it would grab all 7 elements as expected
$pat5='/\w\v?$/mi';
$pat6='/(*ANYCRLF)\w$/mi';  // Excellent but undocumented on php.net at the moment (described on pcre.org and en.wikipedia.org)
$n=preg_match_all($pat1, $str, $m1);
$o=preg_match_all($pat2, $str, $m2);
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat4, $str, $m4);
$s=preg_match_all($pat5, $str, $m5);
$t=preg_match_all($pat6, $str, $m6);
echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
    ."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
    ."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
    ."\n4 !!! $pat4 ($r): ".print_r($m4[0], true)
    ."\n5 !!! $pat5 ($s): ".print_r($m5[0], true)
    ."\n6 !!! $pat6 ($t): ".print_r($m6[0], true);
// Note the difference among the three very helpful escape sequences in $pat2 (\r), $pat3 and $pat4 (\R), $pat5 (\v) and altered newline option in $pat6 ((*ANYCRLF)) - for some applications at least.

/* The code above results in the following output:
ABC ABC

123 123
def def
nop nop
890 890
QRS QRS

~-_ ~-_
1 !!! /\w$/mi (3): Array
(
    [0] => C
    [1] => 0
    [2] => _
)

2 !!! /\w\r?$/mi (5): Array
(
    [0] => C
    [1] => 3
    [2] => p
    [3] => 0
    [4] => _
)

3 !!! /\w\R?$/mi (5): Array
(
    [0] => C

    [1] => 3
    [2] => p
    [3] => 0
    [4] => _
)

4 !!! /\w(?=\R)/i (6): Array
(
    [0] => C
    [1] => 3
    [2] => f
    [3] => p
    [4] => 0
    [5] => S
)

5 !!! /\w\v?$/mi (5): Array
(
    [0] => C

    [1] => 3
    [2] => p
    [3] => 0
    [4] => _
)

6 !!! /(*ANYCRLF)\w$/mi (7): Array
(
    [0] => C
    [1] => 3
    [2] => f
    [3] => p
    [4] => 0
    [5] => S
    [6] => _
)
 */
?>
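
The comment on $pat4 mentions an alternative with end of string in the lookahead. Here is that variant on its own as a small sketch, using the same test string; the variable names $pat7, $count and $m7 are only for this illustration.

```php
<?php
// Same mixed-newline test string as in the listing above.
$str = "ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";

// A word character followed by any line break (\R) or by the end of the string.
// No /m here: "$" only has to cover the very last line.
$pat7 = '/\w(?=\R|$)/i';

$count = preg_match_all($pat7, $str, $m7);

var_dump($count);
print_r($m7[0]);
```

Because the line break is only looked at and not consumed, the matches themselves contain no \r or \n.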

Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.

Wirek