3

I have this code to process a config file in Windows:

<?php
$config = '[log]
log_writers[] = "file"
log_writers[] = "screen"

[General]
maintenance_mode = 0
enable_browser_archiving_triggering = 0
enable_sql_optimize_queries = 0
force_ssl = 1';

echo preg_match_all( '/^maintenance_mode[ \t]*=[ \t]*\d$/m', $config );

The echo displays 0

https://onlinephp.io/c/51407

Updating the regex to:

echo preg_match_all( '/^maintenance_mode[ \t]*=[ \t]*\d\s$/m', $config );

results in the expected 1

WHY??


I even verified my sanity in regex101

https://regex101.com/r/CIxCkN/1


Local test environments:

RHEL 7
PHP 5.6.25
PCRE v8.32 2012-11-30

and

Windows Server 2022
PHP 8.2.7
PCRE v10.40 2022-04-14


Per comment request:

var_dump(base64_encode($config));

string(240) "W2xvZ10NCmxvZ193cml0ZXJzW10gPSAiZmlsZSINCmxvZ193cml0ZXJzW10gPSAic2NyZWVuIg0KDQpbR2VuZXJhbF0NCm1haW50ZW5hbmNlX21vZGUgPSAwDQplbmFibGVfYnJvd3Nlcl9hcmNoaXZpbmdfdHJpZ2dlcmluZyA9IDANCmVuYWJsZV9zcWxfb3B0aW1pemVfcXVlcmllcyA9IDANCmZvcmNlX3NzbCA9IDE="

var_dump(bin2hex($config));

string(358) "5b6c6f675d0d0a6c6f675f777269746572735b5d203d202266696c65220d0a6c6f675f777269746572735b5d203d202273637265656e220d0a0d0a5b47656e6572616c5d0d0a6d61696e74656e616e63655f6d6f6465203d20300d0a656e61626c655f62726f777365725f617263686976696e675f74726967676572696e67203d20300d0a656e61626c655f73716c5f6f7074696d697a655f71756572696573203d20300d0a666f7263655f73736c203d2031"

MonkeyZeus
  • Why did you add `\s` at the end of the second regex? There is no start of a line after the whitespace. See [this demo](https://regex101.com/r/CIxCkN/2). – Wiktor Stribiżew Aug 24 '23 at 12:46
  • In the second, regex uses `\s` to match a single whitespace character or new line. In the first one, the character after the `0` in `maintenance_mode = 0` is a newline (`\n`), not the end of the string. – Abdulla Nilam Aug 24 '23 at 12:49
  • @WiktorStribiżew The first regex should have worked. The fact that adding `\s` makes it work baffles me. – MonkeyZeus Aug 24 '23 at 12:52
  • why this question is closed, at least without actually answering to the issue? – OMi Shah Aug 24 '23 at 12:52
  • Something weird, it returns 1 https://3v4l.org/Oglbp while your demo returns 0 https://onlinephp.io/c/51407 for the same code. – OMi Shah Aug 24 '23 at 12:57
  • @AbdullaNilam You seem to be confusing end-of-line `$` with end-of-string `\Z` meta escape. – MonkeyZeus Aug 24 '23 at 13:05
  • That is a bug on their side. [Ideone PHP tests are correct](https://ideone.com/xVSOgy). Maybe a version-specific issue. – Wiktor Stribiżew Aug 24 '23 at 13:05
  • @WiktorStribiżew I've added the test environments I have access to. I cannot make heads or tails of the failure point. – MonkeyZeus Aug 24 '23 at 13:19
  • @OMiShah Your findings are interesting. I added my test environment details and cannot figure out which is the failure point =/ – MonkeyZeus Aug 24 '23 at 13:28
  • to really get to the bottom of this, show us the base64 or hex encoded version of your config file, `var_dump(base64_encode($config));` – hanshenrik Aug 24 '23 at 14:10
  • voted to close as `needs debugging details` - we need `base64_encode($config)` or `bin2hex($config)` to really see why this is happening - all we can do otherwise is speculate (as @iainn does below.) – hanshenrik Aug 24 '23 at 14:12
  • @hanshenrik Added. You should consider allowing more than .25 seconds before VTC when requesting clarification on a question which was already VTC'd and re-opened by experts. Thanks. – MonkeyZeus Aug 24 '23 at 14:29

3 Answers

4

One answer would be that your string (or script generally) has Windows line-endings.

In multi-line mode, \d$ only matches a digit that is immediately followed by a newline (as determined by PCRE's compile-time setting) or by the end of the string, so it fails when a \r is hiding in between.

Adding \s at the end of your regex lets it consume that stray \r, so $ can then match just before the \n. That explains why it helps in your affected test environments.

For a fix (other than the \s addition you've already found), PCRE lets you change which characters are treated as a newline with a special sequence at the very start of the pattern, e.g. (*ANYCRLF):

<?php
// Force Windows line endings
$test = "foo\r\nbar";

var_dump(preg_match_all('/^foo$/m', $test));
var_dump(preg_match_all('/(*ANYCRLF)^foo$/m', $test));

int(0)
int(1)

See https://3v4l.org/vOUgM for a demo, and the Newline Conventions section of the PCRE docs for some detail.

Or alternatively, just use the newline character(s) in your string that PCRE is expecting locally.
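A minimal sketch of that approach, assuming the local PCRE build uses the LF-only default (as the asker's environments appear to). The sample string is a trimmed, hypothetical stand-in for the real config, and the question's original regex is reused unchanged:

```php
<?php
// Trimmed sample input with Windows (CRLF) line endings.
$config = "[General]\r\nmaintenance_mode = 0\r\nforce_ssl = 1";

// Normalize CRLF first, then any lone CR, to a plain LF.
$normalized = str_replace(array("\r\n", "\r"), "\n", $config);

// The original pattern from the question, unchanged.
$count = preg_match_all('/^maintenance_mode[ \t]*=[ \t]*\d$/m', $normalized);

var_dump($count);
```

If the string is written back out afterwards, the line endings will have changed, which may or may not matter for your use case.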


And more generally, if you're actually trying to parse the string/file in your question then a combination of array_key_exists and parse_ini_string/parse_ini_file will make everything a lot cleaner.
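Something along these lines, as a rough sketch rather than a drop-in solution. The sample string is trimmed from the question and the variable names are just for illustration:

```php
<?php
// Sample input with Windows (CRLF) line endings, trimmed from the question.
$config = "[log]\r\nlog_writers[] = \"file\"\r\n\r\n[General]\r\nmaintenance_mode = 0\r\nforce_ssl = 1";

// Passing true as the second argument keeps the [section] grouping.
$ini = parse_ini_string($config, true);

// parse_ini_string() returns false on a syntax error, so check before indexing.
$hasKey = is_array($ini)
    && isset($ini['General'])
    && array_key_exists('maintenance_mode', $ini['General']);

// Values are returned as strings in the default scanner mode, so cast before comparing.
$maintenance = $hasKey ? (int) $ini['General']['maintenance_mode'] : null;

var_dump($hasKey, $maintenance);
```

This sidesteps the line-ending question entirely, since the INI parser deals with the line breaks itself.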

iainn
  • Absolutely impressive and appreciated insight! It's wild to learn that `(*ANYCRLF)` isn't default behavior but I don't think it's feasible to "make the input linebreaks match what PCRE is expecting" given all the possible sources of input. It's much more feasible to bend regex to accept these nuances than expect an end-user to be concerned about my regex shortcomings. I will see if `parse_ini_string()` is viable in my situation. – MonkeyZeus Aug 24 '23 at 14:44
  • Does this mean that `.*` will capture the `\r` when it's present? – MonkeyZeus Aug 24 '23 at 16:55
  • amazing, sir. Something good to learn and remember, in case one happens to face such issue. – OMi Shah Aug 24 '23 at 17:17
  • he posted bin2hex($config) and it confirms that he's indeed dealing with windows-style newlines `\r\n` :) – hanshenrik Aug 25 '23 at 09:16
  • Verified, `.*` does match the `\r`. Albeit likely benign, it *could* have implications when using `preg_replace()` which would cause a conversion from `\r\n` to just `\n`. – MonkeyZeus Aug 25 '23 at 12:30
  • Also, apparently this is documented, see my [answer](https://stackoverflow.com/a/76977165/2191572) =) – MonkeyZeus Aug 25 '23 at 12:53
0

Your config file does indeed have Windows newlines (\r\n): the first part of your bin2hex output, 5b6c6f675d0d0a, translates to [log]\r\n, which means @iainn's hunch is correct :)

Still, I would have written that regex as

'/^maintenance_mode\s*=\s*(\d)\s*$/m'

It's just more robust that way: it then doesn't matter whether you write it as

maintenance_mode=5

or

maintenance_mode =5

or

maintenance_mode= 5

or

maintenance_mode = 5

, and it doesn't matter whether you use spaces or tabs, or what your line endings are.
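A quick way to check that claim, as a sketch: the surrounding lines and the `$variants` list are made up for the test, and each variant is wrapped in CRLF-terminated lines like the config in the question.

```php
<?php
$pattern = '/^maintenance_mode\s*=\s*(\d)\s*$/m';

// Different spacing around "=", including tabs.
$variants = array(
    "maintenance_mode=5",
    "maintenance_mode =5",
    "maintenance_mode= 5",
    "maintenance_mode\t=\t5",
);

$results = array();
foreach ($variants as $line) {
    // Embed the variant between CRLF-terminated lines.
    $subject = "[General]\r\n" . $line . "\r\nforce_ssl = 1";
    // Store the captured digit, or null if the pattern did not match.
    $results[] = preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

var_dump($results);
```

Keep in mind that \s also matches \r and \n, so this pattern would accept a key and value split across lines as well.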

hanshenrik
  • If a newline separates a key=>value pair then is it still valid config/ini syntax? `\s*=\s*` seems too loose. – MonkeyZeus Aug 25 '23 at 12:27
  • @MonkeyZeus you're right, it is too loose, but the alternative would be like `[\ \t\r]*` or something, can't be arsed – hanshenrik Aug 25 '23 at 12:47
  • I guess I fail to see the downside of simply `[ \t]*`. The .00005% chance of it failing would be caused by intentional input malformation and not something achieved via keyboard strokes. – MonkeyZeus Aug 25 '23 at 12:52
0

Self-answering for complete clarification, as a follow-up to iainn's excellent answer.

Per https://www.regular-expressions.info/anchors.html

For anchors there's an additional consideration when CR and LF occur as a pair and the regex flavor treats both these characters as line breaks. Delphi, Java, and the JGsoft flavor treat CRLF as an indivisible pair. ^ matches after CRLF and $ matches before CRLF, but neither match in the middle of a CRLF pair. JavaScript and XPath treat CRLF pairs as two line breaks. ^ matches in the middle of and after CRLF, while $ matches before and in the middle of CRLF.

So this means that the PCRE builds on my systems use the default newline convention, and so do the ones at https://onlinephp.io/c/51407 and https://3v4l.org/vOUgM

However, https://ideone.com/xVSOgy is either compiled differently or the input is being converted CRLF -> LF before execution.
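To take the compile-time default out of the equation, the newline convention can be stated explicitly at the start of the pattern. A small sketch comparing three of the conventions PCRE supports on the same CRLF input (the test string and loop are mine, not taken from any of the linked demos):

```php
<?php
$subject = "foo\r\nbar";

// Newline conventions to compare; each goes at the very start of the pattern.
$conventions = array('(*LF)', '(*CRLF)', '(*ANYCRLF)');

$counts = array();
foreach ($conventions as $nl) {
    // e.g. '/(*LF)^foo$/m'
    $counts[$nl] = preg_match_all('/' . $nl . '^foo$/m', $subject);
}

var_dump($counts);
```

With (*LF) the $ cannot match before the \r, so no match is found, while (*CRLF) and (*ANYCRLF) both treat the \r\n pair as a line break and match once.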

MonkeyZeus