I have a regex designed to detect plausible Base64 strings. It works in tests at https://regex101.com for all expected test values.

~^((?:[a-zA-Z0-9/+]{4})*(?:(?:[a-zA-Z0-9/+]{3}=)|(?:[a-zA-Z0-9/+]{2}==))?)$~

However, when I use this pattern in PHP, I find some values inexplicably fail.

$tests = array(
    'MFpGQkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
    'MFpGRkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
    'MFpGSkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
);

foreach ($tests as $str) {
    $result = preg_match(
        '~^((?:[a-zA-Z0-9/+]{4})*(?:(?:[a-zA-Z0-9/+]{3}=)|(?:[a-zA-Z0-9/+]{2}==))?)$~i',
        preg_replace('~[\s\R]~u', "", $str)
    );

    var_dump($result);
}

Results:

int(1)
int(0)
int(1)

Question: Why does this pattern fail for the second test string?

Umbrella
  • Maybe you don't need RegEx for this :) there is a good answer on http://stackoverflow.com/questions/2556345/detect-base64-encoding-in-php (second answer) – Marc Mar 18 '15 at 20:01
  • what's the purpose of that `preg_replace`? – axblount Mar 18 '15 at 20:01
  • `preg_replace` to eliminate the whitespace (newlines) so common in blocks of base64. – Umbrella Mar 18 '15 at 20:25
  • @Marc, I'm actually going to do a sanity check after conditionally decoding, similar to that, but I want to pre-test also. It's critical we don't get a false positive. – Umbrella Mar 18 '15 at 20:30

1 Answer

The problem is in your `preg_replace` call:

preg_replace('~[\s\R]~u', "", $str)

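Inside a character class, `\R` is not treated as a line-break escape; it matches a literal `R`. The call therefore removes the `R` from the second element of the array, leaving 43 characters instead of 44, which causes `preg_match` to fail.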
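
The effect can be reproduced with a minimal sketch. Since PCRE versions can differ in how they treat `\R` inside a class, the snippet spells the literal out as `[\sR]`; that substitution is an assumption standing in for the behavior described above, not the asker's exact pattern.

```php
<?php
// Second test string from the question; it contains one uppercase "R" ("MFpGRkVB...").
$str = 'MFpGRkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=';

// Assumption: [\sR] mimics [\s\R] being read as "whitespace or a literal R".
$stripped = preg_replace('~[\sR]~u', '', $str);

// Validation pattern from the question.
$pattern = '~^((?:[a-zA-Z0-9/+]{4})*(?:(?:[a-zA-Z0-9/+]{3}=)|(?:[a-zA-Z0-9/+]{2}==))?)$~i';

var_dump(strlen($str));                  // int(44)
var_dump(strlen($stripped));             // int(43): the "R" is gone
var_dump(preg_match($pattern, $str));      // int(1)
var_dump(preg_match($pattern, $stripped)); // int(0): 43 is not a valid Base64 length
```

The other two test strings contain no uppercase `R`, which is why only the second one was affected.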

Change it to:

preg_replace('~\s|\R~u', "", $str)

Since `\s` already matches the line-break characters that `\R` covers, you can simplify it to:

preg_replace('~\s+~u', "", $str)
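
Putting the fix together with the validation pattern from the question, a pre-check might look like the sketch below. The function name `isPlausibleBase64` is hypothetical, and the scalar type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper; the pattern is the one from the question (redundant groups dropped).
function isPlausibleBase64(string $input): bool
{
    // Strip all whitespace, including line breaks, before validating.
    $clean = preg_replace('~\s+~u', '', $input);
    if ($clean === null) {
        return false; // preg_replace failed, e.g. on invalid UTF-8 input
    }

    return preg_match(
        '~^(?:[a-zA-Z0-9/+]{4})*(?:[a-zA-Z0-9/+]{3}=|[a-zA-Z0-9/+]{2}==)?$~',
        $clean
    ) === 1;
}

$tests = array(
    'MFpGQkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
    'MFpGRkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
    'MFpGSkVBJTNkJTNkfTxCUj4NCg0KICAgIDwvZm9udD4=',
    "MFpGRkVBJTNkJTNk\r\nfTxCUj4NCg0KICAgIDwvZm9udD4=", // wrapped across two lines
);

foreach ($tests as $str) {
    var_dump(isPlausibleBase64($str)); // bool(true) for each entry
}
```

Note that this only checks the shape of the string; it says nothing about whether the decoded bytes are meaningful.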
anubhava
  • Yes, that's it. I wasn't aware that `\R` would break down in a character class. Do you know why that is, or, which shorthands work and don't work in character classes? – Umbrella Mar 18 '15 at 20:27
  • [See here](http://perldoc.perl.org/perlrecharclass.html#Backslash-sequences) It says: **`\R` matches anything that can be considered a newline under Unicode rules. It's not a character class, as it can match a multi-character sequence. Therefore, it cannot be used inside a bracketed character class** – anubhava Mar 18 '15 at 21:27