17

How can I determine if a string contains non-printable characters/is likely binary data?

This is for unit testing/debugging -- it doesn't need to be exact.

mpen

9 Answers

17

This will have to do.

function isBinary($str) {
    return preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0;
}
mpen
  • 1
    Unfortunately, this doesn't work with non-English Western languages, as they include characters like ñ (Spanish), ö and ä (Swedish), è, ê and ç (French), and so on. – Ignacio Segura Sep 14 '16 at 17:19
  • @IgnacioSegura Good point. I think it might be better to explicitly define the set of control characters, and enable the `u` flag. – mpen Sep 14 '16 at 18:41
  • 4
    Just add this at the beginning of the function: if (mb_detect_encoding($str)) return false; – Ruslan Novikov Sep 21 '16 at 12:01
  • @IgnacioSegura It is (apparently) possible to match all characters of all languages; see https://stackoverflow.com/questions/15861088/regex-to-match-only-language-chars-all-language. – John Nov 22 '20 at 03:27
  • This code is inherently flawed as it only catches 62% of the cases in ASCII. It will NOT work with non-ASCII languages. This is, at best, **NON-PRODUCTION CODE**. Also, note, most people do not consider tab/return/linefeed to be printable characters. – Lloyd Sargent Dec 12 '20 at 21:14
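
A minimal sketch of the idea raised in the comments above: list the control characters explicitly and match in UTF-8 mode with the `u` flag, so accented and non-Latin text is not flagged. The name `hasControlChars` and the exact character set are assumptions, not part of the original answer. Invalid UTF-8 is treated as binary here, because `preg_match` returns `false` for such input when `u` is set.

```php
<?php
// Hypothetical helper, not from the original answer.
function hasControlChars(string $str): bool
{
    // C0 control characters except tab, LF and CR, plus DEL.
    $result = preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', $str);

    // 1     = a control character was found
    // 0     = valid UTF-8 with no control characters
    // false = the subject is not valid UTF-8 (treated as binary here)
    return $result !== 0;
}
```

With this sketch, "héllo wörld\n" is reported as text, while "abc\x00def" and the invalid sequence "\xff\xfe" are reported as binary.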
5

After a few attempts using ctype_ and various workarounds like removing whitespace chars and checking for empty, I decided I was going in the wrong direction. The following approach uses mb_detect_encoding (with the strict flag!) and considers a string as "binary" if the encoding cannot be detected.

So far I haven't found a non-binary string that returns true, and the binary strings that return false only do so if the binary happens to consist entirely of printable characters.

/**
 * Determine whether the given value is a binary string by checking to see if it has detectable character encoding.
 *
 * @param string $value
 *
 * @return bool
 */
function isBinary($value): bool
{
    return false === mb_detect_encoding((string)$value, null, true);
}
Harry Lewis
2

To search for non-printable characters, you can use ctype_print (http://php.net/manual/en/function.ctype-print.php).

GuiTeK
  • @MrTux: Well then combine it with a check for `ctype_space` … – CBroe Aug 16 '14 at 20:21
  • @CBroe *Can* it be combined? `ctype_print($x) || ctype_space($x)` won't work. They both check against the entire string. – mpen Aug 16 '14 at 20:23
  • There's a slight catch: `ctype_print()` only works reliably with ASCII strings. If you pass it a string that contains non-ASCII characters, it may return unexpected results. Non-ASCII characters include accented Latin characters such as á, as well as Greek, Chinese, etc. – Lucas Bustamante Aug 02 '23 at 00:10
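
One way the combination asked about in the comments above might be made to work is to strip the allowed whitespace first and then run `ctype_print` on what remains. This is only a sketch: the name `isPrintableAscii` is hypothetical, and it assumes the default "C" locale and the ctype extension. It remains ASCII-only, so it still flags accented text.

```php
<?php
// Hypothetical helper: remove tab/CR/LF, then check the remainder.
function isPrintableAscii(string $str): bool
{
    $rest = str_replace(["\t", "\r", "\n"], '', $str);

    // ctype_print('') returns false, so an empty remainder
    // (empty input or whitespace only) is handled explicitly.
    return $rest === '' || ctype_print($rest);
}
```

For example, "hello\tworld\r\n" passes, while "abc\x00" does not.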
2

From Symfony database debug tool:

if (!preg_match('//u', $params[$index])) // the string is binary

This detects whether a string contains byte sequences that are not valid UTF-8.

TimSparrow
  • What does that *not* match? It matches `"hello"` and `"\x00"` and empty strings and everything else I've tried. – mpen Dec 22 '20 at 22:38
  • It means that a string contains non-unicode characters. Worked for me - it detects 'text' files pulled from an external source that are not text, i.e. contains characters that cannot be entered into mysql text/longtext field. Original purpose: when a database query is exported for logging/debug, it displays "(binary data)" instead of original content, to keep logs readable. Perhaps "binary" is not clearly defined , so several incompatible solutions may exist. – TimSparrow Dec 23 '20 at 12:09
  • Could you give an example string that returns `0`? – mpen Dec 23 '20 at 23:42
  • @mpen http://tuobenessere.it/ads.txt. For obvious reasons, I cannot provide it quoted, the browser/formatter will remove the offending character. – TimSparrow Dec 24 '20 at 20:56
  • `preg_match('//u', hex2bin('a670c89d4a324e47'))` Ahah..well that returns `false`. I wonder what's special about that string. – mpen Dec 24 '20 at 22:54
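
The behaviour discussed in the comments above can be reproduced with a short script. The empty pattern matches any subject, so with the `u` modifier `preg_match` returns 1 for every valid UTF-8 string (including "\x00" and the empty string) and `false` when the subject is not valid UTF-8. The sample from the comments starts with byte 0xA6, which is a UTF-8 continuation byte and cannot begin a character. The array keys are illustrative names only.

```php
<?php
$samples = [
    'ascii'   => 'hello',
    'nul'     => "\x00",
    'empty'   => '',
    'invalid' => hex2bin('a670c89d4a324e47'), // sample from the comments
];

$results = [];
foreach ($samples as $name => $subject) {
    // 1 for valid UTF-8, false for invalid UTF-8.
    $results[$name] = preg_match('//u', $subject);
}

var_dump($results);

// The last call failed, so the error code says why.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```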
1

I have studied all answers to this question, and ended up with a different solution.

  • The accepted answer, preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0, flags non-ASCII characters as binary; this includes accented Latin letters, Chinese, Russian, Greek, Hebrew, Arabic, etc.
  • ctype_print has the same problem as the above.
  • strpos($string, "\0")===FALSE is almost good, but you can have binary data without null characters.
  • preg_match('//u', $params[$index]) is almost identical to the solution I ended up using, but it might throw a warning when dealing with binary data, eg: Compilation failed: invalid UTF-8 string at offset 1, although I haven't been able to replicate this warning.

Detecting whether a string is binary is inherently fuzzy, as no specification defines what is binary and what is not. There is no fixed set of control characters that we can look for.

What we can do is look for bytes that do not represent a meaningful character in any language.

With that in mind, the most efficient way seems to be to check for UTF-8 compliance on the string:

protected function isBinary(string $data): bool
{
    return ! mb_check_encoding($data, 'UTF-8');
}

I have written unit tests and it has correctly detected everything so far:

  • ASCII
  • Latin
  • Chinese
  • Greek
  • Hebrew
  • Russian
  • Arabic
  • Japanese

And correctly detected the binaries I used in the unit tests.
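
A small standalone sketch of the kind of check described above, assuming the mbstring extension is available. The sample strings are made up for illustration and are not the author's test data. Note one consequence of checking only UTF-8 validity: low control bytes such as "\x00" are valid UTF-8, so they are not flagged.

```php
<?php
// Free-function version of the method above, for standalone use.
function isBinary(string $data): bool
{
    return !mb_check_encoding($data, 'UTF-8');
}

$text    = "Ελληνικά, 中文, עברית, русский"; // valid multi-language UTF-8
$invalid = "\xC3\x28";     // 0xC3 must be followed by a continuation byte
$control = "\x00\x01\x02"; // control bytes, but still valid UTF-8

var_dump(isBinary($text));
var_dump(isBinary($invalid));
var_dump(isBinary($control));
```

Here only `$invalid` is classified as binary.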

3v4l

Lucas Bustamante
0

A hacky solution (which I have seen quite often) would be to search for NUL \0 chars.

if (strpos($string, "\0")===FALSE) echo "not binary";

A more sophisticated approach would be to check if the string contains valid unicode.
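
The two ideas above could be combined roughly as follows. This is a sketch under assumptions: the name `looksBinary` is hypothetical, and "valid unicode" is interpreted here as "valid UTF-8", checked with `preg_match('//u', ...)`.

```php
<?php
// Hypothetical helper combining both heuristics.
function looksBinary(string $string): bool
{
    // Heuristic 1: NUL bytes rarely appear in text.
    if (strpos($string, "\0") !== false) {
        return true;
    }

    // Heuristic 2: preg_match with the u modifier returns 1 only
    // when the subject is valid UTF-8.
    return preg_match('//u', $string) !== 1;
}
```

With this, "a\0b" is caught by the NUL check, and a JPEG-style header such as "\xff\xd8\xff\xe0" (no NUL byte, but invalid UTF-8) is caught by the second check.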

MrTux
  • 2
    That's not quite good enough. Many binary strings won't contain a NUL byte. – mpen Aug 16 '14 at 20:24
  • Yeah, but it's a good indicator. Just checking for unprintable chars (as tabs won't help you, too). – MrTux Aug 16 '14 at 21:24
0

I would use a simple ctype_print. It works for me:

public function is_binary(string $string): bool
{
    // ctype_print() is true only if every character is printable
    return !ctype_print($string);
}
-1

My assumption is that what the OP wants to do is the following:

$hex = hex2bin("0588196d706c65206865782064617461");
// how to determine if $hex is a BINARY string or a CHARACTER string?

Yeah, this is not possible. Let’s look at WHY:

$string = "1234";

In hex this would be 31323334. Guess what you get when you do the following?

hex2bin('31323334') == '1234'

You get true. But wait, you may be saying, I specified the BINARY and it should be the BINARY 0x31 0x32 0x33 0x34! Yeah, but PHP doesn’t know the difference. YOU know the difference, but how is PHP going to figure it out?
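
The point can be checked directly; the variable names below are illustrative only. Both values are the same four bytes, and nothing in the string itself records which way it was created.

```php
<?php
$fromHex = hex2bin('31323334'); // bytes 0x31 0x32 0x33 0x34
$literal = '1234';

var_dump($fromHex === $literal); // bool(true)
var_dump(bin2hex($literal));     // string(8) "31323334"
```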

If the idea is to test for non-printable because reasons, that’s quite different. But no amount of Regex voodoo will allow the code to magically know that YOU want to think of this as a string of binary.

Lloyd Sargent
  • Yeah.. I was and am aware of this fact, but it's good to point out for others :-) I think the probability of at least 1 non-printable char is pretty good for a long enough "binary" string though. Again, it was just for debugging so that I could auto-convert to hex or something instead of printing gibberish. – mpen Dec 10 '20 at 21:29
  • The odds are 0.578125% that a character will be non-printable. That probability remains true for each byte no matter the length. Worse, it fails with non-ASCII languages. My point was that this is bad practice and should NEVER be used for production code. I would mark your answer as such. – Lloyd Sargent Dec 11 '20 at 22:20
  • That doesn't sound right at all. My answer claims 158/255 chars as non-printable which is 62%. Given a randomly distributed 16-byte string, the odds are near 100% that isBinary will return true. Where did you come up with your figure? And again, this isn't "production" code, it's "something went wrong and I want to echo that value to the terminal so I can see what it was" code. – mpen Dec 12 '20 at 00:17
  • My bad. I combined hex x20 with decimal 128 XD … SO 32 characters (0x00-0x1F) + DEL = 33 unprintable characters in ASCII (tab/return/linefeed are seldom considered printable characters, but to each his own). Add 128 = 161 unprintable. 161/256 = 0.6289% chance it will be unprintable. No. The odds are NOT 100% for 16 characters. It’s better odds than Las Vegas, but people in Vegas still win. Your `isBinary` WILL fail. – Lloyd Sargent Dec 12 '20 at 21:10
  • 2
    62% not 0.62%, very different. And yes, over 99% with just 5 bytes. I used a calculator https://www.omnicalculator.com/statistics/probability – mpen Dec 13 '20 at 10:26
  • Yes, 62% for each character and no, it will fail. Anything other than 100% mean *it will fail at some point* — so it should **never** be used in production code. Betting on the odds means that you are not designing code, you are designing bets. – Lloyd Sargent Dec 13 '20 at 20:06
  • Well now we're getting into philosophy. What about every AI-powered thing in existence? Not one of those is perfect. Should they not be deployed into production? What about password hashes, should we not use those in prod, because there's a theoretical chance of collision? – mpen Dec 13 '20 at 20:39
  • It’s statistics. If you have an almost 38% chance of INCORRECTLY detecting a string, that’s not a very good test, **especially** in a test environment. As I indicated before, you aren’t designing good code, you are designing a BET — one that you or someone else will lose. – Lloyd Sargent Jan 13 '21 at 17:22
  • Flaky tests are bad, I agree. Again, I don't know how many times I have to re-iterate this is simply for `echo` debugging. Nothing more. No production code depends on it, and no test results depend on it. It can be wrong 100% of the time and nothing bad will happen. – mpen Jan 13 '21 at 19:23
  • My point has always been if this is for home stuff, I don’t really care. But people from the professional world also look at StackOverflow. I’m not trying to be pedantic I’m trying to point out **best practices**. If you wish to ignore them, fine. – Lloyd Sargent Jan 13 '21 at 20:30
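
The figures argued over in the comments above can be checked with a short script. It counts how many of the 256 possible byte values the regex from the first answer flags, then computes the chance that at least one byte in an n-byte string is flagged, which is 1 - (1 - p)^n. This assumes uniformly random, independent bytes, which real binary data need not be. The function name `chanceFlagged` is hypothetical.

```php
<?php
// Regex from the first answer.
$pattern = '~[^\x20-\x7E\t\r\n]~';

// Count the single-byte strings that the regex flags.
$flagged = 0;
for ($byte = 0; $byte < 256; $byte++) {
    $flagged += preg_match($pattern, chr($byte));
}

$p = $flagged / 256; // per-byte probability of being flagged

// Probability that at least one of $n random bytes is flagged.
function chanceFlagged(float $p, int $n): float
{
    return 1 - (1 - $p) ** $n;
}

printf("flagged byte values: %d of 256 (p = %.4f)\n", $flagged, $p);
printf("5 bytes:  %.4f\n", chanceFlagged($p, 5));
printf("16 bytes: %.8f\n", chanceFlagged($p, 16));
```

Under that assumption the regex flags 158 of the 256 byte values, so a 5-byte random string is flagged more than 99% of the time.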
-2

Try a regex replace: replace every character matching '[[:print:]]' with "". If the result is "", the string contains only printable characters; otherwise it contains non-printable characters as well.
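
A minimal sketch of that approach, assuming PCRE's POSIX class `[[:print:]]` in its default (non-UTF-8) mode, where it covers the ASCII range 0x20 to 0x7E. The function name `onlyPrintable` is hypothetical. Tabs and newlines are not in that class, so they count as non-printable here.

```php
<?php
// Hypothetical helper: delete every printable character and
// see whether anything is left over.
function onlyPrintable(string $str): bool
{
    return preg_replace('/[[:print:]]/', '', $str) === '';
}
```

For example, "Hello, world!" leaves nothing behind, while "tab\there" and "\x00" leave a remainder.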

TenG