How can I determine if a string contains non-printable characters/is likely binary data?
This is for unit testing/debugging -- it doesn't need to be exact.
This will have to do.
function isBinary($str) {
    return preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0;
}
After a few attempts using the ctype_ functions and various workarounds, such as removing whitespace characters and checking for an empty result, I decided I was going in the wrong direction. The following approach uses mb_detect_encoding (with the strict flag!) and considers a string "binary" if its encoding cannot be detected.
So far I haven't found a non-binary string that returns true, and the binary strings that return false only do so when the binary happens to consist entirely of printable characters.
/**
 * Determine whether the given value is a binary string by checking to see if it has detectable character encoding.
 *
 * @param string $value
 *
 * @return bool
 */
function isBinary($value): bool
{
    return false === mb_detect_encoding((string)$value, null, true);
}
To search for non-printable characters, you can use ctype_print
(http://php.net/manual/en/function.ctype-print.php).
From the Symfony database debug tool:
if (!preg_match('//u', $params[$index])) // the string is binary
Detect if a string contains non-Unicode characters.
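As a rough sketch, the same check can be wrapped in a function. The name looksBinaryByPcre is made up for illustration and is not part of Symfony. With the u modifier, PCRE validates the subject as UTF-8 before matching, so preg_match returns false for invalid input, while the empty pattern matches any valid string.

```php
<?php
// Hypothetical helper; only the '//u' check itself comes from the snippet above.
function looksBinaryByPcre(string $value): bool
{
    // preg_match() returns 1 on a match and false when the subject
    // is not valid UTF-8 (the /u modifier triggers that validation).
    return preg_match('//u', $value) !== 1;
}

var_dump(looksBinaryByPcre("h\xC3\xA9llo")); // "héllo" as UTF-8 bytes
var_dump(looksBinaryByPcre("\xFF\xFE"));     // not a valid UTF-8 sequence
```

After a failed call, preg_last_error() can be inspected if you need to tell a UTF-8 failure apart from other PCRE errors.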
I have studied all the answers to this question and ended up with a different solution.
preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0 flags non-ASCII characters as binary; this includes Latin accents, Chinese, Russian, Greek, Hebrew, Arabic, etc.
ctype_print has the same problem as the above.
strpos($string, "\0")===FALSE is almost good, but you can have binary data without null characters.
preg_match('//u', $params[$index]) is almost identical to the solution I ended up using, but it might throw a warning when dealing with binary data, e.g. "Compilation failed: invalid UTF-8 string at offset 1", although I haven't been able to replicate this warning.
Detecting whether a string is binary is fuzzy by nature, as there is no specification that defines what is binary and what is not. There are no control characters that we can look for.
What we can do is look for bytes that do not represent a meaningful character in any language.
With that in mind, the most efficient way seems to be to check for UTF-8 compliance on the string:
protected function isBinary(string $data): bool
{
    return ! mb_check_encoding($data, 'UTF-8');
}
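To see how this behaves on a few inputs, here is a small sketch. isBinaryUtf8 is a hypothetical standalone copy of the method above, the sample strings are my own, and the mbstring extension is assumed to be loaded. One caveat worth noticing: control bytes such as NUL are valid UTF-8, so a string made only of them is not flagged.

```php
<?php
// Hypothetical standalone version of the method above (requires mbstring).
function isBinaryUtf8(string $data): bool
{
    return !mb_check_encoding($data, 'UTF-8');
}

// Illustrative samples, not taken from the original answer.
$samples = [
    'plain ASCII'   => 'hello world',
    'accented text' => "caf\xC3\xA9",   // "café" encoded as UTF-8
    'invalid UTF-8' => "\xC3\x28",      // lead byte without a continuation byte
    'control bytes' => "\x00\x01\x02",  // valid UTF-8, so not flagged
];

foreach ($samples as $label => $bytes) {
    printf("%-14s %s\n", $label, isBinaryUtf8($bytes) ? 'binary' : 'text');
}
```

If low control bytes should also count as binary for your tests, this check would need to be combined with another heuristic.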
I have written unit tests, and so far it has correctly classified every text sample as well as the binaries I used in those tests.
A hacky solution (which I have seen quite often) would be to search for NUL (\0) characters.
if (strpos($string, "\0")===FALSE) echo "not binary";
A more sophisticated approach would be to check whether the string contains valid Unicode.
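The two ideas can also be combined. The following is only a sketch under my own assumptions: the function name looksBinary is invented, "valid Unicode" is interpreted as "valid UTF-8", and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper combining the NUL heuristic with a UTF-8 validity check.
function looksBinary(string $string): bool
{
    // A NUL byte almost never appears in ordinary text.
    if (strpos($string, "\0") !== false) {
        return true;
    }

    // Otherwise fall back to asking whether the bytes form valid UTF-8.
    return !mb_check_encoding($string, 'UTF-8');
}

var_dump(looksBinary('just text'));     // no NUL, valid UTF-8
var_dump(looksBinary("a\0b"));          // contains a NUL byte
var_dump(looksBinary("\xFF\xFE\xFD"));  // no NUL, but invalid UTF-8
```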
I would use a simple ctype_print. It works for me:
public function is_binary(string $string): bool
{
    // ctype_print() is true only when every character is printable (space included).
    return !ctype_print($string);
}
My assumption is that what the OP wants to do is the following:
$hex = hex2bin("0588196d706c65206865782064617461");
// how to determine if $hex is a BINARY string or a CHARACTER string?
Yeah, this is not possible. Let’s look at WHY:
$string = "1234";
In hexadecimal, the bytes of this string are 31323334. Guess what you get when you do the following?
hex2bin('31323334') == '1234'
You get true. But wait, you may be saying, I specified the BINARY and it should be the BINARY 0x31 0x32 0x33 0x34! Yeah, but PHP doesn’t know the difference. YOU know the difference, but how is PHP going to figure it out?
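This is easy to check for yourself. The short script below is just an illustration (the variable name $fromHex is mine): the "binary" value and the "character" value are the same four bytes, so nothing is left to tell them apart.

```php
<?php
// Build the value from its hexadecimal byte representation.
$fromHex = hex2bin('31323334');

// Identical type and identical bytes as the literal string.
var_dump($fromHex === '1234');  // bool(true)

// Going the other way gives back the same hex digits.
var_dump(bin2hex('1234'));      // string(8) "31323334"
```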
If the idea is to test for non-printable because reasons, that’s quite different. But no amount of Regex voodoo will allow the code to magically know that YOU want to think of this as a string of binary.
Try a regular-expression replace: replace every match of '[[:print:]]' with "". If the result is "", the string contains only printable characters; otherwise it contains non-printable characters as well.
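A minimal sketch of that replace-and-compare idea, assuming PCRE via preg_replace and the default "C" locale. The function name hasNonPrintable is made up. Note that in PCRE the POSIX class has to be written inside a bracket expression, i.e. [[:print:]], and that tabs and newlines are not in the print class, so multi-line text is flagged too.

```php
<?php
// Hypothetical helper: true if anything remains after stripping printable characters.
function hasNonPrintable(string $string): bool
{
    // Remove every printable ASCII character (space through tilde).
    $rest = preg_replace('/[[:print:]]/', '', $string);

    // Whatever is left over was not printable.
    return $rest !== '';
}

var_dump(hasNonPrintable('hello world')); // only printable characters
var_dump(hasNonPrintable("a\x00b"));      // NUL byte remains after the replace
var_dump(hasNonPrintable("line\n"));      // newline is not in [:print:]
```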