How to Check if File is ASCII or Binary in PHP

Question

Is there a quick, simple way to check if a file is ASCII or binary with PHP?

This has been asked before, but I always wonder: why do you care if it's ASCII or binary? – Pyrolistical, Mar 10 '09 at 23:28
Similar, but not a duplicate. This one has an easy, technical answer, whereas the supposedly identical question is quite a bit harder. There's a big difference between asking whether a file is in encoding X, or in any encoding at all. – Devin Jeanpierre, Mar 10 '09 at 23:32
Nope, read it again; those types were only examples. He's looking for the same thing: binary vs. text. – Pyrolistical, Mar 10 '09 at 23:37
It's not a duplicate, since that is a general question (more a theoretical question), and this is for a specific language (practical use). In any case, what I ended up doing is below. – davethegr8, Mar 11 '09 at 00:02
@Pyrolistical: to check whether uploaded.avi is actually something else, since checking the MIME type doesn't seem to work well enough. – Leo, Apr 08 '13 at 16:28
@davethegr8 I know this is a very old question, but is there any chance you would be willing to review your selected answer? – Brogan, Feb 21 '21 at 19:40

score 24 · Accepted Answer · edited Oct 10 '11 at 11:56

This only works for PHP>=5.3.0, and isn't 100% reliable, but hey, it's pretty darn close.

// return mime type ala mimetype extension
$finfo = finfo_open(FILEINFO_MIME);

//check to see if the mime-type starts with 'text'
return substr(finfo_file($finfo, $filename), 0, 4) == 'text';

http://us.php.net/manual/en/ref.fileinfo.php
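
As a minimal sketch, the same check can be wrapped in a function that also handles a failed `finfo_open()` or `finfo_file()` call. The name `mime_says_text` is hypothetical, and the sketch assumes the Fileinfo extension is enabled. It uses `FILEINFO_MIME_TYPE` so that only the type (for example `text/plain`) comes back, without the charset suffix.

```php
<?php
// Sketch only: mime_says_text() is a hypothetical helper name, not part of PHP.
// Returns true/false for the MIME guess, or null if no guess could be made.
function mime_says_text(string $filename): ?bool
{
    // FILEINFO_MIME_TYPE yields just the type, e.g. "text/plain"
    $finfo = finfo_open(FILEINFO_MIME_TYPE);
    if ($finfo === false) {
        return null; // the fileinfo database could not be opened
    }

    $mime = finfo_file($finfo, $filename);
    if ($mime === false) {
        return null; // the file could not be inspected
    }

    // Same idea as the substr() comparison: does the type start with "text/"?
    return strncmp($mime, 'text/', 5) === 0;
}
```

The result is still only libmagic's guess, so it inherits the "not 100% reliable" caveat above.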

edited Oct 10 '11 at 11:56

Shabbyrobe


answered Mar 11 '09 at 00:08

davethegr8


probably should check `if (!$finfo){ echo "Opening fileinfo database failed"; exit(); }` and don't forget to: `finfo_close($finfo);`... – Feb 21 '15 at 00:42
Would this not fail for application/javascript ? – Shivanand Sharma Oct 15 '19 at 16:13
This tells you if the file contains only printable characters, it does not tell you if the file is ASCII or binary. – Brogan Mar 26 '20 at 05:37

score 4 · Answer 2 · answered Mar 10 '09 at 23:27

Not really, since ASCII is just an encoding for text, and text itself has a binary representation. You could check that all bytes are less than 128, but even this wouldn't guarantee that the file was intended to be decoded as ASCII. For all you know it's some crazy image format, or an entirely different text encoding that also doesn't use all eight bits. It might suffice for your use, though. If you just want to check whether a file is valid ASCII, even if it's not a "text file", it will definitely suffice.
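
The "all bytes less than 128" check could be sketched as follows. `is_seven_bit` is a hypothetical name, and the function works on a string that is already in memory, so reading the file is left to the caller.

```php
<?php
// Sketch only: is_seven_bit() is a hypothetical helper name.
// True when every byte of $data is below 128 (0x80).
function is_seven_bit(string $data): bool
{
    for ($i = 0, $n = strlen($data); $i < $n; ++$i) {
        if (ord($data[$i]) > 127) {
            return false; // a byte that needs the eighth bit
        }
    }
    return true; // note: an empty string also passes
}
```

For example, `is_seven_bit("Hello\n")` is true, while `is_seven_bit("caf\xC3\xA9")` is false because the UTF-8 encoding of "é" uses bytes above 127.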

score 3 · Answer 3 · answered Jul 16 '13 at 15:22

You should probably check the file's mimetype, but if you're willing to load the file into memory, maybe you could check to see if the buffer consists of all-printable-characters using something like:

<?php
$probably_binary = (is_string($var) === true && ctype_print($var) === false);

Not perfect, but might be helpful in some cases.

answered Jul 16 '13 at 15:22

kvz


Tabs and carriage returns will make `ctype_print()` return FALSE, unfortunately. – dotancohen Nov 24 '13 at 14:34

MarcoA · Answer 4 · 2018-05-30T15:28:44.447

This approach seems to work fine in my project:

function probably_binary($stringa) {
    $is_binary=false;
    $stringa=str_ireplace("\t","",$stringa);
    $stringa=str_ireplace("\n","",$stringa);
    $stringa=str_ireplace("\r","",$stringa);
    if(is_string($stringa) && ctype_print($stringa) === false){
        $is_binary=true;
    }
    return $is_binary;
}

PS: sorry, my first post, I wanted to add a comment to previous one :)

Brogan · Answer 5 · 2021-10-14T16:44:08.420

In one of my older PHP projects I use ASCII / Binary compression. When the user uploads their file, they are required to specify if that file is ASCII or Binary. I decided to modify my code to have the server automatically decide what the file mode is, as relying on the user's decision could result in a failed compression. I decided my code has to be absolute, and not use tricks that would potentially cause my program to fail. I quickly whipped up some code, ran some speed tests and then decided to search the internet to see if there is a faster code example to complete this task.

Devin's rather vague answer describes the approach behind the first code I wrote for this task. The results were so-so. Searching byte by byte is often fast for binary files: as soon as you find a byte larger than 127, the rest of the file can be ignored and the whole file is considered binary. To conclude that a file is ASCII, however, you have to read every last byte. Binary files often finish faster because a byte above 127 usually appears well before the end of the file, sometimes as the very first byte.

<?php
$filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'ASCII',
    2 => 'Binary'
);

function filemode($filename) {
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            $handle = fopen($filename, 'rb');
            for($i = 0; $i < $size; ++$i) {
                $byte = fread($handle, 1);
                if(ord($byte) > 127) {
                    fclose($handle);
                    return 2; // Binary
                }
            }
            fclose($handle);
            return 1; // ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

// ==========

$filename = 'e:\test.txt';

$loops = 1;
$x = 0;
$i = 0;
$start = microtime(true);

for($i = 0; $i < $loops; ++$i)
    $x = filemode($filename);

$stop = microtime(true);
$duration = $stop - $start;

echo
    'Filename: ', $filename, "\n",
    'Filemode: ', $filemodes[filemode($filename)], "\n",
    'Duration: ', $duration;

My processor isn't exactly modern, but I found that a 600 KB ASCII file took about 0.25 seconds to check. If I were to use this on hundreds or thousands of large files, it might take a very long time. I decided to speed things up by making my buffer larger than a single byte, reading the file in chunks instead of one byte at a time. Chunks let me process more of the file at once without loading too much into memory. If a file is huge and we load all of it into memory, it could use far too much memory and cause the program to fail.

<?php
$filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'ASCII',
    2 => 'Binary'
);

function filemode($filename) {
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            $buffer_size = 256;
            $chunks = ceil($size / $buffer_size);
            $handle = fopen($filename, 'rb');
            for($chunk = 0; $chunk < $chunks; ++$chunk) {
                $buffer = fread($handle, $buffer_size);
                $buffer_length = strlen($buffer);
                for($byte = 0; $byte < $buffer_length; ++$byte) {
                    if(ord($buffer[$byte]) > 127) {
                        fclose($handle);
                        return 2; // Binary
                    }
                }
            }
            fclose($handle);
            return 1; // ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

// ==========

$filename = 'e:\test.txt';

$loops = 1;
$x = 0;
$i = 0;
$start = microtime(true);

for($i = 0; $i < $loops; ++$i)
    $x = filemode($filename);

$stop = microtime(true);
$duration = $stop - $start;

echo
    'Filename: ', $filename, "\n",
    'Filemode: ', $filemodes[filemode($filename)], "\n",
    'Duration: ', $duration;

The difference in speed was fairly significant: 0.15 seconds instead of the previous function's 0.25 seconds, about a tenth of a second faster on my 600 KB ASCII file.

Now that I have my file in chunks, I thought it would be a good idea to find alternative ways to test each chunk for binary characters. My first thought was to use a regular expression to find non-ASCII characters.

<?php
$filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'ASCII',
    2 => 'Binary'
);

function filemode($filename) {
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            $buffer_size = 256;
            $chunks = ceil($size / $buffer_size);
            $handle = fopen($filename, 'rb');
            for($chunk = 0; $chunk < $chunks; ++$chunk) {
                $buffer = fread($handle, $buffer_size);
                if(preg_match('/[\x80-\xFF]/', $buffer) === 1) {
                    fclose($handle);
                    return 2; // Binary
                }
            }
            fclose($handle);
            return 1; // ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

// ==========

$filename = 'e:\test.txt';

$loops = 1;
$x = 0;
$i = 0;
$start = microtime(true);

for($i = 0; $i < $loops; ++$i)
    $x = filemode($filename);

$stop = microtime(true);
$duration = $stop - $start;

echo
    'Filename: ', $filename, "\n",
    'Filemode: ', $filemodes[filemode($filename)], "\n",
    'Duration: ', $duration;

Amazing! 0.02 seconds to classify my 600 KB file as ASCII, and this code appears to be 100% reliable.

Now that I have arrived here, I have the opportunity to inspect several other methods deployed by other users.

The accepted answer, written by davethegr8, uses the Fileinfo extension (finfo_open and finfo_file). First, I was required to enable this extension in the php.ini file. Next, I tested this code against an actual ASCII file that has no file extension and a binary file that has no file extension.

Here is how I created my two test files.

<?php
$handle = fopen('E:\ASCII', 'wb');
for($i = 0; $i < 128; ++$i) {
    fwrite($handle, chr($i));
}
fclose($handle);

$handle = fopen('E:\Binary', 'wb');
for($i = 0; $i < 256; ++$i) {
    fwrite($handle, chr($i));
}
fclose($handle);

Here is how I tested both files...

<?php
$filename = 'E:\ASCII';
$finfo = finfo_open(FILEINFO_MIME);
echo (substr(finfo_file($finfo, $filename), 0, 4) == 'text') ? 'ASCII' : 'Binary';

Which outputs:

Binary

and...

<?php
$filename = 'E:\Binary';
$finfo = finfo_open(FILEINFO_MIME);
echo (substr(finfo_file($finfo, $filename), 0, 4) == 'text') ? 'ASCII' : 'Binary';

Which outputs:

Binary

This code reports both my ASCII and binary files as binary, which is obviously incorrect, so I had to find out what causes the mimetype to be "text". My guess was that "text" means printable ASCII characters only, so I limited the range of my ASCII file.

<?php
$handle = fopen('E:\ASCII', 'wb');
for($i = 32; $i < 127; ++$i) {
    fwrite($handle, chr($i));
}
fclose($handle);

And tested it again.

<?php
$filename = 'E:\ASCII';
$finfo = finfo_open(FILEINFO_MIME);
echo (substr(finfo_file($finfo, $filename), 0, 4) == 'text') ? 'ASCII' : 'Binary';

Which outputs:

ASCII

If I lower the lower bound of the range, the file is treated as binary. If I raise the upper bound, once again, it is treated as binary.

So the accepted answer does not tell you whether your file is ASCII, only whether it contains nothing but readable text.
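
To see why a particular file is classified the way it is, it can help to print the full string that Fileinfo reports instead of only comparing its first four characters. This is a small sketch; `describe_file` is a hypothetical name, and the exact strings returned depend on the libmagic database bundled with your PHP build.

```php
<?php
// Sketch only: describe_file() is a hypothetical helper name.
// Returns the raw string finfo reports (type plus charset with FILEINFO_MIME),
// or null when the file cannot be inspected.
function describe_file(string $filename): ?string
{
    $finfo = finfo_open(FILEINFO_MIME);
    if ($finfo === false) {
        return null; // the fileinfo database could not be opened
    }
    $info = finfo_file($finfo, $filename);
    return $info === false ? null : $info;
}
```

Calling `echo describe_file('E:\ASCII');` on each of the test files above shows the type that the `substr(...) == 'text'` comparison is actually looking at.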

Lastly, I need to test the other answer, which uses ctype_print, against my files. I decided the easiest way to do this was to take the code I made and substitute in MarcoA's code.

<?php
$filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'ASCII',
    2 => 'Binary'
);

function filemode($filename) {
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            $buffer_size = 256;
            $chunks = ceil($size / $buffer_size);
            $handle = fopen($filename, 'rb');
            for($chunk = 0; $chunk < $chunks; ++$chunk) {
                $buffer = fread($handle, $buffer_size);
                $buffer = str_ireplace("\t", '', $buffer);
                $buffer = str_ireplace("\n", '', $buffer);
                $buffer = str_ireplace("\r", '', $buffer);
                if(ctype_print($buffer) === false) {
                    fclose($handle);
                    return 2; // Binary
                }
            }
            fclose($handle);
            return 1; // ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

// ==========

$filename = 'e:\test.txt';

$loops = 1;
$x = 0;
$i = 0;
$start = microtime(true);

for($i = 0; $i < $loops; ++$i)
    $x = filemode($filename);

$stop = microtime(true);
$duration = $stop - $start;

echo
    'Filename: ', $filename, "\n",
    'Filemode: ', $filemodes[filemode($filename)], "\n",
    'Duration: ', $duration;

Ouch! 0.2 seconds to tell me that my 600 KB file is ASCII. I know my large ASCII file contains only visible ASCII characters. The function does recognize my binary files as binary. And my pure ASCII file... Binary!

I decided to read the documentation for ctype_print and its return value is defined as:

Returns TRUE if every character in text will actually create output (including blanks). Returns FALSE if text contains control characters or characters that do not have any output or control function at all.
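
A few sample calls make that definition concrete. This is a sketch that assumes the default "C" locale; other locales may classify bytes above 127 differently, so no result is claimed for that sample.

```php
<?php
// Assumes the default "C" locale; other locales may classify bytes differently.
$samples = array(
    'plain words'  => 'Hello, world',  // true: letters, comma and space all print
    'with tab'     => "Hello\tworld",  // false: tab is a control character
    'with newline' => "Hello\n",       // false: newline is a control character
    'empty'        => '',              // false: an empty string never passes
    'high byte'    => "caf\xE9",       // locale-dependent
);

foreach ($samples as $label => $text) {
    echo $label, ': ', var_export(ctype_print($text), true), "\n";
}
```

The tab and newline cases are why a plain text file with line breaks fails the test unless those characters are stripped first.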

This function, like davethegr8's answer, only tells you whether your text contains printable ASCII characters; it does not tell you whether your text is actually ASCII. That doesn't necessarily mean MarcoA is completely wrong, just not completely right. str_ireplace is slow compared to str_replace, and replacing only those three control characters before calling ctype_print isn't enough to know whether the string is ASCII. To make this example work for ASCII, we must strip every control character!

<?php
$filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'ASCII',
    2 => 'Binary'
);

function filemode($filename) {
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            $buffer_size = 256;
            $chunks = ceil($size / $buffer_size);
            $replace = array(
                "\x00", "\x01", "\x02", "\x03",
                "\x04", "\x05", "\x06", "\x07",
                "\x08", "\x09", "\x0A", "\x0B",
                "\x0C", "\x0D", "\x0E", "\x0F",
                "\x10", "\x11", "\x12", "\x13",
                "\x14", "\x15", "\x16", "\x17",
                "\x18", "\x19", "\x1A", "\x1B",
                "\x1C", "\x1D", "\x1E", "\x1F",
                "\x7F"
            );
            $handle = fopen($filename, 'rb');
            for($chunk = 0; $chunk < $chunks; ++$chunk) {
                $buffer = fread($handle, $buffer_size);
                $buffer = str_replace($replace, '', $buffer);
                if(ctype_print($buffer) === false) {
                    fclose($handle);
                    return 2; // Binary
                }
            }
            fclose($handle);
            return 1; // ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

This took 0.04 seconds to tell me that my 600 KB file is ASCII.

I don't believe all of this testing was useless, as it gave me one more idea: why not add a printable filemode to my original function? It does seem to be 0.018 seconds slower on my 600 KB printable ASCII file, but here it is.

<?php
$filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'Printable',
    2 => 'ASCII',
    3 => 'Binary'
);

function filemode($filename) {
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            $printable = true;
            $buffer_size = 256;
            $chunks = ceil($size / $buffer_size);
            $handle = fopen($filename, 'rb');
            for($chunk = 0; $chunk < $chunks; ++$chunk) {
                $buffer = fread($handle, $buffer_size);
                if(preg_match('/[\x80-\xFF]/', $buffer) === 1) {
                    fclose($handle);
                    return 3; // Binary
                }
                else
                    if($printable === true)
                        $printable = ctype_print($buffer);
            }
            fclose($handle);
            return $printable === true ? 1 : 2; // Printable or ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

// ==========

$filename = 'e:\test.txt';

$loops = 1;
$x = 0;
$i = 0;
$start = microtime(true);

for($i = 0; $i < $loops; ++$i)
    $x = filemode($filename);

$stop = microtime(true);
$duration = $stop - $start;

echo
    'Filename: ', $filename, "\n",
    'Filemode: ', $filemodes[filemode($filename)], "\n",
    'Duration: ', $duration;

I also tested ctype_print against a regular expression and found ctype_print to be a bit faster.

$printable = preg_match('/[^\x20-\x7E]/', $buffer) === 0;
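
A rough way to repeat that comparison is sketched below. All three function names are hypothetical, and the timings will vary with machine and PHP version, so no numbers are claimed here. One difference worth knowing: on an empty string `ctype_print()` returns false, while the regular expression version returns true because it finds no non-printable byte.

```php
<?php
// Sketch only: all three helper names are hypothetical.
function printable_ctype(string $buffer): bool
{
    return ctype_print($buffer);
}

function printable_regex(string $buffer): bool
{
    // True when no byte falls outside the printable range 0x20-0x7E
    return preg_match('/[^\x20-\x7E]/', $buffer) === 0;
}

// Rough timing harness: runs $check on $buffer $loops times.
function time_check(callable $check, string $buffer, int $loops): float
{
    $start = microtime(true);
    for ($i = 0; $i < $loops; ++$i) {
        $check($buffer);
    }
    return microtime(true) - $start;
}

$buffer = str_repeat('The quick brown fox. ', 12); // 252 printable bytes
echo 'ctype_print: ', time_check('printable_ctype', $buffer, 100000), "\n";
echo 'preg_match:  ', time_check('printable_regex', $buffer, 100000), "\n";
```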

Here is my final function where finding printable text is optional, as is the buffer size.

<?php
const filemodes = array(
    -2 => 'Unreadable',
    -1 => 'Missing',
    0 => 'Empty',
    1 => 'Printable',
    2 => 'ASCII',
    3 => 'Binary'
);

function filemode($filename, $printable = false, $buffer_size = 256) {
    // Reject bad argument types; $buffer_size is already an int here,
    // so no floor() is needed before the range check.
    if(is_bool($printable) === false || is_int($buffer_size) === false)
        return false;
    if($buffer_size <= 0)
        return false;
    if(is_file($filename)) {
        if(is_readable($filename)) {
            $size = filesize($filename);
            if($size === 0)
                return 0; // Empty
            if($buffer_size > $size)
                $buffer_size = $size;
            $chunks = ceil($size / $buffer_size);
            $handle = fopen($filename, 'rb');
            for($chunk = 0; $chunk < $chunks; ++$chunk) {
                $buffer = fread($handle, $buffer_size);
                if(preg_match('/[\x80-\xFF]/', $buffer) === 1) {
                    fclose($handle);
                    return 3; // Binary
                }
                else
                    if($printable === true)
                        $printable = ctype_print($buffer);
            }
            fclose($handle);
            return $printable === true ? 1 : 2; // Printable or ASCII
        }
        else
            return -2; // Unreadable
    }
    else
        return -1; // Missing
}

// ==========

$filename = 'e:\test.txt';
echo
    'Filename: ', $filename, "\n",
    'Filemode: ', filemodes[filemode($filename, true)], "\n";

Working on a malware scanner and I can't risk any false detections. These days malware hides in jpg and ico extensions too. I'm really hoping I could use some code from here to skip files that are certainly binary. Could this use file_get_contents? — Shivanand Sharma, Dec 21 '21 at 16:38
`Bio-Bäckerei Onder de Linden` A plain text file containing the above string is flagged as binary. — Shivanand Sharma, Dec 21 '21 at 17:08
Shivanand Sharma, that's because the above string is binary and requires all 8 bits, it's not plain text. — Brogan, Dec 22 '21 at 17:28
Thank you. So is there a way to differentiate between files containing such characters and the ones containing chars found in binary files? I think mb_check_encoding would do the job but mbstring extension is not installed on PHP by default. — Shivanand Sharma, Dec 26 '21 at 13:57
Shivanand, my entire post is devoted to solving this problem. I'm not sure why you're asking in the comments. — Brogan, Dec 27 '21 at 16:34
@ShivanandSharma For checking valid UTF-8 you can use `//u`. Brogan, a chunk size of `256` is smaller than any HDD block size (even a 25 years old one); `4096` or `8192` would be a more sensible default value. — Fravadona, Jul 03 '23 at 20:56
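
The `//u` suggestion from the last comment could be sketched like this. `is_valid_utf8` is a hypothetical name. The sketch checks a whole string at once, since fixed-size chunks could split a multibyte character and make a valid file look invalid.

```php
<?php
// Sketch only: is_valid_utf8() is a hypothetical helper name.
// An empty pattern with the /u modifier makes PCRE validate the subject
// as UTF-8; preg_match() returns false when that validation fails.
function is_valid_utf8(string $data): bool
{
    return preg_match('//u', $data) === 1;
}
```

With this, `is_valid_utf8("Bio-B\xC3\xA4ckerei Onder de Linden")` is true, while the same text saved as Latin-1 (`"Bio-B\xE4ckerei"`) is not valid UTF-8. Pure ASCII data, including control characters, also passes, so this answers "is it UTF-8?" rather than "is it text?".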
