
I have a website that works with text files that users upload. To make sure they are actually text files, I check the MIME type in PHP like this:

$finfo = finfo_open(FILEINFO_MIME_TYPE);
$mimeType = finfo_file($finfo, $filepath);
finfo_close($finfo);

This works fine most of the time. The problem is that some uploaded files contain a few control characters (non-printable characters such as NUL or STX). Getting the MIME type of these files always returns application/octet-stream. For example, I have a text file that is 560 lines long and contains a single NUL character on line 12, and it is therefore identified as application/octet-stream.

Is there any safe and reliable way to check whether an uploaded file is a text file when detecting the MIME type doesn't work?

Sjors Ottjes
  • Define "is text file". You mean even with a NUL byte in it it should be recognised as a text file? Then what *doesn't* qualify as a text file…? – deceze Apr 12 '17 at 10:32
  • For example, if it's a PDF, it will start with %PDF-, so you can read the first five bytes – Rotimi Apr 12 '17 at 10:32
  • Correct me if I'm wrong, but as far as I know, the MIME type at upload is mostly a clue (depending on the file extension, I guess) and does not guarantee that the content fits the declaration. You probably have to check the content after the upload, considering that "text file" is really generic (binary can be read as text too) – Kaddath Apr 12 '17 at 10:35
  • @Kaddath OP *is* "checking the content" using finfo. – deceze Apr 12 '17 at 10:37
  • @deceze I guess when the vast majority of it is valid text (more than 99%) I would like to process it as a text file, but I'm not sure if that is smart or safe to do – Sjors Ottjes Apr 12 '17 at 10:48
  • @deceze Sorry, I should always double-check the manual. As a note, the manual says that this function cannot always be trusted, as some code can be wrapped into what can be considered a valid MIME type – Kaddath Apr 12 '17 at 10:50
  • @Sjors What is a "text file" exactly…? It's a bunch of bytes which *when interpreted using the right encoding* represent letters which make some sort of sense to humans. Virtually any file fits this definition, except for the "makes sense to humans" part, and that's not really verifiable to a computer. – deceze Apr 12 '17 at 10:54

1 Answer

It turns out that most file-reading functions in PHP are binary safe, which more or less answers my question of how to safely read such a file.
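To illustrate what "binary safe" means in practice, here is a minimal sketch. The temporary file and the variable names are made up for the example. It writes a string containing a NUL byte and reads it back with fread(), and the bytes come back unchanged:

```php
<?php
// Hypothetical demo: write two lines, one containing a NUL byte, then read them back.
$path = tempnam(sys_get_temp_dir(), 'txt');
$original = "line one\nline\0two\n";
file_put_contents($path, $original);

$handle = fopen($path, 'rb');
$readBack = fread($handle, 4096);
fclose($handle);
unlink($path);

// PHP strings carry an explicit length, so the NUL byte does not truncate them.
var_dump($readBack === $original);   // bool(true)
var_dump(strlen($readBack));         // int(18)
```

So reading the file is not the risky part; deciding what to do with the bytes afterwards is.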

I ended up solving my problem by counting control characters. If any chunk of the file consists of more than 1% control characters, I assume it's not a text file.

The function below works for what I'm using it for (even though it only works with UTF-8 files):

public static function isTextFile($filepath)
{
    $finfo = finfo_open(FILEINFO_MIME_TYPE);
    $mimeType = finfo_file($finfo, $filepath);
    finfo_close($finfo);

    if(substr($mimeType, 0, 5) === "text/") {
        return true;
    }

    if($mimeType !== "application/octet-stream") {
        return false;
    }

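    $handle = fopen($filepath, 'rb');

    if ($handle === false) {
        // Unreadable file: treat it as not a text file.
        return false;
    }

    $isText = true;

    // Scan in 4 KiB chunks; fread() is binary safe, so NUL bytes are read as-is.
    while (!feof($handle)) {
        $chunk = fread($handle, 4096);

        if ($chunk === false) {
            // Read error: stop here instead of looping forever.
            $isText = false;
            break;
        }

        if (($length = strlen($chunk)) === 0) {
            continue;
        }

        $controlCharCount = 0;

        for ($i = 0; $i < $length; $i++) {
            // CR and LF are normal in text files, so they are not counted.
            if ($chunk[$i] !== "\r" && $chunk[$i] !== "\n" && ctype_cntrl($chunk[$i])) {
                $controlCharCount++;
            }
        }

        // More than 1% control characters in this chunk: assume it's not text.
        if ($controlCharCount * 100 > $length) {
            $isText = false;
            break;
        }
    }

    // Close the handle on every path, including the early exits above.
    fclose($handle);

    return $isText;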
}
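
To make the 1% threshold concrete, here is a small sketch of the same counting idea applied to an in-memory string. The helper name controlCharRatio is hypothetical and not part of the function above; it uses count_chars() rather than a per-byte loop, and it assumes "control character" means bytes 0-31 and 127, which is what ctype_cntrl() matches in the default C locale:

```php
<?php
// Hypothetical helper, not part of the answer's code: fraction of bytes in
// $data that are ASCII control characters (0-31 and 127), ignoring CR and LF.
function controlCharRatio(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }

    $controlCharCount = 0;
    // count_chars() mode 1 returns [byte value => occurrences] for bytes present.
    foreach (count_chars($data, 1) as $byte => $count) {
        $isControl = $byte < 32 || $byte === 127;
        if ($isControl && $byte !== 10 && $byte !== 13) {
            $controlCharCount += $count;
        }
    }

    return $controlCharCount / $length;
}

// One NUL byte in 200 bytes is 0.5%, under the 1% threshold.
$mostlyText = str_repeat('a', 199) . "\0";
var_dump(controlCharRatio($mostlyText) <= 0.01);   // bool(true)

// Four control bytes out of eight is 50%, well over it.
$binaryish = "\x00\x01\x02\x03abcd";
var_dump(controlCharRatio($binaryish) <= 0.01);    // bool(false)
```

Like the function above, this counts tabs as control characters. If tab-separated files are expected, excluding byte 9 alongside CR and LF is probably worth considering.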
Sjors Ottjes