
I have the following PHP code that shows the mime type of an uploaded file.

<?php

if ($_POST) {

    var_dump($_FILES);

    $finfo = new finfo(FILEINFO_MIME_TYPE);

    var_dump($finfo->file($_FILES['file']['tmp_name']));

} else {
    ?>
    <form method="POST" enctype="multipart/form-data"><input name="file" type="file"><input name="submit" value="send" type="submit"/></form>
    <?php
}

The result of uploading somefile.csv with this script is as follows.

array (size=1)
    'file' =>
    array (size=5)
        'name' => string 'somefile.csv' (length=12)
        'type' => string 'text/csv' (length=8)
        'tmp_name' => string '/tmp/phpKiwqtu' (length=14)
        'error' => int 0
        'size' => int 3561
string 'text/x-fortran' (length=14)

The MIME type should of course be text/csv, but the framework I use (Symfony 1.4) relies on fileinfo to detect it.

I also tested a little further. On Ubuntu, the command file --mime-type somefile.csv returns somefile.csv: text/x-fortran, while the command mimetype somefile.csv returns somefile.csv: text/csv. somefile.csv was created with MS Office (I don't know if this matters). Apparently mimetype uses the freedesktop.org shared MIME database (http://freedesktop.org/wiki/Software/shared-mime-info), while file does not.

  1. Does PHP use file or mimetype or neither?
  2. Further, I am not sure what to do here; is my uploaded file wrongly formatted? Do I have to use a different mime database? Is PHP bugged? What is going on here?

edit:

The file is detected as a Fortran program because somefile.csv contains only the following:

somecolumn;
C F;

I believe the contents above are valid CSV, right? If a field contains a space, it does not have to be enclosed in quotes, right?

meijuh

2 Answers

6

I don't have a Unix box here to inspect a real "magic" file (the signatures database used to guess mime types) but a quick Google search revealed this:

# $File: fortran,v 1.6 2009/09/19 16:28:09 christos Exp $
# FORTRAN source
0       regex/100       \^[Cc][\ \t]    FORTRAN program
!:mime  text/x-fortran

Apparently, it scans the start of the file for lines that begin with a single letter C followed by a space or tab, which seems to be a Fortran-style comment. Thus the false positive:

somecolumn;
C F;
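
The match can be approximated in PHP. This is only a minimal sketch, not libmagic's actual matching code: it assumes the rule behaves like a multi-line regular expression search for `^[Cc][ \t]`, and the helper name `looksLikeFortranComment` is made up for illustration.

```php
<?php

/**
 * Rough approximation of the magic rule `\^[Cc][\ \t]`: report whether any
 * line starts with "C" or "c" followed by a space or a tab.
 * Hypothetical helper; the real rule also limits how much of the file is scanned.
 */
function looksLikeFortranComment(string $contents): bool
{
    // The "m" modifier lets ^ match at the start of every line, not just the first.
    return preg_match('/^[Cc][ \t]/m', $contents) === 1;
}

// The second line starts with "C ", so the pattern matches.
var_dump(looksLikeFortranComment("somecolumn;\nC F;\n"));     // bool(true)

// With the field quoted, the line starts with a double quote instead.
var_dump(looksLikeFortranComment("somecolumn;\n\"C F\";\n")); // bool(false)
```

This also suggests why quoting every cell makes the false positive go away: the line no longer begins with the letter C.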
Álvaro González
  • So, how should I handle false positives? I know that one solution is to put quotes around every cell, but that is not really what I want, since the users of my web application upload these CSV files. And the example shown is a valid CSV file. – meijuh Apr 24 '13 at 15:47
  • It depends on your exact needs but, in this situation, it's probably better to use the file extension as well. You could also remove the Fortran entry from your magic file. (Not sure why you use heuristics here if you already know it's CSV; guessing the MIME type won't validate the file.) – Álvaro González Apr 24 '13 at 15:49
  • Well the CSV file is uploaded by a user of the application. If false positives are results of guessing mime types then it does not really make sense to use mime type guessing. I'll just make sure the file is not executable in a public folder and users should be aware of what they are downloading. Also since I am using only CSV files and the syntax of the CSV file must be correct I can also check the contents of a CSV file with its BNF syntax. – meijuh Apr 24 '13 at 18:00
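
The suggestion to take the file extension into account could look roughly like the following. This is a sketch under assumptions, not what Symfony or Fileinfo does: the function name `resolveMimeType` and the one-entry extension map are hypothetical, and the detected type is assumed to come from something like `finfo::file()`.

```php
<?php

/**
 * Hypothetical helper: trust the extension only when content sniffing
 * reports some kind of text. The extension map is illustrative.
 */
function resolveMimeType(string $detected, string $originalName): string
{
    $byExtension = [
        'csv' => 'text/csv',
    ];

    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    // Only override text/* guesses, so a binary file renamed to .csv
    // keeps the type that was detected from its contents.
    if (isset($byExtension[$extension]) && strpos($detected, 'text/') === 0) {
        return $byExtension[$extension];
    }

    return $detected;
}

echo resolveMimeType('text/x-fortran', 'somefile.csv'), "\n";  // text/csv
echo resolveMimeType('application/zip', 'somefile.csv'), "\n"; // application/zip
```

The original file name is supplied by the client, so this only narrows the guess; as noted above, it does not validate the file.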
0

From PHP Mimetype introduction:

This extension has been deprecated as the PECL extension Fileinfo provides the same functionality (and more) in a much cleaner way.

The functions in this module try to guess the content type and encoding of a file by looking for certain magic byte sequences at specific positions within the file. While this is not a bullet proof approach the heuristics used do a very good job.

This extension is derived from Apache mod_mime_magic, which is itself based on the file command maintained by Ian F. Darwin. See the source code for further historic and copyright information.

From PHP Fileinfo introduction:

The functions in this module try to guess the content type and encoding of a file by looking for certain magic byte sequences at specific positions within the file. While this is not a bullet proof approach the heuristics used do a very good job.

Here's a question with some answers on the same subject: Detecting MIME type in PHP.

Rolando Isidoro
  • http://pear.php.net/package/MIME_Type gives the same result as Fileinfo. I don't understand why a CSV file appears to be a Fortran file. – meijuh Apr 24 '13 at 11:57
  • Looking at Fortran code examples I can't figure out why that's happening, they're completely different. If you open that particular CSV file in a simple text editor does it look like plain CSV or does it have other elements that might lead to that mix-up result? – Rolando Isidoro Apr 24 '13 at 13:15
  • Another 5 cents: I googled for well-established PHP-based web apps and here's another approach. Drupal 8 seems to use the Guzzle PHP framework to do the job; take a look at their code at https://github.com/guzzle/guzzle/blob/master/src/Guzzle/Http/Mimetypes.php. They just do a simple extension check against a list of predefined known MIME types. Not bulletproof either, I'd say. – Rolando Isidoro Apr 24 '13 at 13:21
  • I edited my initial post. I found a minimal amount of content for the CSV-file to make it look like fortran code. I also believe that the content is valid for a CSV file. What to do with it? – meijuh Apr 24 '13 at 15:30
  • From the look of your file content I wouldn't say it was a CSV file, as it doesn't follow the [RFC 4180 definition](http://tools.ietf.org/html/rfc4180). It is more like "space-separated values ending with a semicolon". You can read some considerations regarding the lack of a standard format for CSV files on [Wikipedia](http://en.wikipedia.org/wiki/Comma-separated_values#Lack_of_a_standard). – Rolando Isidoro Apr 24 '13 at 15:38
  • Section 2, item 4 of the RFC states `Spaces are considered part of a field and should not be ignored.` I believe this is valid CSV format. I don't think that a field which contains a space should have quotes around it. Also, if I open the file with LibreOffice and then save it again as a different CSV file, it does not put quotes around the cell either. – meijuh Apr 24 '13 at 15:42
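
One way to see how a parser treats the space is to feed the two lines from the question to PHP's own `str_getcsv()`. This is a small sketch, assuming `;` is the intended delimiter (RFC 4180 itself uses commas); the wrapper name `parseSemicolonLine` is hypothetical, and the result shows what PHP's parser does rather than proving RFC conformance.

```php
<?php

/**
 * Hypothetical wrapper: parse one line with ";" as the field separator.
 * Enclosure and escape are passed explicitly rather than relying on defaults.
 */
function parseSemicolonLine(string $line): array
{
    return str_getcsv($line, ';', '"', '\\');
}

// The unquoted space stays inside the first field; the trailing ";"
// produces an empty second field.
var_dump(parseSemicolonLine('C F;'));
// array(2) { [0]=> string(3) "C F", [1]=> string(0) "" }

var_dump(parseSemicolonLine('somecolumn;'));
// array(2) { [0]=> string(10) "somecolumn", [1]=> string(0) "" }
```

So, with `;` as the delimiter, "C F" is read as a single field without needing quotes.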