How to accurately determine mime data from a file?

Question

I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

javax.activation.FileDataSource

FileDataSource ds = new FileDataSource("~\\Downloads\\777135_new.xls");  
String contentType = ds.getContentType();  
System.out.println("The MIME type of the file is: " + contentType);

//output = The MIME type of the file is: application/octet-stream

Method 2:

import net.sf.jmimemagic.*;

try (RandomAccessFile f = new RandomAccessFile("~\\Downloads\\777135_new.xls", "r"))
{
    byte[] fileBytes = new byte[(int)f.length()];
    f.readFully(fileBytes); // read() alone may return before the buffer is full
    MagicMatch match = Magic.getMagicMatch(fileBytes);
    System.out.println("The Mime type is: " + match.getMimeType());
}
catch(Exception e)
{
    System.out.println(e);
}

//output = The Mime type is: application/msword

Method 3:

import eu.medsea.mimeutil.*;

MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
File f = new File ("~\\Downloads\\777135_new.xls");
Collection<?> mimeTypes = MimeUtil.getMimeTypes(f);
String mimeType = MimeUtil.getFirstMimeType(mimeTypes.toString()).toString();
String subMimeType = MimeUtil.getSubType(mimeTypes.toString());
System.out.println("The Mime type is: " + mimeTypes + ", " + mimeType + ", " + subMimeType);

//output = The Mime type is: application/msword, application/msword, msword

I found these three methods at http://www.rgagnon.com/javadetails/java-0487.html. My problem is that the file I am testing with is one I created myself, so I know it's an Excel file, yet methods 2 and 3 incorrectly report the type as msword. Method 1 returns application/octet-stream instead, which I believe is because of the limited number of file types in the built-in FileTypeMap that it uses.

I've had a look around, and some people say the cause is the way the offset is detected in the files, so the content type is picked up incorrectly, as pointed out in this wiki on detecting file types in PHP. Unfortunately, the wiki then goes on to use the extension to determine the file type, which isn't what I want to do as it's unreliable.

Can anyone point me in the right direction to a method that will detect the file types correctly within Java please?

Cheers, Alexei Blue.

Edit: Looks like there is no specific solution to this, as @IronMensan said in the comment below. I did find this really interesting research paper that applies machine learning in a few ways to help with the issue, but there doesn't seem to be a foolproof answer. I think my best bet here will be to try passing the file to an Excel file reader and catch any incorrect-format exceptions.

No solution is going to be perfect because of the vast number of file types in the world and the problem is ultimately a guessing game based on the file contents. Some methods will be better than others. — IronMensan, Dec 13 '11 at 12:56
Hi IronMensan, thanks for the comment. Any idea why checking the MIME type of an Excel file returns an msword type, though? I thought this would be a well-recognised type by now, and Excel files will be the most important for me to get right... :) Cheers again — Alexei Blue, Dec 13 '11 at 13:11
Does the `file` command return correct results for your samples? It comes with a library `libmagic` although I guess one of your attempts somehow uses that, or a derivative. Still, it's the de facto standard solution. As for the Word misdetections, I guess the recognizer actually finds the top-level container, which is the same for several Office file formats. — tripleee, Dec 13 '11 at 20:47
The file command just says it's a Microsoft Office Document which is a step in the right direction but not specific enough for my needs. I've been looking around and it seems this is an active research area involving feature selection as there's no specific standard for MIME types. I did find this research paper [http://www.alphaminers.net/thesis/International%20Conference/IAKLHSMH_2010.pdf] that might help but it makes what I thought to be a simple problem a lot harder to implement. — Alexei Blue, Dec 15 '11 at 20:53
Please note that there are more MimeDetectors available for Mime-Utils: http://stackoverflow.com/a/13826438/2413303 — EpicPandaForce, Feb 04 '15 at 10:55

score 31 · Answer 1 · answered Feb 12 '12 at 10:33

So far, the most accurate tool I've found to determine a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):

import java.io.File;
import java.io.IOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;

private static final Detector DETECTOR = new DefaultDetector(
        MimeTypes.getDefaultMimeTypes());

public static String detectMimeType(final File file) throws IOException {
    TikaInputStream tikaIS = null;
    try {
        tikaIS = TikaInputStream.get(file);

        /*
         * You might not want to provide the file's name. If you provide an Excel
         * document with a .xls extension, it will get it correct right away; but
         * if you provide an Excel document with .doc extension, it will guess it
         * to be a Word document
         */
        final Metadata metadata = new Metadata();
        // metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        return DETECTOR.detect(tikaIS, metadata).toString();
    } finally {
        if (tikaIS != null) {
            tikaIS.close();
        }
    }
}

Since Tika will use magic numbers, but also look at the contents of files when unsure, the process can be a little time-expensive (it took 3.268 secs for my PC to examine 15 files).

Also, don't make the same mistake I did at first. If you get the tika-core JAR, you should also get the tika-parsers JAR. If you don't get tika-parsers you won't get any exceptions, you will simply not get the MIME type accurately, so it is REALLY important to include it.

An alternative is to get the tika-app JAR, which contains tika-core, tika-parsers and all of the dependencies (there are a lot of them: poi, poi-ooxml, xmlbeans, commons-compress, just to name a few).

score 3 · Accepted Answer · answered Dec 16 '11 at 01:11

As mentioned in the comments, there are so many possible file types that detection could be hit and miss across ALL possible files, but you probably know the types of files you are typically going to be dealing with. This excellent list of magic numbers has helped me recently with detection of the specific Office formats you mentioned (search for Microsoft Office). You'll see that the MS Office file types have a sub-type specified (which is further into the file) that lets you work out specifically which type of file you have. Many newer formats such as ODT, DOCX and other OOXML formats use a ZIP file to hold their data, so you might need to detect ZIP first, then look for specifics.
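The "detect ZIP first, then look for specifics" idea could be sketched roughly as follows, using only the JDK. `ZipOfficeSniffer` and its methods are hypothetical names, not part of any library. The sketch assumes the conventional OOXML layout (entries under `xl/`, `word/` or `ppt/`) and the OpenDocument `mimetype` entry; a file that deviates from those conventions would fall through to plain `application/zip`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper (not a library class): classifies ZIP-based office files. */
final class ZipOfficeSniffer {
    static final String XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    static final String DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation";

    private ZipOfficeSniffer() {}

    /** True if the data starts with the ZIP local file header signature "PK\3\4". */
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4 && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** Returns a MIME type guess, or null if the data does not look like a ZIP archive. */
    static String sniff(byte[] data) {
        if (!looksLikeZip(data)) {
            return null;
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                // Assumption: conventional OOXML part names identify the document family.
                if (name.startsWith("xl/")) return XLSX;
                if (name.startsWith("word/")) return DOCX;
                if (name.startsWith("ppt/")) return PPTX;
                if (name.equals("mimetype")) {
                    // OpenDocument stores its MIME type as the content of this entry.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[256];
                    for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                        out.write(buf, 0, n);
                    }
                    return new String(out.toByteArray(), StandardCharsets.US_ASCII).trim();
                }
            }
            return "application/zip"; // a ZIP archive, but no known office marker found
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Demo helper: builds a one-entry ZIP in memory so the sketch can be tried without files. */
    static byte[] zipWith(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }
}
```

For a real file, something like `ZipOfficeSniffer.sniff(Files.readAllBytes(path))` would do, although reading the whole file into memory is only reasonable for small inputs.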

I've implemented it as a bit of a work around by reading the 8 bytes from the offset 512 and then comparing them with a constant but it works great. :) Thanks Jowierun — Alexei Blue, Dec 16 '11 at 15:57
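A minimal sketch of the workaround described in this comment might look like the code below. `Ole2Sniffer` is a hypothetical name, and the byte constants are the ones commonly given in public magic-number lists for the OLE2 container, the Excel subheader and the Word subheader; treat them as assumptions to verify against your own files. Checking offset 512 is a heuristic: it relies on the relevant stream starting in the first sector after the 512-byte header, which is common but not guaranteed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not a library class): tells legacy Excel and Word files apart. */
final class Ole2Sniffer {
    enum Kind { NOT_OLE2, EXCEL, WORD, UNKNOWN_OLE2 }

    // OLE2 / Compound File Binary signature at offset 0.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // Subheaders expected at offset 512 (assumption taken from published magic-number lists).
    private static final byte[] XLS_SUBHEADER = {0x09, 0x08, 0x10, 0x00, 0x00, 0x06, 0x05, 0x00};
    private static final byte[] DOC_SUBHEADER = {(byte) 0xEC, (byte) 0xA5, (byte) 0xC1, 0x00};

    private Ole2Sniffer() {}

    /** Classifies a buffer holding at least the first 520 bytes of a file. */
    static Kind sniff(byte[] data) {
        if (!matchesAt(data, 0, OLE2_MAGIC)) return Kind.NOT_OLE2;
        if (matchesAt(data, 512, XLS_SUBHEADER)) return Kind.EXCEL;
        if (matchesAt(data, 512, DOC_SUBHEADER)) return Kind.WORD;
        return Kind.UNKNOWN_OLE2; // same container, but no subheader we recognise
    }

    /** Reads only the first 520 bytes of the file, then classifies them. */
    static Kind sniff(Path file) throws IOException {
        byte[] head = new byte[520];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) != -1) {
                filled += n;
            }
        }
        return sniff(Arrays.copyOf(head, filled));
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (data.length < offset + pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) return false;
        }
        return true;
    }
}
```

Returning an enum rather than a MIME string keeps the "recognised container, unknown content" case explicit, so the caller can decide whether to fall back to another detector.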

score 0 · Answer 3 · answered Feb 04 '15 at 11:09

I'm not entirely sure how accurate it is, but this worked for me in simple cases.

    FileNameMap fileNameMap = URLConnection.getFileNameMap();
    String type = fileNameMap.getContentTypeFor(filePath);


EpicPandaForce


if file extension is different, e.g: I have `redirect.mappings` which is `*.properties` file, then your code above gets `null` – To Kra Apr 20 '15 at 11:15
@ToKra to be honest, normally I wanted to use `MimeUtils` as per http://stackoverflow.com/questions/13775494/java-get-file-type-from-content-using-mimeutil-is-not-working-as-expected?lq=1 but it had a pretty extreme dependency, the entire **slf4j** logger, and it just wasn't working on Android where I needed it, and I didn't feel like ripping it out at the time. – EpicPandaForce Apr 20 '15 at 11:34
That is still the case. I tried to use MimeUtils with spring boot 1.2.7 and the log4j dependency clashed with SB's logback dependency. I tried to exclude log4j from MimeUtils, but then it didn't compile. – Tom Silverman Jun 02 '16 at 09:45
