22

I have tried the following ways of finding the MIME type of a file:

Path source = Paths
                .get("C://Users/akash/Desktop/FW Internal release of MSTClient-Server5.02.04_24.msg");
        System.out.println(Files.probeContentType(source));

The above code returns null.
If I use the Apache Tika API instead, it reports the MIME type as text/plain.

But I want the result to be application/vnd.ms-outlook.

UPDATE

I also tried MIME-Util.jar with the following code:

MimeUtil2 mimeUtil = new MimeUtil2();
        mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
        RandomAccessFile file1 = new RandomAccessFile(
                "C://Users/akash/Desktop/FW Internal release of MSTClient-Server5.02.04_24.msg",
                "r");
        System.out.println(file1.length());
        // Size the buffer from the file itself rather than a hard-coded length
        byte[] file = new byte[(int) file1.length()];
        file1.readFully(file);
        file1.close();
        String mimeType = MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(file)).toString();

This gives application/msword as output.

UPDATE:

Tika API is out of scope as it is too large to include in the project...

So how can I find the MIME type?

CoderNeji
  • You can use [magic number](https://en.wikipedia.org/wiki/Magic_number_%28programming%29) to check the file and return the mimetype `application/vnd.ms-outlook`. For .msg : `D0 CF 11 E0 A1 B1 1A E1` – Duffydake Jun 26 '15 at 11:22
  • Can you please give me link reference from where you got this particular magic number... because it exists in every file having CFB configuration for its packing of bytes... – CoderNeji Jun 26 '15 at 11:30
  • I found it [here](https://billatnapier.wordpress.com/2013/04/22/magic-numbers-in-files/), but you are right, it does not seem to be correct. – Duffydake Jun 26 '15 at 11:31
  • the .MSG file you are using has been generated from which program? – Paizo Jun 29 '15 at 08:01
  • It is created using outlook. – CoderNeji Jun 29 '15 at 09:43
  • OK, now that you have changed the question and don't want to use Apache Tika, is Apache POI too _big_ as well? – Paizo Jul 02 '15 at 09:37
  • Apache poi is good to include – CoderNeji Jul 02 '15 at 09:43

4 Answers

10

I tried some of the possible ways, and Tika gives the result you expected. I can't see the code you used, so I cannot double-check it.

I tried different ways, not all in the code snippet:

  1. Java 7 Files.probeContentType(path)
  2. URLConnection mime detection from file name and content type guessing
  3. JDK 6 JAF API javax.activation.MimetypesFileTypeMap
  4. MimeUtil with all the available subclasses of MimeDetector I found
  5. Apache Tika
  6. Apache POI scratchpad

Here is the test class:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.Collection;

import javax.activation.MimetypesFileTypeMap;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;

import eu.medsea.mimeutil.MimeUtil;

public class FindMime {

    public static void main(String[] args) {
        File file = new File("C:\\Users\\qwerty\\Desktop\\test.msg");

        System.out.println("urlConnectionGuess " + urlConnectionGuess(file));

        System.out.println("fileContentGuess " + fileContentGuess(file));

        MimetypesFileTypeMap mimeTypesMap = new MimetypesFileTypeMap();

        System.out.println("mimeTypesMap.getContentType " + mimeTypesMap.getContentType(file));

        System.out.println("mimeutils " + mimeutils(file));

        System.out.println("tika " + tika(file));

    }

    private static String mimeutils(File file) {
        try {
            MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
            MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.ExtensionMimeDetector");
//          MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.OpendesktopMimeDetector");
            MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.WindowsRegistryMimeDetector");
//          MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.TextMimeDetector");
            InputStream is = new BufferedInputStream(new FileInputStream(file));
            Collection<?> mimeTypes = MimeUtil.getMimeTypes(is);
            return mimeTypes.toString();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
        return null;
    }

    private static String tika(File file) {
        try {
            InputStream is = new BufferedInputStream(new FileInputStream(file));
            AutoDetectParser parser = new AutoDetectParser();
            Detector detector = parser.getDetector();
            Metadata md = new Metadata();
            md.add(Metadata.RESOURCE_NAME_KEY, "test.msg");
            MediaType mediaType = detector.detect(is, md);
            return mediaType.toString();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
        return null;
    }

    private static String urlConnectionGuess(File file) {
        String mimeType = URLConnection.guessContentTypeFromName(file.getName());
        return mimeType;
    }

    private static String fileContentGuess(File file) {
        try {
            InputStream is = new BufferedInputStream(new FileInputStream(file));
            return URLConnection.guessContentTypeFromStream(is);
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

}

and this is the output:

urlConnectionGuess null
fileContentGuess null
mimeTypesMap.getContentType application/octet-stream
mimeutils application/msword,application/x-hwp
tika application/vnd.ms-outlook

Update: I added this method to test other ways of using Tika:

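// Needs these additional imports at the top of the class file:
// import org.apache.tika.Tika;
// import org.apache.tika.detect.TypeDetector;
// import org.apache.tika.mime.MimeTypes;
private static void tikaMore(File file) {
    Tika defaultTika = new Tika();                  // default detector configuration
    Tika mimeTika = new Tika(new MimeTypes());      // MimeTypes used as the detector
    Tika typeTika = new Tika(new TypeDetector());   // TypeDetector used as the detector
    try {
        System.out.println(defaultTika.detect(file));
        System.out.println(mimeTika.detect(file));
        System.out.println(typeTika.detect(file));
    } catch (Exception e) {
        // Report the failure instead of silently swallowing it
        e.printStackTrace();
    }
}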

Tested with a .msg file without an extension:

application/vnd.ms-outlook
application/octet-stream
application/octet-stream

Tested with a .txt file renamed to .msg:

text/plain
text/plain
application/octet-stream

It seems that the simplest approach, using the empty constructor, is the most reliable in this case.

Update: you can write your own checker using the Apache POI scratchpad. For example, this simple implementation returns the MIME type of the message, or null if the file is not in the proper format (usually signalled by org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature):

import org.apache.poi.hsmf.MAPIMessage;

public class PoiMsgMime {

    public String getMessageMime(String fileName) {
        try {
            new MAPIMessage(fileName);
            return "application/vnd.ms-outlook";
        } catch (Exception e) {
            return null;
        }
    }
}
Paizo
  • It's not giving me the desired result. Even if I take a text file, rename its extension to .msg, and use that file to get the MIME type, it still gives the output tika application/vnd.ms-outlook. Thanks for your work though. – CoderNeji Jul 01 '15 at 05:53
  • see if my updated answer may help. The initial tika test is fooled by using `md.add(Metadata.RESOURCE_NAME_KEY, "test.msg");` that makes it rely on the file extension – Paizo Jul 01 '15 at 10:23
  • Your updated code has the same problem... Sorry... Do the following steps.... 1. create a text file. 2. Save it. 3. Rename the file extension to .msg 4. Run the program using this file.... You will get the output as application/vnd.ms-outlook – CoderNeji Jul 01 '15 at 10:35
  • That is exactly what I did using the `tikaMore` method, and the result is `text/plain`; please give it a try with the method above – Paizo Jul 01 '15 at 10:42
  • Is there any other way of doing it... Except the tika api because its too large.. and not serving the purpose correctly – CoderNeji Jul 01 '15 at 10:44
  • please check my last update using Apache POI scratchpad – Paizo Jul 05 '15 at 22:18
  • when i tried with tika-core 1.14, it gives me application/x-tika-msoffice. not application/vnd.ms-outlook – maya16 Jun 07 '17 at 06:52
4

Taking a cue from @Duffydake's comment, I tried reading the magic numbers. Agreed, the first 8 bytes of the header are the same for all of these MS files: D0 CF 11 E0 A1 B1 1A E1 (it is interesting that the first four bytes look like "DoCFilE"). However, you can check this link to see how to interpret the complete header and find the file type. (The example in the link identifies an Excel file, but you can use similar byte reading to identify the .msg file type.)

If you can assume that no one is going to play around and store a .doc or .xls file as a .msg file, then you can just read the first 8 bytes of the header and combine that check with the file extension, e.g. if (fileExtension.equals(".msg") && hexHeaderString.equals("D0 CF 11 E0 A1 B1 1A E1")) { mimeType = "application/vnd.ms-outlook"; }
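The header-plus-extension check described above could be sketched with only the JDK as follows. The class and method names (`MsgHeaderCheck`, `guessMsgMime`) are hypothetical and chosen for illustration; the signature bytes and the MIME type are the ones quoted in this thread.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: combines the 8-byte header check with the file extension.
class MsgHeaderCheck {

    // Signature found at the start of the compound files discussed here (.doc, .xls, .msg).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns "application/vnd.ms-outlook", or null if the extension or header does not match. */
    static String guessMsgMime(Path file) throws IOException {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        if (!name.endsWith(".msg") || Files.size(file) < OLE2_MAGIC.length) {
            return null;
        }
        byte[] header = new byte[OLE2_MAGIC.length];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readFully(header); // read exactly the first 8 bytes
        }
        return Arrays.equals(header, OLE2_MAGIC) ? "application/vnd.ms-outlook" : null;
    }
}
```

This rejects a text file renamed to .msg, but, as noted in the comments, a .doc or .xls file renamed to .msg would still pass, because those files start with the same 8 bytes.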

Optional
  • Actually my application is for a client and here i can't assume anything... i have already tried the header reading of 8 bytes.... Sorry.... – CoderNeji Jul 05 '15 at 13:29
  • Then don't read only the 8 bytes; read more bytes, as mentioned in the link. The link clearly explains how you can figure out from the header that the file is Excel. You can try similar header reading to identify the .msg file. Did you check the link I pasted? – Optional Jul 06 '15 at 04:39
2

What you could do is convert the file to a byte[] and then use MimeMagic (Maven location here) to handle it. Something like this:

// Read the whole file with the JDK (java.nio.file.Files and java.nio.file.Paths)
byte[] data = Files.readAllBytes(Paths.get("file.msg"));
MagicMatch match = Magic.getMagicMatch(data);
String mimeType = match.getMimeType();

I'm not really sure that this will work 100%, but to try is not to die :)

user
0

I had to find another workaround. What I found was that MS documents (doc, docx, xls, xlsx, msg) are container files that an archive tool can expand: docx and xlsx are ZIP archives, while doc, xls and msg are OLE2 compound files. I have not tested every MS file type, as that is outside the current scope.

Simply expand the file and:

Docx : open [Content_Types].xml and check if it contains "wordprocessingml"

XlsX : open [Content_Types].xml and check if it contains "spreadsheetml"

doc : check for file "WordDocument"

xls : check for file "Workbook"

msg : check for file "__properties_version1.0"

I am still testing msg to see if there is something better to use, but this file exists in sent and unsent messages, so I assume it is safe to use.
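For the two ZIP-based formats in the list above, the check can be sketched in Java with `java.util.zip` alone. The class and method names (`OoxmlSniffer`, `guessOoxmlMime`) are hypothetical. The sketch covers only docx and xlsx: doc, xls and msg files are OLE2 compound files rather than ZIP archives, so looking for "WordDocument", "Workbook" or "__properties_version1.0" needs an OLE2 reader such as Apache POI instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Hypothetical helper: inspects [Content_Types].xml inside a ZIP-based Office file.
class OoxmlSniffer {

    /** Returns the docx or xlsx MIME type, or null if the file is not a recognised ZIP-based document. */
    static String guessOoxmlMime(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                return null; // a ZIP archive, but not an Office Open XML package
            }
            String xml;
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
            if (xml.contains("wordprocessingml")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (xml.contains("spreadsheetml")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
            return null;
        } catch (ZipException e) {
            return null; // not a ZIP archive at all (e.g. an OLE2 file or plain text)
        }
    }
}
```

Because the decision is based on the package contents, renaming a text file to .docx does not fool this check.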

Atron Seige
  • I am working in .Net, so I am not sure how to do it in Java. In my case we use the 7zip application to expand the files. You can (I assume) use the built in Compression/Decompression modules in your environment. Look at this post. http://stackoverflow.com/questions/9324933/what-is-a-good-java-library-to-zip-unzip-files – Atron Seige Sep 17 '15 at 13:20