
Objective: given a file, determine whether it is of a given type (XML, JSON, Properties, etc.).

Consider the case of XML. Until we ran into this issue, the following approach worked fine:

    try {
        saxReader.read(f);
    } catch (DocumentException e) {
        logger.warn("  - File is not XML: " + e.getMessage());
        return false;
    }
    return true;

As expected, when the XML is well-formed, the parse succeeds and the method returns true. If the file cannot be parsed, the method returns false.

This breaks down, however, when we deal with a malformed XML file, which is still XML but fails to parse.

I'd rather not rely on the .xml extension (which fails all the time) or on looking for the <?xml version="1.0" encoding="UTF-8"?> string inside the file, etc.

Is there another way this can be handled?

What would you have to see inside the file to suspect that it may be XML even though a DocumentException was caught? This is needed for parsing purposes.

Mark Rotteveel
James Raitsev
  • Kinda related: http://stackoverflow.com/questions/3600222/code-for-identifying-programming-language-in-a-text-file – PeterK Mar 16 '12 at 14:03
  • You can't get a definitive answer to "what kind of file is it?", only to "can I pretend it is of type X?" (the answer can be "yes" to zero or more X's, not just zero or one). But you can throw in statistics and see if there are many of `<\w+>` (probably XML), many `"\w+"` (probably JSON) compared to the total number of tokens and otherwise it could be properties. – harold Mar 16 '12 at 15:12
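The statistical idea from harold's comment could be sketched roughly as follows. This is only an illustration, not a definitive detector: the class `FormatGuesser`, its regular expressions, and the tie-breaking order are all assumptions made for the example, and it compares the raw counts against each other rather than against the total number of tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of any library): guesses a format from token counts.
final class FormatGuesser {
    // Opening or closing tags such as <name>, <name attr="x"> or </name>.
    private static final Pattern XML_TAG = Pattern.compile("</?\\w[\\w.:-]*[^<>]*>");
    // Quoted strings followed by a colon, such as "name":
    private static final Pattern JSON_KEY = Pattern.compile("\"[^\"\\n]*\"\\s*:");
    // Lines of the form key=value or key:value, as found in .properties files.
    private static final Pattern PROPERTY_LINE =
            Pattern.compile("(?m)^\\s*[\\w.-]+\\s*[=:].*$");

    private FormatGuesser() {}

    // Counts the non-overlapping matches of the pattern in the text.
    private static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns "xml", "json", "properties" or "unknown"; ties favour the earlier name.
    static String guess(String text) {
        int xml = count(XML_TAG, text);
        int json = count(JSON_KEY, text);
        int props = count(PROPERTY_LINE, text);
        if (xml == 0 && json == 0 && props == 0) {
            return "unknown";
        }
        if (xml >= json && xml >= props) {
            return "xml";
        }
        return json >= props ? "json" : "properties";
    }
}
```

Because it never parses the input, a check like this can still report "xml" for a file that made the parser throw, which is the "suspect it may be XML" case from the question. It is a guess, though, and can be fooled by files that mix syntaxes.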

3 Answers


Apache Tika gives me the fewest issues and, unlike Java 7's Files.probeContentType, is not platform-specific:

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    File inputFile = ...
    // detect(File) can throw an IOException if the file cannot be read
    String type = new Tika().detect(inputFile);
    System.out.println(type);

For an XML file I got 'application/xml'.

For a properties file I got 'text/plain'.

You can, however, supply your own Detector when constructing the Tika instance.

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.xx</version>
    </dependency>
rjdkolb

For those who do not need very precise detection, there is Java 7's Files.probeContentType method (mentioned by rjdkolb):

    import java.nio.file.*;

    Path filePath = Paths.get("/path/to/your/file.jpg");
    String contentType = Files.probeContentType(filePath);
kazy
  • Hi, on Windows 7 64-bit with JDK 1.8, the above method returns null for all file types. Is this an OpenJDK bug, as mentioned at https://bugs.openjdk.java.net/browse/JDK-8080369? – bespectacled Jan 03 '17 at 12:52
  • This also breaks on some macOS versions, Amazon Corretto 8, etc. I don't recommend using it. – Miron Ophir Aug 22 '21 at 09:36
  • It is worth mentioning that the default implementation may just analyze the file extension and fail if the extension is absent. In OpenJDK 16, this is done in sun.nio.fs.AbstractFileTypeDetector. IMO, this cannot be considered reliable file type detection. – Eugene Apr 13 '23 at 06:09
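Given the comments above that Files.probeContentType can return null on some platforms, anyone who still uses it may want to guard the call. A minimal sketch, assuming a caller-supplied fallback type is acceptable; the class `ContentTypeProbe` and the method `probeOrDefault` are hypothetical names made up for this example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper (not a JDK class) around Files.probeContentType.
final class ContentTypeProbe {
    private ContentTypeProbe() {}

    // Returns the probed content type, or the fallback when the platform
    // cannot determine one (null result) or the probe itself fails.
    static String probeOrDefault(Path file, String fallback) {
        try {
            String type = Files.probeContentType(file);
            return type != null ? type : fallback;
        } catch (IOException | SecurityException e) {
            return fallback;
        }
    }
}
```

For example, `ContentTypeProbe.probeOrDefault(Paths.get("/path/to/your/file.jpg"), "application/octet-stream")` never returns null. This only hides the platform differences; it does not make the detection itself any more reliable.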