With docx files, I get application/x-tika-ooxml, but I expect application/vnd.openxmlformats-officedocument.wordprocessingml.document instead.

Here is my method:

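import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public String retrieveMimeType(InputStream stream) throws IOException, TikaException {
    TikaConfig tikaConfig = new TikaConfig();
    // The original never assigned tikaStream, so it was never closed.
    // try-with-resources closes it (and the wrapped stream) even on failure.
    try (TikaInputStream tikaStream = TikaInputStream.get(stream)) {
        MediaType mediaType = tikaConfig.getDetector().detect(tikaStream, new Metadata());
        return mediaType.toString();
    }
}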

And my dependencies:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.27</version>
</dependency>

I use tika-core together with tika-parsers to retrieve the right MIME type, but it still gives me the wrong one.

Zahreddine Laidi

2 Answers

Update your Tika modules: tika-core and the other Tika modules should always have the same version. In Tika 2.x the standard parsers are packaged as tika-parsers-standard-package:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.1.0</version>
</dependency>

The newer Microsoft document formats (docx, xlsx, ...) look like plain ZIP archives from the outside. Older Tika versions will not look into them by default, which is why, depending on the version, they detect them as either application/zip or application/x-tika-ooxml. You can read more about this here.
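To make the "ZIP archive from the outside" point concrete, here is a minimal sketch that uses only the JDK, not Tika. It is not how Tika is implemented; the class OoxmlSniffer and its methods are hypothetical names, and it assumes the conventional part names word/document.xml and xl/workbook.xml. It only shows that the specific subtype can be told apart by looking at the entries inside the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration, not part of Tika.
class OoxmlSniffer {
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Guesses a type from the ZIP entry names alone (assumes conventional part names).
    static String sniff(InputStream in) throws IOException {
        boolean sawEntry = false;
        boolean sawContentTypes = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                sawEntry = true;
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    sawContentTypes = true;
                }
                if (name.equals("word/document.xml")) {
                    return DOCX;
                }
                if (name.equals("xl/workbook.xml")) {
                    return XLSX;
                }
            }
        }
        // Looks like OOXML, but the specific subtype could not be determined.
        if (sawContentTypes) {
            return "application/x-tika-ooxml";
        }
        return sawEntry ? "application/zip" : "application/octet-stream";
    }

    // Builds a tiny in-memory ZIP with the given entry names, for demonstration only.
    static byte[] zipOf(String... entryNames) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeDocx = zipOf("[Content_Types].xml", "word/document.xml");
        System.out.println(sniff(new ByteArrayInputStream(fakeDocx)));
    }
}
```

A detector that only checks the leading bytes sees the same ZIP signature for all of these files, which is why the generic types show up when nothing inspects the archive contents.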

Analyzing the archives can, however, reduce performance. To avoid this you could, depending on your use case, determine the MIME type from the file name (see below) or rely on an existing MIME type such as the Content-Type header.

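// fileName is the original file name, e.g. taken from the upload request
final Metadata metadata = new Metadata();
metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, fileName);
// detector can be obtained from e.g. tikaConfig.getDetector()
detector.detect(stream, metadata);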

In an HTTP request the file name might also be in the Content-Disposition header.
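As a rough sketch of pulling the file name out of a Content-Disposition value, using only the JDK: the class name ContentDispositionFileName and the method extract are hypothetical, and the pattern handles only the simple filename="..." or filename=token forms. It does not handle the encoded filename*= form, nor quoted values that themselves contain "; filename=".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or any HTTP library.
class ContentDispositionFileName {
    // Matches filename="report.docx" or filename=report.docx (simple forms only).
    private static final Pattern FILENAME = Pattern.compile(
            "(?i)(?:^|;)\\s*filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))");

    // Returns the file name if the header value carries one in a simple form.
    static Optional<String> extract(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = FILENAME.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        // Group 1 is the quoted form, group 2 the bare token form.
        String name = m.group(1) != null ? m.group(1) : m.group(2);
        return name.isEmpty() ? Optional.empty() : Optional.of(name);
    }

    public static void main(String[] args) {
        System.out.println(extract("attachment; filename=\"report.docx\""));
    }
}
```

The extracted name could then be passed as the resource name in the Metadata object. Since the header is supplied by the client, it is only a hint and should not be trusted on its own.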

Florian
  • But let's say I have a docx file: if I change its extension to .pdf, Tika will tell me the file is a PDF and not a DOCX, because it determined the MIME type from the name, right? So I will always have to analyze the archive to be sure that Tika gives me the right MIME type? @Florian – Zahreddine Laidi Nov 29 '21 at 23:23
  • If you want to be sure of the type, you need to give Apache Tika the whole file to detect based on. If you supply name + contents, the name is only used for specialising subtypes; it won't override the content-based detection – Gagravarr Nov 30 '21 at 09:55
  • I am facing the same issue with Office files such as spreadsheets, where the content type is always "application/x-tika-ooxml" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet". The data whose content type must be determined always arrives as a stream, with no information about the extension, so we cannot build a metadata object with the "RESOURCE_NAME_KEY" property set to a file name and extension. I am using tika-core and tika-parsers-standard-package, both at version 2.1.0. Can you suggest a solution? – K D Dec 07 '21 at 17:34

For me, these older Tika versions worked. I am not sure why, but with them I get the expected result.

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.23</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.23</version>
</dependency>