This is my related code snippet:

try (Stream<Path> paths = Files.list(Paths.get(this.getClass().getClassLoader().getResource(directoryResource).getPath()))) {
    for (Path path : paths.collect(Collectors.toList())) {
        try (InputStream is = Files.newInputStream(path)) {
            String mediaType = this.tikaService.getMimeType(is);
            assertEquals(Files.probeContentType(path), mediaType);
        }
    }
}

As you can see, this.tikaService.getMimeType(...) receives an InputStream that I create with Files.newInputStream(path).

Everything works fine except when path is pointing to a zip file.

In that case, the stream from Files.newInputStream() appears to point to the content of the zip file (the embedded file) rather than to the zip file itself.

Is there any workaround?

EDIT

getMimeType code:

public String getMimeType(InputStream is) throws IOException {
    TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
    Detector detector = tikaConfig.getDetector(); //new DefaultDetector();
    Metadata metadata = new Metadata();
    MediaType mediaType = detector.detect(TikaInputStream.get(is), metadata);
    return mediaType.toString();
}

EDIT 2

I've also tried disabling ZipContainerDetector using this config file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- All detectors except built-in container ones -->
    <detector class="org.apache.tika.detect.DefaultDetector">
      <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    </detector>
  </detectors>
</properties>

but the result is the same.

Jordi
  • Can you post the code that figures out mediaType from the `InputStream`? Replace `tikaService` reference with actual code. – Karol Dowbecki Aug 24 '18 at 08:24
  • I've added `getMimeType` code. – Jordi Aug 24 '18 at 08:28
  • You should have stated your goal. Presumably you want the MIME type of the first file inside the zip (assuming that every call receives a zip with the same structure, containing only files and no folders). If that assumption is right, you should use a [ZipInputStream](https://stackoverflow.com/questions/29515348/getting-specific-file-from-zipinputstream), find the desired file within the zip, and pass that file to Tika. – m4gic Aug 24 '18 at 08:37

1 Answer

Your Apache Tika code uses DefaultDetector, which by default can call ZipContainerDetector if it is available on the classpath. If you don't want ZIP files to be probed for the media type of their content, remove ZipContainerDetector from your configuration.

Files.newInputStream() returns an input stream from a file, nothing more. It doesn't behave differently based on file type or extension.
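One way to check this yourself is to read the first few bytes of a zip file through Files.newInputStream() and compare them with the ZIP local file header signature. The following is a minimal sketch using only the JDK (Java 11+ for readNBytes); the class name ZipStreamCheck and the helpers createSampleZip and firstBytes are hypothetical names made up for this illustration, not part of Tika or of the question's code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipStreamCheck {

    // Hypothetical helper: writes a small ZIP archive with one text entry to a temp file.
    static Path createSampleZip() throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return zip;
    }

    // Hypothetical helper: returns up to n raw bytes from the start of the file.
    static byte[] firstBytes(Path path, int n) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return in.readNBytes(n);
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = createSampleZip();
        byte[] head = firstBytes(zip, 4);
        // The raw stream starts with the ZIP signature bytes 0x50 0x4B 0x03 0x04 ("PK\3\4"),
        // not with the text of the embedded entry.
        System.out.printf("%02X %02X %02X %02X%n", head[0], head[1], head[2], head[3]);
        Files.delete(zip);
    }
}
```

If the bytes are the ZIP signature, the stream is the archive itself, and any "looking inside" happens later, in whichever detector receives the stream.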

Karol Dowbecki
  • I've also tried it using a custom Tika config file. I've added the config file to the post. – Jordi Aug 24 '18 at 08:45
  • @Jordi you know where the problem is: debug the code to find which detector is examining the ZIP file for its content type and remove it. There is no magic here. – Karol Dowbecki Aug 24 '18 at 08:46