I want to detect the MIME type of .one, .onetoc, and .onetoc2 files using Apache Tika. However, its documentation (https://tika.apache.org/1.14/formats.html) does not seem to list support for them. When I rely purely on Tika's content-based detection, I always get application/octet-stream instead of application/onenote.

Tika does support extension- and name-based detection, but that is unreliable: I can rename any file to *.one and it will be reported as application/onenote, which is incorrect.

Is there a library that can reliably detect whether a given file is a OneNote file, or is there something I am missing in Tika?

Keshi
  • Are you able to create a handful of small onenote files, that you're happy to put under the Apache License, which could be used for testing? – Gagravarr Dec 22 '16 at 00:58

1 Answer

For mime-magic driven OneNote file detection, you need Apache Tika 1.15 or later.
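
To make "mime-magic driven" concrete: content-based detection compares a file's leading bytes against a known signature instead of trusting its name. Below is a minimal, Tika-free sketch of that idea in Java; it is not Tika's implementation. The class name `OneNoteMagic` and its methods are hypothetical, and the two 16-byte signatures are assumed to be the on-disk form of the file-type GUIDs that the MS-ONESTORE specification gives for .one and .onetoc2 files, so verify them against the specification before relying on this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class OneNoteMagic {
    // Assumed signature for .one files: GUID {7B5C52E4-D88C-4DA7-AEB1-5378D02996D3}
    // as stored on disk (first three GUID fields are little-endian).
    private static final byte[] ONE_MAGIC = hex("E4525C7B8CD8A74DAEB15378D02996D3");

    // Assumed signature for .onetoc2 files: GUID {43FF2FA1-EFD9-4C76-9EE2-10EA5722765F}.
    private static final byte[] ONETOC2_MAGIC = hex("A12FFF43D9EF764C9EE210EA5722765F");

    // Returns true if the first 16 bytes match one of the assumed signatures.
    static boolean looksLikeOneNote(byte[] header) {
        if (header == null || header.length < 16) {
            return false;
        }
        byte[] first16 = Arrays.copyOf(header, 16);
        return Arrays.equals(first16, ONE_MAGIC) || Arrays.equals(first16, ONETOC2_MAGIC);
    }

    // Reads only the first 16 bytes of the file; the file name is never consulted.
    static boolean looksLikeOneNote(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeOneNote(in.readNBytes(16)); // readNBytes(int) needs Java 11+
        }
    }

    // Decodes a hex string into bytes.
    private static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

With a check like this, a text file renamed to notes.one is rejected, because its first 16 bytes match neither signature.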

For OneNote parsing (metadata, text, etc.), you either need to wait for Apache Tika 1.24 to be released (due March-ish 2020), or build it yourself from source, including the patches from GitHub pull request #303 / TIKA-2224.

And if you're a Tika + OneNote user, give a big thanks to Nicholas DiPiazza (who did most of the work) and Tim Allison (who helped review/steer/etc.).

Gagravarr
  • Thanks! I'd be happy to contribute - let me try pushing in a few files. – Keshi Dec 22 '16 at 03:30
  • I have gotten a lot of work done towards the parser. But I need some help getting it to parse 2 key things: 1) text sections - https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document and 2) embedded files. here is the github repo https://github.com/nddipiazza/onenote-parser-java – Nicholas DiPiazza Nov 24 '19 at 17:07
  • Fixed the issues mentioned above. It has been added to Tika now. – Nicholas DiPiazza Feb 16 '20 at 14:25