I'm running some tests with the TIKA-app (v1.23) to extract embedded resources from an input file. This works great when I specify the -z parameter on the command line: it enables embedded resource extraction and writes the resources to the working directory. Now I would like to use the same functionality through the TIKA-server. However, I haven't been able to find the correct way to do so in the documentation, and I wonder whether the server variant of TIKA provides this option at all.

So, how can I extract embedded resources using the TIKA-server application? Please note that I'm not looking for the text content of the embedded resources, but for the actual binary file data (I want to separate the attachments from the input file).

TVA van Hesteren

1 Answer

There is a similar function available through Apache Tika Server's /unpack endpoint. If you combine this with the X-Tika-PDFExtractInlineImages header set to true, it does the equivalent processing.

For example:

curl -T test.pdf http://localhost:9998/unpack --header "X-Tika-PDFExtractInlineImages: true" > test.zip

This returns a ZIP file containing all the embedded images.
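If you'd rather script this than shell out to curl, here is a minimal Python sketch of the same request using only the standard library. It assumes the server is running at http://localhost:9998 as in the curl example. The function names (build_unpack_request, extract_zip_bytes, unpack_embedded) are made up for illustration and are not part of Tika. It also assumes that an empty response body means the server found nothing to unpack.

```python
from __future__ import annotations

import io
import urllib.request
import zipfile
from pathlib import Path

# Assumption: a local Tika server on the default port, as in the curl example.
TIKA_UNPACK_URL = "http://localhost:9998/unpack"


def build_unpack_request(data: bytes, url: str = TIKA_UNPACK_URL) -> urllib.request.Request:
    """Build the same PUT request that `curl -T` sends to /unpack."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"X-Tika-PDFExtractInlineImages": "true"},
    )


def extract_zip_bytes(zip_bytes: bytes, out_dir: Path) -> list[str]:
    """Write every member of the ZIP archive into out_dir and return the member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


def unpack_embedded(input_path: Path, out_dir: Path) -> list[str]:
    """Send a file to /unpack and save the returned embedded resources to out_dir."""
    request = build_unpack_request(input_path.read_bytes())
    with urllib.request.urlopen(request) as response:
        body = response.read()
    if not body:
        # Assumption: an empty body means there was nothing to unpack.
        return []
    return extract_zip_bytes(body, out_dir)
```

Calling unpack_embedded(Path("test.pdf"), Path("attachments")) should leave the embedded files in the attachments directory, much like the -z option does for the working directory in the app.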

You can read more about the endpoint here.

Dave Meikle
  • Do you happen to know where I can see a list of the available header options? – Dent7777 May 12 '21 at 19:53
  • There isn't a definitive list, though we could probably build something to generate one. This answer gives the route to finding them all: https://stackoverflow.com/questions/62011038/apache-tika-server-request-header-parameters – Dave Meikle May 20 '21 at 19:57
  • I did manage to track down that thread after posting, and it got me what I needed. I definitely would have benefited from that information being available on the Confluence page. If not a full list, then at least links to the OCR and PDF APIs and instructions for renaming their options for use with Tika-server. – Dent7777 May 21 '21 at 12:19