14

I'm using Tika from Python, and I noticed that the server jar file is downloaded again and placed in the Temp folder every time I run my code:

Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.

The problem is that the jar file size is around 60MB, which takes some time to download.

This is the code I'm using:

from tika import parser

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']

The only workaround I found is this:

1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx

2 - Using tika.TikaClientOnly = True

3 - Replacing parser.from_file(path) with parser.from_file(path, '/path/to/server')
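For reference, the three steps above could be sketched roughly like this. It assumes the jar is already on disk and `java` is on the PATH. The helper names (`build_server_command`, `start_server`, `parse_with_running_server`) and the `http://localhost:<port>` endpoint are only illustrative and are not part of the tika package.

```python
import subprocess


def build_server_command(jar_path, port):
    """Return the command line from step 1 as an argument list."""
    return ["java", "-jar", str(jar_path), "--port", str(port)]


def start_server(jar_path, port):
    """Launch the Tika server in the background (requires Java on PATH)."""
    return subprocess.Popen(build_server_command(jar_path, port))


def parse_with_running_server(path, port):
    """Steps 2 and 3: use tika purely as a client of the running server."""
    import tika  # imported lazily so the helpers above work without tika installed
    tika.TikaClientOnly = True
    from tika import parser
    # Assumption: the server started above listens on localhost at this port.
    parsed = parser.from_file(path, "http://localhost:{}".format(port))
    return parsed["content"]
```

The server needs a moment to start before the first request, so a real script would probably wait or retry before calling `parse_with_running_server`.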

But I don't want to run the jar file manually. It would be better if I could have Python start the jar automatically and set up tika to use it, without downloading it again each time.

Michael Fish

5 Answers

2

To resolve this problem, add an environment variable named TIKA_SERVER_JAR and set it to the path of the folder that contains the Tika server jar file:

TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'

1

If you don't want to add an environment variable, you can change the directory where tika looks for the tika-server.jar file with the code below.

from tika import tika
tika.TikaJarPath = r'TIKA_SERVER_PATH'

In TIKA_SERVER_PATH, the jar file must be named tika-server.jar (the name must not include the version), and the matching .md5 file must be there as well. If the .md5 file does not match tika-server.jar, this method doesn't work: tika will delete your file and download the default version.
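As a rough sketch of the naming requirement, a downloaded versioned jar can be copied into the target folder under the unversioned name. The function name `install_unversioned_jar` and the example paths are hypothetical, not part of tika.

```python
import shutil
from pathlib import Path


def install_unversioned_jar(versioned_jar, target_dir):
    """Copy e.g. tika-server-1.19.jar to <target_dir>/tika-server.jar."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / "tika-server.jar"
    shutil.copyfile(versioned_jar, destination)
    return destination
```

The matching tika-server.jar.md5 file still has to be placed next to the copied jar.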

bolbol
1

Here is what worked for me:

import os

os.environ['TIKA_SERVER_JAR'] = "<path_to_jar_and_md5>/tika-server.jar"
os.environ['TIKA_PATH'] = "<path_to_jar_and_md5_again>"

These are read when the library is imported, so import the parser after setting them, and re-import it if you change them.
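To make the ordering explicit, the two assignments can be wrapped in a small helper that is called before anything from tika is imported. This is only a sketch; the name `point_tika_at_local_jar` is made up for illustration.

```python
import os
from pathlib import Path


def point_tika_at_local_jar(jar_dir):
    """Set both variables; must run before tika is imported."""
    jar_dir = Path(jar_dir)
    os.environ["TIKA_SERVER_JAR"] = str(jar_dir / "tika-server.jar")
    os.environ["TIKA_PATH"] = str(jar_dir)


# Intended order of use:
#   point_tika_at_local_jar("<path_to_jar_and_md5>")
#   from tika import parser
```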

Orysza
0

After trying almost everything and debugging the tika.py library code, I found that you must set both of these variables for this hack to work:

TIKA_SERVER_JAR="/path_to_tika_server/tika-server.jar"
TIKA_PATH="/path_to_tika_server"

You also need to provide a .md5 signature file, because since Tika version 1.18 no .md5 file is provided (a sha512 signature is provided instead, see https://archive.apache.org/dist/tika/). So you need to trick the library into accepting your downloaded file.

Or someone could just patch the Python library :)
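One way to produce the missing .md5 file is to compute it locally. The sketch below assumes the library compares the contents of tika-server.jar.md5 against the MD5 hex digest of the jar, which is worth verifying against the tika.py version you have installed. The function name `write_md5_sidecar` is hypothetical.

```python
import hashlib
from pathlib import Path


def write_md5_sidecar(jar_path):
    """Write <jar_path>.md5 containing the jar's MD5 hex digest."""
    jar_path = Path(jar_path)
    digest = hashlib.md5()
    with jar_path.open("rb") as jar_file:
        # Read in 1 MiB chunks so a ~60MB jar is not loaded at once.
        for chunk in iter(lambda: jar_file.read(1 << 20), b""):
            digest.update(chunk)
    md5_path = jar_path.with_name(jar_path.name + ".md5")
    # No trailing newline, on the assumption of an exact string comparison.
    md5_path.write_text(digest.hexdigest())
    return md5_path
```

Calling `write_md5_sidecar("/path_to_tika_server/tika-server.jar")` would leave tika-server.jar.md5 next to the jar.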

0

I am wondering how to get the .md5 file for tika-server.jar, since no .md5 file is provided and a sha512 signature is provided instead.

lmllmllml
  • You can find them here, inside the folder for the required version number: https://repo1.maven.org/maven2/org/apache/tika/tika-server/ – Orysza Feb 18 '22 at 11:42