
With reference to this question, I would like to send an MS Word (.doc) file to a Tika application running as a service. How can I do this?

There is this link for running tika: http://mimi.kaktusteam.de/blog-posts/2013/02/running-apache-tika-in-server-mode/

For the Python code that accesses it, though, I am not sure whether I should use sockets, urllib, or something else.

hmghaly
  • Is there a reason why you're planning to use the Tika App Server, rather than the more fully-featured [Tika JAXRS server](https://wiki.apache.org/tika/TikaJAXRS)? – Gagravarr Oct 14 '13 at 13:37
  • It is basically what I managed to install on my system, as I cannot easily install new things – hmghaly Oct 14 '13 at 13:54

1 Answer

For remote access to Tika, there are basically two methods available. One is the Tika JAXRS Server, which provides a full RESTful interface. The other is the simple Tika-App --server mode, which just works at a network pipe level.

For production use, you'll probably want to use the Tika JAXRS server, as it's more fully featured. For simple testing and getting started, the Tika App in server mode ought to be fine.
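If you do end up on the JAXRS server, access from Python would go over HTTP rather than a raw socket. The following is only a rough sketch: the URL `http://localhost:9998/tika`, the use of `PUT`, and the `Accept` header are assumptions based on the usual Tika server defaults, not something confirmed above, and `tika_rest_extract` is a hypothetical helper name. Check them against the documentation for the version you install.

```python
import urllib.request


def tika_rest_extract(path, url="http://localhost:9998/tika", accept="text/plain"):
    """PUT the file at `path` to a Tika JAXRS endpoint and return the raw response bytes.

    The default URL and the Accept header are assumptions about a typical
    Tika server setup; adjust them to match your own server.
    """
    # Read the document as binary so .doc/.pdf content is sent unmodified
    with open(path, "rb") as f:
        data = f.read()

    # Build an HTTP PUT request carrying the document as the body
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": accept}
    )

    # Send it and return whatever the server replies with
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Calling `tika_rest_extract("test.doc")` should then return the extracted content as bytes, assuming a server is listening at that URL.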

For the Tika App in server mode, just connect to the port that you're running the Tika-App on, stream it your document data, and read the HTML back. For example, in one terminal run

$ java -jar tika-app-1.3.jar --server --port 1234

Then, in another, do

$ nc 127.0.0.1 1234 < test.pdf

You'll then see the HTML generated from your test PDF.

From Python, you just need a simple socket connection that does what netcat does above: send over the binary data, signal that you're done, then read back the result. For example, try something like:

#!/usr/bin/env python3
import socket, sys

# Where to connect
host = '127.0.0.1'
port = 1234

if len(sys.argv) < 2:
  print("Must give filename", file=sys.stderr)
  sys.exit(1)

filename = sys.argv[1]
# Status goes to stderr so stdout holds only Tika's output
print("Sending %s to Tika on port %d" % (filename, port), file=sys.stderr)

# Connect to Tika
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, port))

# Open the file in binary mode and stream it to Tika
with open(filename, 'rb') as f:
  while True:
    chunk = f.read(65536)
    if not chunk:
      # EOF
      break
    s.sendall(chunk)

# Tell Tika we have sent everything
s.shutdown(socket.SHUT_WR)

# Get the response, reading until Tika closes the connection
while True:
  chunk = s.recv(65536)
  if not chunk:
    # EOF
    break
  # Write the raw bytes, so chunk boundaries don't add stray newlines
  sys.stdout.buffer.write(chunk)

s.close()
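
If you'd rather call this from other code than run it as a script, the same send, half-close, then read sequence can be wrapped in a function. This is a minimal sketch, not a definitive implementation: `tika_to_html` is a hypothetical helper name, and the default host and port simply match the example command above.

```python
import socket


def tika_to_html(path, host="127.0.0.1", port=1234, chunk_size=65536):
    """Send the file at `path` to a Tika-App --server port and return the reply as bytes."""
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        # Stream the document to Tika in binary chunks
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sock.sendall(chunk)

        # Half-close the socket so Tika knows the document is complete
        sock.shutdown(socket.SHUT_WR)

        # Collect the response until Tika closes the connection
        parts = []
        while True:
            chunk = sock.recv(chunk_size)
            if not chunk:
                break
            parts.append(chunk)

    return b"".join(parts)
```

With the server from the example running, `tika_to_html("test.doc")` would return the response as bytes. Decoding it, for instance with `.decode("utf-8", errors="replace")`, assumes the server emits UTF-8, which is worth verifying on your setup.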
Gagravarr