Your suspicion that the process is already running is correct. If tika is left running in the background, your script reuses that server instead of restarting the Java process with the new flag, so the heap size doesn't increase.
To solve this, we can do everything in Python on Windows with the help of psutil:
from typing import Optional

import psutil
from tika import tika as tika_server
from tika import parser


def get_tika_process() -> Optional[psutil.Process]:
    """Return the first running java process whose command line mentions tika."""
    for process in psutil.process_iter(["name", "cmdline"]):
        try:
            if "java" in process.name():
                for part in process.cmdline():
                    if "tika" in part:
                        return process
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            # Skip processes we can't inspect or that exited mid-scan.
            continue
    return None


# Stop any tika server left over from an earlier run so the new args take effect.
if existing_tika_process := get_tika_process():
    print("Found tika process:", existing_tika_process)
    print("Existing process args:", existing_tika_process.cmdline())
    existing_tika_process.terminate()
    terminate_result = existing_tika_process.wait(10)  # Wait up to 10 seconds
    print(f"Terminated tika; exit code {terminate_result}")
else:
    print("No existing tika process found")

tika_server.TikaJavaArgs += "-Xmx1G"  # See note {1}
parsed = parser.from_file("spam.txt")  # Starts the server if it isn't running
print("Tika server started")

new_tika_process = get_tika_process()
if new_tika_process:
    print("New process args:", new_tika_process.cmdline())

print(parsed["metadata"])
print(parsed["content"])
{1} I'm appending directly to tika_server.TikaJavaArgs because the environment variable is parsed when tika_server is imported. You can set the environment variable instead if you delay the import (as in the first attempt in the question).
Result:
(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
No existing tika process found
2021-10-22 22:50:04,476 [MainThread ] [WARNI] Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '54', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!
(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
Found tika process: psutil.Process(pid=11244, name='java.exe', status='running', started='22:50:04')
Existing process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
Terminated tika; exit code 15
2021-10-22 22:54:40,016 [MainThread ] [WARNI] Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-Xmx1G', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '55', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!
(venv) PS E:\DevProjects\stack-exchange-answers\69637621>
You can definitely improve this (for instance, by checking whether the existing process already has the same args and skipping the termination if it does), but this should at least get you going again.
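The argument check mentioned above could look something like the following minimal sketch. `needs_restart` is a hypothetical helper (not part of tika or psutil); it only compares a command line, as returned by `process.cmdline()`, against the JVM flags you want, and assumes exact string matches are good enough.

```python
from typing import Iterable, Sequence


def needs_restart(cmdline: Sequence[str], required_args: Iterable[str]) -> bool:
    """Return True if any required JVM flag is missing from the command line."""
    present = set(cmdline)
    return any(arg not in present for arg in required_args)


# Command lines shaped like the ones printed in the output above
# (jar path shortened for the example).
old = ["java", "-cp", "tika-server.jar", "org.apache.tika.server.TikaServerCli"]
new = ["java", "-Xmx1G", "-cp", "tika-server.jar", "org.apache.tika.server.TikaServerCli"]

print(needs_restart(old, ["-Xmx1G"]))  # True: flag missing, so terminate and restart
print(needs_restart(new, ["-Xmx1G"]))  # False: flag already present, leave it running
```

In the script you would call it as `needs_restart(existing_tika_process.cmdline(), ["-Xmx1G"])` and only terminate the process when it returns True.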
Additionally, you should look into adding a call to tika.tika.killServer() at the end of your script to stop the server when you're done with it.
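One way to make sure the server is stopped even if parsing raises is a small context manager. This is only a sketch: `stop_when_done` is a hypothetical helper, and the `stop` callable is whatever cleanup you pass in (for example `tika_server.killServer`, using the import from the script above).

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def stop_when_done(stop: Callable[[], None]) -> Iterator[None]:
    """Run the with-block, then always call stop(), even after an exception."""
    try:
        yield
    finally:
        stop()


# Stand-in cleanup so the example runs without a Tika server;
# in the real script you would pass tika_server.killServer instead.
calls = []
with stop_when_done(lambda: calls.append("stopped")):
    calls.append("parsing")

print(calls)  # ['parsing', 'stopped']
```

Wrapping the `parser.from_file(...)` call and the prints in such a `with` block keeps the cleanup in one place instead of relying on the script reaching its last line.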