Apache Tika not able to parse HDFS files

Question

I am using the Tika library to parse documents stored in a Hadoop cluster.

I am using the following code:

import tika
import urllib3
from tika import parser

data = parser.from_file("hdfs://localhost:50070/user/sample.txt")

On Linux, if I give a local path, Tika parses the file, but for the HDFS path I get:

Spark I/O error: No such file or directory.

Any leads/alternatives would be really helpful.

Mobin Ranjbar · Answer 1 · 2018-03-14T14:25:10.447


The Tika Python module does not support reading from HDFS; I checked the source code, and parser.from_file is a Python implementation that does not work with HDFS. Instead, add the Tika jar to pyspark or spark-shell with one of the commands below, and check the Tika usage documentation to see how to parse a file through the Java API:

./pyspark --jars /path/to/your/local/tika/jar/file

or

./spark-shell --jars /path/to/your/local/tika/jar/file

Note that the port for reading data from HDFS is 9000 or 8020, not 50070. Port 50070 is the NameNode web UI port, whereas hdfs:// URIs must use the NameNode RPC port.
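A minimal sketch of what the py4j route could look like from inside a pyspark shell started with --jars, assuming the jar on the classpath contains Tika's `org.apache.tika.Tika` facade class and that `sc` is the shell's SparkContext. The function name `tika_parse_hdfs` is made up for illustration, and `sc._jvm` and `sc._jsc` are private PySpark attributes, so treat this as a starting point rather than a supported API:

```python
def tika_parse_hdfs(sc, uri):
    """Extract text from an HDFS file with Tika's Java API via PySpark's py4j gateway."""
    jvm = sc._jvm  # private attribute: py4j view of the driver JVM
    path = jvm.org.apache.hadoop.fs.Path(uri)
    # Resolve the file system from the URI using Spark's Hadoop configuration.
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    stream = fs.open(path)
    try:
        # Tika.parseToString(InputStream) detects the format and returns plain text.
        return jvm.org.apache.tika.Tika().parseToString(stream)
    finally:
        # Make sure the HDFS stream is released even if parsing fails.
        stream.close()


# Usage inside the pyspark shell (RPC port, not 50070):
#   text = tika_parse_hdfs(sc, "hdfs://localhost:8020/user/sample.txt")
```

The parsing happens entirely in the JVM here, so the Python `tika` module and its `parser.from_file` are not involved at all.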

edited Mar 14 '18 at 14:25

answered Mar 13 '18 at 13:14

Mobin Ranjbar


I have tried both the ports. Getting the same error. Not sure whether Tika can parse HDFS files. – Sugandha Mishra Mar 14 '18 at 05:13
@SugandhaMishra Tika python module does not support reading from HDFS as I checked the source code. You should add tika jar to pyspark with `pyspark --jars /path/to/your/jar` commands and use it in Spark + HDFS framework. Test it out. – Mobin Ranjbar Mar 14 '18 at 06:00
Thanks for the quick response. Here is what I did: the Tika jar is located on the local Unix machine in usr/share/java. From there I started the pyspark shell with the jar: pyspark --jars /usr/share/java/tika-server-1.16.jar. Inside the shell I ran the code above, but the output I get is 'No file found on the server'. My question is: since I am giving the Tika parser an HDFS path ('hdfs://localhost:50070/..'), would I need to give the HDFS path of the Tika jar when using pyspark --jars /path/to/your/jar? Any pointers would be really helpful. – Sugandha Mishra Mar 14 '18 at 13:02
@SugandhaMishra No, you do not need to put the jar files in HDFS. It works with a local path. When does the error ("No file found on the server") happen? I need more details. If you run pyspark with the --jars switch, you should use the Java class through py4j (see the example here: https://stackoverflow.com/questions/33544105/running-custom-java-class-in-pyspark). If you can use spark-shell, you just import org.apache.tika._ and use it. – Mobin Ranjbar Mar 14 '18 at 14:22
Thanks for the insight. Firstly, the error occurs on the following line: data = parser.from_file("hdfs://localhost_or_ip:9000/....") Note that when I ran the same code to parse files on the local Unix machine rather than on the cluster (with path = "home/filename.txt"), it ran perfectly fine. It seems that it doesn't read HDFS paths, as you confirmed above. I tried your suggestion of entering the pyspark shell (pyspark with the --jars switch) and using the class through py4j. In the shell I am able to import org.apache.tika.parser, but when I use parser.from_file, it throws an error saying parser is not defined. – Sugandha Mishra Mar 15 '18 at 15:06
I have been stuck at this stage. Will post a screenshot shortly. – Sugandha Mishra Mar 15 '18 at 15:08
@SugandhaMishra Note that `org.apache.tika.parser` class does not have any `from_file()` method. It is for tika python module and you have to use java methods. I think using spark-shell is easier than that. – Mobin Ranjbar Mar 15 '18 at 15:11
Yes, I will try that in Scala. The Python tika module has a method parser.from_buffer; instead of a file path it takes a string as an argument. It runs fine on a Unix machine which has internet access. When I tried the same on the cluster, I got this message: "Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.16/tika-server-1.16.jar to /tmp/tika-server.jar." Note that this cluster doesn't have access to the internet. Do you think that if we give the cluster internet access, we would be able to parse documents using this method? – Sugandha Mishra Mar 16 '18 at 08:31
@SugandhaMishra Internet access is not needed. The package 'org.apache.tika' has some dependencies, but it does not throw this kind of error and it does not search the internet. If a dependency were missing, it would throw a NoClassDefFoundError. – Mobin Ranjbar Mar 16 '18 at 08:44
I have attached two screenshots. One is on Unix with internet and the other on a cluster machine without internet. As you can see it is hitting the internet ! – Sugandha Mishra Mar 16 '18 at 09:39
@SugandhaMishra Forget about using tika python module as I said. Read my comments once again. – Mobin Ranjbar Mar 16 '18 at 10:08
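
For readers who still want to stay with the Python module on a cluster without internet access, here is a hedged sketch of one possible route. It assumes that a Tika server has been started by hand from the local jar (for example `java -jar /usr/share/java/tika-server-1.16.jar`, which listens on port 9998 by default), that the `hdfs` command-line client is on the PATH, and that the installed tika module honours the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables described in its README (older releases may not). The helper names are made up for illustration:

```python
import os
import subprocess


def use_running_tika_server(endpoint="http://localhost:9998", env=None):
    """Configure the tika module as a REST client of an already running server."""
    env = os.environ if env is None else env
    # tika reads these variables when it is imported, so call this first.
    env["TIKA_CLIENT_ONLY"] = "True"
    env["TIKA_SERVER_ENDPOINT"] = endpoint
    return env


def hdfs_cat(uri):
    """Return the raw bytes of an HDFS file by shelling out to the Hadoop CLI."""
    result = subprocess.run(
        ["hdfs", "dfs", "-cat", uri], check=True, stdout=subprocess.PIPE
    )
    return result.stdout


def parse_hdfs_file(uri, read_bytes=hdfs_cat, parse_buffer=None):
    """Read a file from HDFS and hand its bytes to Tika's from_buffer."""
    if parse_buffer is None:
        # Imported lazily so the environment can be configured beforehand.
        from tika import parser
        parse_buffer = parser.from_buffer
    return parse_buffer(read_bytes(uri))


# Usage (RPC port, not 50070):
#   use_running_tika_server()
#   parsed = parse_hdfs_file("hdfs://localhost:8020/user/sample.txt")
#   print(parsed.get("content"))
```

Because the server is already running, the module has no reason to fetch tika-server.jar from Maven, which is the step that fails on the offline cluster.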
