test-on Spark without internet
I am using the Tika library to parse documents stored in a Hadoop cluster.
I am using the following code:
import tika
import urllib3
from tika import parser
data = parser.from_file("hdfs://localhost:50070/user/sample.txt")
On Linux, if I give a local path, Tika
is able to parse the file, but for the HDFS path I get a
Spark I/O error: No such file or directory.
Any leads/alternatives would be really helpful.
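
One direction that might be worth trying, as a minimal sketch only: read the file's bytes over WebHDFS and pass them to Tika's parser.from_buffer instead of passing an hdfs:// path to parser.from_file. This assumes WebHDFS is enabled on the NameNode and that 50070 is its HTTP port (the default NameNode web port in Hadoop 2.x). The helper names webhdfs_open_url and parse_hdfs_file are hypothetical, made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen


def webhdfs_open_url(host, port, hdfs_path):
    """Build a WebHDFS OPEN URL for an absolute HDFS path (hypothetical helper)."""
    # Assumption: WebHDFS is served on the NameNode HTTP port under /webhdfs/v1.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?op=OPEN"


def parse_hdfs_file(host, port, hdfs_path):
    """Fetch a file over WebHDFS and hand its bytes to Tika (hypothetical helper)."""
    # Imported here so the URL helper can be used where tika is not installed.
    from tika import parser

    # urlopen follows the NameNode's redirect to the DataNode that serves the data.
    with urlopen(webhdfs_open_url(host, port, hdfs_path)) as response:
        raw = response.read()

    # from_buffer parses in-memory content, so no local file path is needed.
    return parser.from_buffer(raw)
```

With the path from the question, the call would look like data = parse_hdfs_file("localhost", 50070, "/user/sample.txt"). If the cluster requires authentication or uses a different port, the URL would need to be adjusted accordingly.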