I am using Anaconda, Spark 1.3, and Hadoop. I have stored a list of XML documents in a particular directory in HDFS.
I need to load those XML documents with a Python script and find the duplicate documents using Spark.
Example:
import os
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Sample").setMaster("local[*]")
sc = SparkContext(conf=conf)
dir = sc.textFile("hdfs://XXXXXXX")  # textFile returns an RDD, not a directory path
configfiles = [os.path.join(dirpath, f) for dirpath, dirnames, files in os.walk(dir) for f in files if f.endswith('.xml')]
With this I get the following error:
TypeError: coercing to Unicode: need string or buffer, RDD found
hdfs://xxxxxx MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
I use a Bloom filter to find the duplicates by generating a hash value for each document; that part is not the problem here.
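For context, the duplicate check itself looks roughly like the sketch below (the is_duplicate helper and the plain Python set standing in for the Bloom filter are just for illustration, not my exact code):

import hashlib

def is_duplicate(content, seen_hashes):
    # hash the raw XML text; my real code uses a Bloom filter instead of this set
    digest = hashlib.md5(content.encode('utf-8')).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False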
Accessing locally stored documents works, but I am not able to process the documents stored in HDFS.
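To show what I mean by the local case working, a simplified version of what I do with a local directory is below (the /tmp/xmldocs path is only an example):

import os

local_dir = "/tmp/xmldocs"  # example local path, not my real one
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(local_dir)
               for f in files if f.endswith('.xml')]
for path in configfiles:
    with open(path) as fh:
        content = fh.read()  # each document is then hashed and checked for duplicates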
Could anyone please help me to fix this issue?
Thanks in advance