I am trying to read XML files from HDFS using PySpark to create a MapReduce job. Each XML file is about 2-3 GB, and there are almost 2,000 such files. I am running the script in the PySpark shell and creating the RDD by reading the data as a text file:

dataFile = sc.textFile("hdfs:///home/hadoop/pubmed18n0001.xml.gz")

I need to read the files in HDFS as XML files, not as text files. What is the best way to read XML data from HDFS using Python/PySpark?

RRg

0 Answers