Pyspark: How to access XML files from HDFS and read XML files using Pyspark/Python

Asked Dec 06 '18 at 20:25

Active Dec 06 '18 at 22:38

I am trying to read XML files from HDFS using PySpark to create a MapReduce job. Each file is about 2-3 GB, and there are almost 2000 of them. I am running the script in the PySpark shell and creating the RDD by reading the data as a text file:

dataFile = sc.textFile("hdfs:///home/hadoop/pubmed18n0001.xml.gz")

I need to read the files in HDFS as XML files, not as text files. What is the best way to read XML data from HDFS using Python/PySpark?
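For reference, one possible direction is sketched below; it is a rough illustration under stated assumptions, not a tested solution for this dataset. It reads each file as raw bytes with `sc.binaryFiles` and parses it incrementally with the standard library's `xml.etree.ElementTree.iterparse`. The record tag `PubmedArticle` and the `PMID` / `ArticleTitle` fields are assumptions based on the file name, and the helper names `iter_records` and `xml_records_rdd` are hypothetical. Note that `binaryFiles` hands each whole file to a single task, so files of several GB may need generous executor memory. A commonly used alternative is the separate spark-xml package, which reads XML into a DataFrame given a `rowTag` option.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_records(raw_bytes, row_tag="PubmedArticle"):
    """Yield one dict per <row_tag> element in a (possibly gzipped) XML payload.

    row_tag and the extracted fields are assumptions about the file layout.
    """
    stream = io.BytesIO(raw_bytes)
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if raw_bytes[:2] == b"\x1f\x8b":
        stream = gzip.GzipFile(fileobj=stream)
    # iterparse walks the document incrementally instead of building the full tree first.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == row_tag:
            yield {
                "pmid": elem.findtext(".//PMID"),
                "title": elem.findtext(".//ArticleTitle"),
            }
            elem.clear()  # drop the children of the record just emitted


def xml_records_rdd(sc, path, row_tag="PubmedArticle"):
    """Build an RDD of record dicts; each file is parsed whole by one task."""
    return sc.binaryFiles(path).flatMap(lambda kv: iter_records(kv[1], row_tag))
```

In the PySpark shell this could be called as `records = xml_records_rdd(sc, "hdfs:///home/hadoop/pubmed18n0001.xml.gz")`, after which `records` behaves like any other RDD of Python dicts.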

edited Dec 06 '18 at 22:38

RRg

0 Answers