
I am trying to read a Parquet file stored in HDFS, using Python. I have the code below, but it does not open the file in HDFS. Can you help me change the code to do this?

sc = spark.sparkContext

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')

Also, I am looking to save the DataFrame as a CSV file as well.

Steve McAffer

1 Answer


Have a try with

sqlContext.read.parquet("hdfs://<host:port>/path-to-file/commentClusters.parquet")

To find out the host and port, look for the file core-site.xml (e.g. $HADOOP_HOME/etc/hadoop/core-site.xml) and read the value of the XML property fs.defaultFS.
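That lookup can be scripted with Python's standard library. A minimal sketch — the XML content below is a stand-in for your real core-site.xml, and the namenode host and port in it are hypothetical:

```python
import xml.etree.ElementTree as ET

# Stand-in for $HADOOP_HOME/etc/hadoop/core-site.xml; the host and port
# here are made up -- read your actual file in practice.
CORE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

def default_fs(xml_text):
    """Return the fs.defaultFS value from core-site.xml content, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    return None

fs = default_fs(CORE_SITE_XML)
print(fs)  # hdfs://namenode.example.com:8020
# Prefix your path with it to get the full URI for read.parquet:
print(fs + "/path-to-file/commentClusters.parquet")
```

In real use you would read the file from disk (`ET.parse(path).getroot()`) instead of embedding the XML as a string.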

To make it simple, try

sqlContext.read.parquet("hdfs:///path-to-file/commentClusters.parquet")

or

sqlContext.read.parquet("hdfs:/path-to-file/commentClusters.parquet")

See also: Cannot Read a file from HDFS using Spark

To save as CSV, try

df_result.write.csv(path=res_path)  # optional arguments: header=True, compression='gzip'
Glacier