How to make sure that spark.write.parquet() writes the data frame on to the file specified using relative path and not in HDFS, on EMR?

Question

A pyspark script that runs perfectly on a local machine and an EC2 is ported on to an EMR for scaling up. There's a config file with relative locations for outputs mentioned.

An example:

Config

feature_outputs= /outputs/features/

File structure:

classifier_setup

     feature_generator.py
     model_execution.py
     config.py
     utils.py
     logs/
     models/
     resources/
     outputs/

Code reads the config, generates features and writes them into the path mentioned above. On EMR, this is getting saved in to the HDFS. (spark.write.parquet writes into the HDFS, on the hand, df.toPandas().to_csv() writes to the relative output path mentioned). The next part of the script, reads the same path mentioned in the config, tries to read the parquet from the mentioned location, and fails.

How to make sure that the outputs are created in the relative that is specified in the code ?
If that's not possible, how can I make sure that I read it from the HDFS in the subsequent steps.

I read these discussions: HDFS paths, another one here, however it's not very clear to me. What can I try next?

score 1 · Answer 1 · answered Dec 23 '22 at 09:51

Short Answer to your question:

Writing using Pandas and Spark are 2 different things. Pandas doesn't utilize Hadoop to process, read and write; it writes into the standard EMR file system, which is not HDFS. On the other hand, Spark utilizes distributed computing for getting things into multiple machines at the same time and it's built on top of Hadoop so by default when you write using Spark it writes into HDFS.
While writing from EMR, you can choose to write either into

EMR local filesystem,
HDFS, or
EMRFS (which is s3 buckets). Refer AWS documentation

If at the end of your job, you are writing using Pandas dataframe and you want to write it into HDFS location (maybe because your next step Spark job is reading from HDFS, or for some reason) you might have to use PyArrow for that, Refer this
If at the end fo your job, you are writing into HDFS using Spark dataframe, in next step you can read it by using hdfs://<feature_outputs> like that to read in next step.

Also while you are saving data into EMR HDFS, you will have to keep in mind that if you are using default EMR storage, it's volatile i.e. all the data will be lost once the EMR goes down i.e. gets terminated, and if you want to keep your data stored in EMR you might have to get an External EBS volume attached to it that can be used in other EMR also or some other storage solution that AWS provides.

The best way is if you are writing your data and you need it to be persisted to write it into S3 instead of EMR.

How to make sure that spark.write.parquet() writes the data frame on to the file specified using relative path and not in HDFS, on EMR?

1 Answers1