A PySpark script that runs perfectly on a local machine and on an EC2 instance has been ported to an EMR cluster for scaling up. A config file specifies relative locations for the outputs.
An example:
Config:

```
feature_outputs = /outputs/features/
```
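In config.py this is just a module-level constant (simplified):

```python
# config.py (simplified)
feature_outputs = "/outputs/features/"
```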
File structure:

```
classifier_setup/
    feature_generator.py
    model_execution.py
    config.py
    utils.py
    logs/
    models/
    resources/
    outputs/
```
The code reads the config, generates features, and writes them to the path mentioned above. On EMR, the output is saved to HDFS: spark.write.parquet writes to HDFS, whereas df.toPandas().to_csv() writes to the relative output path on the local filesystem. The next part of the script reads the same path from the config, tries to load the parquet from that location, and fails.
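For reference, the failing flow looks roughly like this (the schema and names are simplified stand-ins for the real code):

```python
# Simplified sketch of the failing flow.
from pyspark.sql import SparkSession

import config

spark = SparkSession.builder.appName("feature_generation").getOrCreate()
features_df = spark.createDataFrame([(1, 0.5), (2, 0.7)], ["id", "score"])

# Shows where unqualified paths resolve on this cluster:
print(spark._jsc.hadoopConfiguration().get("fs.defaultFS"))  # hdfs://... on EMR

# Unqualified path -> resolves against fs.defaultFS, i.e. HDFS on EMR:
features_df.write.parquet(config.feature_outputs, mode="overwrite")

# pandas goes through the driver's local filesystem instead:
features_df.toPandas().to_csv("outputs/features/sample.csv")

# This read of the same config path is the step that fails in my run:
df = spark.read.parquet(config.feature_outputs)
```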
- How do I make sure that the outputs are created at the relative path that is specified in the config?
- If that's not possible, how can I make sure that the subsequent steps read them from HDFS? (A sketch of what I'm considering follows this list.)
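What I'm currently thinking of trying is to qualify every path with an explicit filesystem scheme so that the write and the read resolve identically. This is only a sketch of the idea, not something I've confirmed works on EMR, and the /home/hadoop prefix is a guess at where the project lives on the master node:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
features_df = spark.range(5)  # placeholder for the real features

# Option 1: a fully qualified HDFS path, used for both write and read,
# so every step targets the same location regardless of defaults.
hdfs_path = "hdfs:///outputs/features/"
features_df.write.parquet(hdfs_path, mode="overwrite")
df = spark.read.parquet(hdfs_path)

# Option 2: force the driver-local filesystem with file:// -- this only
# behaves like the EC2 setup when everything runs on a single node; on a
# real cluster each executor would write its partitions to its own disk.
local_path = "file:///home/hadoop/classifier_setup/outputs/features/"
features_df.write.parquet(local_path, mode="overwrite")
```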
I have read these discussions: HDFS paths, and another one here, but it's still not clear to me. What can I try next?