During a shuffle, the map tasks write their outputs to local disk, from where they are picked up by the reduce tasks. Where exactly on disk are those files written? I am running a PySpark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (in decreasing order of likelihood):

1. hadoop/spark/tmp. As per the documentation, this is the LOCAL_DIRS env variable that gets defined by YARN. However, after starting the cluster (I am passing --master yarn) I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which according to the documentation should only happen in case of Mesos or standalone (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp. (See the sketch after this list for how I check the env variables.)
2. tmp. The default value of spark.local.dir.
3. /home/username. I have tried passing a custom value to spark.local.dir when starting pyspark using --conf spark.local.dir=/home/username.
4. hadoop/yarn/nm-local-dir. This is the value of the yarn.nodemanager.local-dirs property in yarn-site.xml.
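For completeness, this is roughly how I inspect those environment variables from inside the session. The executor-side check via a throwaway job is only a sketch of my own (it assumes the active SparkSession is available as spark, as in the code further below, and that the function name executor_local_dirs is just an illustrative helper):

import os

# Environment variables of the driver container
print({k: v for k, v in os.environ.items() if "LOCAL_DIR" in k})

# Environment variables of the executor containers, gathered through a small dummy job
def executor_local_dirs(_):
    return [{k: v for k, v in os.environ.items() if "LOCAL_DIR" in k}]

print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(executor_local_dirs).collect())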
I am running the following code and checking whether any intermediate files get created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import StorageLevel

# Read the two datasets and join them on product_id
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products, df_sales.product_id == df_products.product_id, 'inner')

# Persist the joined result to disk only, then force materialisation
df_merged.persist(StorageLevel.DISK_ONLY)
df_merged.count()
No files are being created at any of the 4 locations that I have listed above.
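Since short-lived shuffle files could disappear before I navigate to the directories, I also tried listing them from inside an executor task while the data is persisted. This is only a sketch: the absolute paths below are my guesses at the candidate locations, and it assumes shuffle output and DISK_ONLY blocks typically land in blockmgr-* / spark-* subdirectories under the local dirs:

import glob
import os

# Placeholder paths; replace with the actual absolute locations on the worker nodes
candidate_roots = [
    "/hadoop/spark/tmp",
    "/tmp",
    "/home/username",
    "/hadoop/yarn/nm-local-dir",
]

def find_spark_dirs(_):
    # Look for the blockmgr-* and spark-* directories Spark usually creates under its local dirs
    # (recursive scan; may be slow on large directory trees)
    found = []
    for root in candidate_roots:
        for pattern in ("blockmgr-*", "spark-*"):
            found.extend(glob.glob(os.path.join(root, "**", pattern), recursive=True))
    return found

# Run the scan on the executors rather than on the driver
print(spark.sparkContext.parallelize(range(4), 4).mapPartitions(find_spark_dirs).distinct().collect())

Running the scan inside a task means it looks at the same filesystem the executors write to, rather than whatever node I happen to be logged into.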
As suggested in one of the answers, I have tried getting the directory info in the terminal in the following way:

- At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO

This did not help. The following is a screenshot of my terminal with logging set to INFO.
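Separately, I have also dumped the effective configuration of the running session to see which directory setting actually takes effect (a minimal check; spark.local.dir may simply not appear here if YARN is the one supplying the directories):

# Print every configuration entry that looks directory-related
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "dir" in key.lower():
        print(key, "=", value)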
Where are the Spark intermediate files (output of mappers, persisted blocks, etc.) stored?