I am using PySpark on Spark 1.3 on a YARN cluster. I run PySpark from the management node to build a DataFrame, and then I try to save that DataFrame as a single CSV file on the management node, but I cannot find the file afterwards. Specifying my home directory in the save path doesn't seem to work either (I show roughly what I tried after the log excerpt below), and the log messages consistently suggest the file was completed on another node in the cluster. Yet I have searched every node without finding the .csv file; even so, if I run the .save command a second time it complains that it cannot append, which leads me to believe the file was in fact created somewhere. Here is part of the log output after the CSV save command is executed:
INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 6) in 2646 ms on <*child node DNS server*> (1/1)
INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 4 finished: saveAsTextFile at package.scala:169, took 2.715508 s
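To clarify, "specifying my home directory" means I tried variations roughly like the following (using the same Minutes DataFrame shown further down; the paths are placeholders for my actual user and directory, not the literal strings I typed):

Minutes.save('file:///home/<my user>/minutes.csv', 'com.databricks.spark.csv')   # explicit local-filesystem URI
Minutes.save('/home/<my user>/minutes.csv', 'com.databricks.spark.csv')          # bare absolute path

Neither of these left a file I could find under my home directory on the management node.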
I am launching PySpark from the command line with:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then I execute the following in PySpark to create the DataFrame and attempt the export:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

# Load the Parquet data from HDFS and register it as a temp table
smData = sqlContext.parquetFile("hdfs://<MGMT NODE IP and Folder directory>")
smData.registerTempTable("temp")

# Select just the columns and rows I want to export
Minutes = sqlContext.sql("Select alt,tail From temp Where year = 2015 And month = 9 And day = 16 and asa is not null and CAST(alt as int) > 3046")

# Save as CSV via the spark-csv data source
Minutes.save('minutes.csv','com.databricks.spark.csv')
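If it matters, the end result I'm after is one ordinary CSV file sitting on the management node. A fallback I'm considering (but would rather avoid) is collecting the result to the driver and writing the file myself in plain Python, roughly like this sketch (the path is a placeholder, and it assumes the result fits in driver memory):

import csv

rows = Minutes.collect()  # pull all matching rows back to the driver on the management node
with open('/home/<my user>/minutes.csv', 'wb') as f:  # Python 2, so binary mode for the csv module
    writer = csv.writer(f)
    writer.writerow(['alt', 'tail'])  # header row matching the SELECT
    for r in rows:
        writer.writerow([r.alt, r.tail])

But I would still like to understand where the spark-csv output from the .save call above is actually ending up.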