SparkContext.textFile does not store anything in the SparkContext; it only returns a lazy RDD whose partitions are read from the file when an action is executed. Take a look at the source:
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * The text files must be encoded as UTF-8.
 */
You can always cache your RDDs to keep them in memory. This post explains the caching mechanism.
If you want to ship external files with your Spark job (typically small files that do not change), spark-submit provides the --files flag to upload them to the executors' working directories. For example:

spark-submit --files test_file.txt your-spark-job.jar
When running on YARN, your files are uploaded to the HDFS staging folder, e.g.: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508
(application_1449220589084_0508 is an example YARN application ID.)
In your Spark application, you can find your files in 2 ways:
1- Read the Spark staging directory from the environment (but you still need to prepend the HDFS URI and your username yourself):
System.getenv("SPARK_YARN_STAGING_DIR");
// returns e.g.: .sparkStaging/application_1449220589084_0508
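A minimal sketch (plain Java, no Spark dependency) of assembling the absolute staging path from that relative value; the class name, the helper, and the fsUri/user values are illustrative assumptions, not Spark API:

import java.util.Objects;

public class StagingPath {
    // Build the absolute staging path from the filesystem URI, the user name,
    // and the relative value returned by SPARK_YARN_STAGING_DIR.
    // fsUri and user are assumptions: supply the values for your own cluster.
    static String stagingPath(String fsUri, String user, String relativeStagingDir) {
        Objects.requireNonNull(relativeStagingDir, "SPARK_YARN_STAGING_DIR not set");
        return fsUri + "/user/" + user + "/" + relativeStagingDir;
    }

    public static void main(String[] args) {
        // In a real job the relative directory comes from the environment:
        // String rel = System.getenv("SPARK_YARN_STAGING_DIR");
        String rel = ".sparkStaging/application_1449220589084_0508"; // example value
        System.out.println(stagingPath("hdfs://your-cluster", "your-user", rel));
        // prints: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508
    }
}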
2- Read the complete, comma-separated list of cached file paths:
System.getenv("SPARK_YARN_CACHE_FILES");
// returns one comma-separated string, e.g.:
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#spark.jar,
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#app.jar,
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt
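To make that string usable, you can split it into entries and map each "#link-name" suffix (the name the file gets in the container's working directory) back to its HDFS URI. This is a sketch in plain Java; the class and method names are illustrative, not part of any Spark API:

import java.util.LinkedHashMap;
import java.util.Map;

public class CacheFiles {
    // Parse a SPARK_YARN_CACHE_FILES-style value: entries are comma-separated,
    // and each entry has the form "<hdfs-uri>#<link-name>".
    static Map<String, String> parse(String cacheFiles) {
        Map<String, String> byLinkName = new LinkedHashMap<>();
        for (String entry : cacheFiles.split(",")) {
            String[] parts = entry.split("#", 2);
            String uri = parts[0];
            // Fall back to the bare file name when no "#link" suffix is present.
            String link = parts.length == 2 ? parts[1] : uri.substring(uri.lastIndexOf('/') + 1);
            byLinkName.put(link, uri);
        }
        return byLinkName;
    }

    public static void main(String[] args) {
        // In a real job: String value = System.getenv("SPARK_YARN_CACHE_FILES");
        String value = "hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#app.jar,"
                + "hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt";
        System.out.println(parse(value).get("test_file.txt"));
        // prints: hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt
    }
}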