
For example, when I'm in the PySpark shell, I might load a file into the Spark context with the following command:

readme = sc.textFile("/home/data/README.md")

Then I can perform actions on this RDD, like the one below, to count the number of lines in the file:

readme.count() 

What I want to know, however, is how I can get a list of all the files I have loaded via sc.textFile into the sc (Spark context).

For example, there are commands like the one below to get all the configuration, but it doesn't list the text files I've loaded.

sc._conf.getAll() 

Is there any way to find all the text files that have been loaded into the Spark context? A listing?

Valentino

1 Answer


SparkContext.textFile does not store anything in the SparkContext. Take a look at the source:

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   * The text files must be encoded as UTF-8.
   *

You can always cache your RDDs in order to keep them in memory. This post explains the caching mechanism.

If you want to keep track of files in your Spark job, spark-submit provides the --files flag to upload files to the execution directories. This is useful for small files that do not change.

With spark-submit --files, your files will be uploaded to this HDFS folder: hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508

application_1449220589084_0508 is an example of a YARN application ID.

In your Spark application, you can find your files in two ways:

1. Find the Spark staging directory with the code below (but you need the HDFS URI and your username):

System.getenv("SPARK_YARN_STAGING_DIR"); 

.sparkStaging/application_1449220589084_0508
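In PySpark, the same lookup can be done through os.environ. A small sketch, assuming SPARK_YARN_CACHE_FILES-style variables are set by YARN as the answer describes; the env parameter and empty-string default are just for illustration and testing outside a cluster:

```python
import os

# Read the YARN staging directory from the environment.
# SPARK_YARN_STAGING_DIR is only set inside a YARN-deployed Spark
# application; the empty-string default guards against running elsewhere.
def staging_dir(env=None):
    env = os.environ if env is None else env
    return env.get("SPARK_YARN_STAGING_DIR", "")
```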

2. Find the complete, comma-separated file paths by using:

System.getenv("SPARK_YARN_CACHE_FILES"); 

hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#spark.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#app.jar,hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt
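Each entry in that comma-separated value has the form hdfs://...#alias, where the part after the # is the local link name. A sketch of splitting it back into plain HDFS paths (the helper name cached_file_paths is hypothetical; the env parameter is just for illustration):

```python
import os

# Split SPARK_YARN_CACHE_FILES into plain HDFS paths, dropping the
# '#alias' suffix (the local link name) from each entry.
def cached_file_paths(env=None):
    env = os.environ if env is None else env
    raw = env.get("SPARK_YARN_CACHE_FILES", "")
    return [entry.split("#", 1)[0] for entry in raw.split(",") if entry]
```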

Fares