How can I use SparkContext (to create SparkSession or Cassandra sessions) on executors? If I pass it as a parameter to foreach or foreachPartition, it will have a null value. Should I create a new SparkContext in each executor?
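For illustration, here is a simplified, self-contained version of the pattern I mean (the class name, app name, and file paths are placeholders I made up); the closure passed to foreach runs on the executors, which is where the driver-side context stops being usable:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ExecutorContextSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("xml-dump-sketch").getOrCreate();
        JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

        List<String> dumpFiles = Arrays.asList("/dump/a.xml", "/dump/b.xml"); // placeholder paths
        JavaRDD<String> dumpFilesRDD = sparkContext.parallelize(dumpFiles, 2);

        // This closure runs on the executors. The driver-side spark / sparkContext
        // variables cannot be used inside it (as described above, a passed-in
        // context ends up null), so spark.sql(...) is not available here.
        dumpFilesRDD.foreach(dumpFilePath -> {
            System.out.println("would parse " + dumpFilePath);
        });

        spark.stop();
    }
}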
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePaths -> dumpFilePaths.forEachRemaining(dumpFilePath -> parse(dumpFilePath)))
In parse(), every XML file is validated, parsed and inserted into several tables using Spark SQL. Only valid XML files will produce objects of the same type that can be saved. A portion of the data needs to be replaced with other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the parse function to use sparkContext.sql().
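To make the intent concrete, here is roughly what parse() would need to do per file if a usable session were available inside it; the XML handling is elided, and the table and column names (key_mapping, old_key, new_key, parsed_records_sketch) are placeholders I made up:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParseSketch {

    // Sketch of the per-file logic: validate/parse the XML, replace part of the
    // data with keys looked up via Spark SQL, then insert into the target tables.
    static void parse(SparkSession spark, String dumpFilePath) {
        // Placeholder for validating and parsing the XML file at dumpFilePath.
        String originalKey = "key-extracted-from-" + dumpFilePath;

        // This is the step that needs a session on the executor: look up the
        // replacement key through Spark SQL.
        Dataset<Row> replacement = spark.sql(
                "SELECT new_key FROM key_mapping WHERE old_key = '" + originalKey + "'");

        // Placeholder for inserting the validated, re-keyed record into one of the tables.
        replacement.write().mode("append").saveAsTable("parsed_records_sketch");
    }
}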