I need to read, scan, and write files to/from HDFS from within a PySpark worker.
Note that the following APIs are not applicable, since they run on the driver:
sc.textFile()
sc.saveAsParquetFile()
etc.
I would very much prefer not to involve additional third-party libraries (e.g. pyhadoop).
One option is to shell out, e.g.:

import os
os.system('hdfs dfs -ls %(hdfsPath)s' % locals())
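
For reference, here is a rough sketch of how I picture the shell-out working from inside the workers via mapPartitions. It assumes the hdfs CLI is installed on every worker node, and worker_paths is just a placeholder list of HDFS paths (one or more per partition):

import subprocess

def ls_from_worker(path_iter):
    # Runs on each worker; shells out to the 'hdfs' CLI installed on the node.
    for hdfs_path in path_iter:
        listing = subprocess.check_output(['hdfs', 'dfs', '-ls', hdfs_path])
        yield (hdfs_path, listing)

# 'worker_paths' is a placeholder list of per-worker HDFS paths.
results = (sc.parallelize(worker_paths, len(worker_paths))
             .mapPartitions(ls_from_worker)
             .collect())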
But is there a more native PySpark way to achieve this?
UPDATE: This is not a case of broadcasting data, because each worker will read different data from HDFS. One use case is reading a few large binary files in each worker (this is clearly not a case for broadcast). Another is reading a "command" file containing instructions. I have successfully used this pattern in native Hadoop and in Scala Spark.
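
To make the use case concrete, this is roughly the pattern I have in mind (again only a sketch: it assumes the hdfs CLI on every node, and binary_paths is a placeholder for the per-worker file paths):

import subprocess

def read_from_worker(path_iter):
    # Each worker cats its own file(s) from HDFS into memory as raw bytes.
    for hdfs_path in path_iter:
        data = subprocess.check_output(['hdfs', 'dfs', '-cat', hdfs_path])
        yield (hdfs_path, len(data))

sizes = (sc.parallelize(binary_paths, len(binary_paths))
           .mapPartitions(read_from_worker)
           .collect())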