I'm trying to run an external program (such as bwa) from within PySpark. My code looks like this:
import sys
import subprocess
from pyspark import SparkContext

def bwaRun(args):
    # ref is a module-level global; PySpark ships it to the workers
    # together with this function.
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result

sc = SparkContext(appName='sub')
ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'

chunk_name = []
chunk_name.append(input)
# One partition, so the single chunk is handled by a single task
data = sc.parallelize(chunk_name, 1)
print data.map(bwaRun).collect()
I'm running Spark on a YARN cluster with 6 slave nodes, and bwa is installed on every node. When I run the code, the bwaRun function can't read the input files from HDFS. In hindsight this failure makes sense: running bwa by hand with

bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq

from the shell doesn't work either, because bwa can't read files from HDFS directly. Can anyone give me an idea of how to solve this?
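One workaround I've been thinking about is staging the files onto each worker's local disk inside bwaRun before calling bwa, roughly like the sketch below. This is untested and just my assumption of how it could work; it relies on the hdfs command-line client being on each worker's PATH, and the tempfile scratch directory is only for illustration:

import os
import subprocess
import tempfile

def bwaRun(fastq_hdfs_path):
    # Stage the HDFS files onto the worker's local disk first,
    # since bwa only understands local file paths.
    workdir = tempfile.mkdtemp(prefix='bwa_')

    # The trailing '*' lets the Hadoop shell also pull the bwa index
    # files (.amb/.ann/.bwt/.pac/.sa) sitting next to the fasta.
    subprocess.check_call(['hdfs', 'dfs', '-get', ref + '*', workdir])
    subprocess.check_call(['hdfs', 'dfs', '-get', fastq_hdfs_path, workdir])

    local_ref = os.path.join(workdir, os.path.basename(ref))
    local_fastq = os.path.join(workdir, os.path.basename(fastq_hdfs_path))

    # bwa now sees ordinary local files
    return subprocess.check_output(
        ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', local_ref, local_fastq])

But this re-copies the (potentially large) reference and its index onto a worker for every task, so I'm not sure it's the right approach.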
Thanks in advance!