
I'm trying to run an external program (such as bwa) within PySpark. My code looks like this.

import sys
import subprocess
from pyspark import SparkContext

def bwaRun(args):
        a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
        result = subprocess.check_output(a)
        return result

sc = SparkContext(appName = 'sub')

ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'

input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'

chunk_name = []

chunk_name.append(input)

data = sc.parallelize(chunk_name,1)

print data.map(bwaRun).collect()

I'm running Spark on a YARN cluster with 6 slave nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. It's kind of obvious this doesn't work, because when I tried to run the bwa program locally by giving

bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq

on the shell, it didn't work because bwa can't read files from HDFS. Can anyone give me an idea of how I could solve this?

Thanks in advance!

  • What exactly do you expect here? If the program you use cannot read from HDFS (or STDIN) you'll have to either use a POSIX-compliant file system, expose HDFS in a way that is accessible to your code (mount with FUSE?), or simply copy to local. On a side note, it doesn't look like a good use case for Spark. You want to execute a long-running and computationally heavy process which won't benefit from the Spark architecture at all. – zero323 Jul 29 '16 at 09:31
  • My purpose is to check whether I can run an external program in pyspark. For example, data.map(external_program). But this external program takes an input file name as an argument. That's why I parallelized the input file names located on HDFS. However, the external program obviously can't read files from HDFS. – jmoa Jul 29 '16 at 10:17
  • You can. You just have to follow exactly the same rules as if it were executed without Spark. – zero323 Jul 29 '16 at 10:24
  • Without Spark, the external program reads input files from a local directory, but I want it to read inputs from HDFS. Is it even possible? Thanks. – jmoa Jul 29 '16 at 12:03
  • I think you could parallelize something to run your bwa on each node, but it would be running in the context of a subprocess, so it wouldn't have access to HDFS. A workaround would be to first write pieces of the data from HDFS to the local fs on each node, then call the external application via subprocess with the path to that piece (sketched below). If your bwa can pipe to stdout then you could pipe over ssh back to HDFS like this: http://stackoverflow.com/questions/11270509/putting-a-remote-file-into-hadoop-without-copying-it-to-local-disk – Davos May 11 '17 at 02:13

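To make the comments concrete, here is a minimal sketch of the copy-to-local workaround, assuming the hdfs command-line client and bwa are installed on every worker node and that the bwa index files sit next to the reference fasta in HDFS. The fetch_to_local helper, the scratch-directory handling, and the renamed bwaRunLocal function are illustrative rather than taken from the question:

import os
import subprocess
import tempfile

from pyspark import SparkContext

BWA = '/home/hd_spark/tool/bwa-0.7.13/bwa'
REF_HDFS = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'

def fetch_to_local(hdfs_path, local_dir):
        # The HDFS shell expands globs itself, so a pattern like 'ref.fasta*'
        # pulls the fasta plus its bwa index files in one call.
        subprocess.check_call(['hdfs', 'dfs', '-get', hdfs_path, local_dir])

def bwaRunLocal(fastq_hdfs_path):
        work_dir = tempfile.mkdtemp()              # scratch space on the executor
        fetch_to_local(REF_HDFS + '*', work_dir)   # reference (and, assumed, its index files)
        fetch_to_local(fastq_hdfs_path, work_dir)  # the interleaved fastq chunk
        ref_local = os.path.join(work_dir, os.path.basename(REF_HDFS))
        fastq_local = os.path.join(work_dir, os.path.basename(fastq_hdfs_path))
        # bwa now only ever sees ordinary local paths
        return subprocess.check_output([BWA, 'mem', ref_local, fastq_local])

sc = SparkContext(appName='sub')
chunks = ['hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq']
print(sc.parallelize(chunks, 1).map(bwaRunLocal).collect())

If bwa can write its alignments to stdout, that output could alternatively be streamed straight back to HDFS with hdfs dfs -put - <dest> (which reads from stdin), along the lines of the answer Davos links, instead of being collected on the driver.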
0 Answers