-2

I'm currently writing a script in python that takes a number of flags. This is my first attempt at such a program, and I am getting an output from the bash script that I don't quite understand. For example when I run the script in the bash shell:

$ python my_script.py -f <input_file.txt> -k <test_string> -c <user_input>

I get this output before my script's output:

usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file

I can't seem to get rid of this, which is frustrating for the prettiness of the output. Any help would be great!

The code I'm using:

import sys, getopt, re, subprocess, collections, itertools

def find_kmers( arguments=sys.argv[1:] ):

    required_opts = ['-f','-c','-k']



    opts, args = getopt.getopt(arguments,'f:k:c:')



    opt_dic = dict(opts)



    for opt in required_opts:

        if opt not in opt_dic:

            return "incorrect arguments, please format as: python_script.py -f <filename> -k <kmer> -c <chromosome_name>"



    def rev_comp(sequence):

        reversed_dic = {'A':'T','T':'A','C':'G','G':'C'}

        return ''.join(reversed_dic[_] for _ in sequence[::-1])



    kmer = opt_dic['-k']

    subprocess.call(['bash','-c',"grep '>' S288C_R64.fasta > grep.tmp"])

    chromosomes = [_[1:].strip() for _ in open('grep.tmp')]

    subprocess.call(['bash','-c','rm','grep.tmp'])

    found = False

    if any(opt_dic['-c']==_ for _ in chromosomes):

        found = True



    def get_sequence(file):

        sequence = ''

        for line in file:

            if line.startswith('>'): break

            sequence += line.strip()

        return sequence.upper()



    ofile = open(opt_dic['-f'])

    if found == True:

        for line in ofile:

            if line.startswith('>'):

                if line[1:].strip() == opt_dic['-c']:

                    sequence = get_sequence(ofile)

                    break



    else:

        return 'chromosome not found in %s. \n chromosomes in file are:%s'%(opt_dic['-f'],', '.join(str(_) for _ in chromosomes))





    kmer_matches1 = re.finditer('(?=%s)'%opt_dic['-k'],sequence)

    kmer_matches2 = re.finditer('(?=%s)'%opt_dic['-k'],rev_comp(sequence))







    def print_statement(start,strand):



        return '%s\thw1_script\tkmer=%s\t%s\t%s\t.\t%s\t.\tID=S288C;Name=S288C\n'%(opt_dic['-c'],opt_dic['-k'],start,start+len(opt_dic['-k'])-1,strand)



    pos_strand = collections.deque()

    neg_strand = collections.deque()

    for match1,match2 in itertools.izip(kmer_matches1,kmer_matches2):

        pos_strand.append(match1.start()+1)

        neg_strand.append(match2.start()+1)



    wfile = open('answer.gff3','w')

    while len(pos_strand)>0 and len(neg_strand)>0:

        if pos_strand[0]<neg_strand[0]:

            start = pos_strand.popleft()

            wfile.write(print_statement(start,'+'))

        else:

            start = neg_strand.popleft()

            wfile.write(print_statement(start,'-'))



    while len(pos_strand)>0:

        start = pos_strand.popleft()

        wfile.write(print_statement(start,'+'))



    while len(neg_strand)>0:

        start = neg_strand.popleft()

        wfile.write(print_statement(start,'-'))



    wfile.close()



    return 'percent-GC = %s'%str(sum(sequence.count(gc) for gc in ["G","C"])/float(len(sequence)))



if __name__ == '__main__':

    print find_kmers()
Remi Guan
  • 21,506
  • 17
  • 64
  • 87
lstbl
  • 527
  • 5
  • 17
  • 3
    Show some source code! – Krumelur Jan 13 '16 at 22:40
  • 4
    you are clearly calling "rm" from somewhere in your source code ... you should not be doing that ... – Joran Beasley Jan 13 '16 at 22:40
  • oh, I figured it out. If I make the change the 'rm' command to : subprocess.call(['bash','-c','rm grep.tmp']), I get what I want. Sorry for the dumb question everyone – lstbl Jan 13 '16 at 22:49
  • 1
    It's not that you *can't* it's just a really bad idea. Plus you can [delete files from Python](http://stackoverflow.com/q/6996603/344286) without having to shell out. – Wayne Werner Jan 13 '16 at 22:50
  • 1
    Why did you remove your program code from the question? This is now (again) eligible for closing as "does not even include the code we are supposed to debug for you", which I have already done, before investigating how the answer could provide comments on code which was not in the current version of the question. – tripleee Jan 15 '16 at 04:11

1 Answers1

3

Invoking bash one-liners requires that the bash commands be a single string. Change:

subprocess.call(['bash','-c','rm','grep.tmp'])

to:

subprocess.call(['bash', '-c', 'rm grep.tmp'])

Or, more reasonably, don't use subprocesses for this, just do:

os.unlink('grep.tmp')  # Or os.remove; same thing, different names

which is much faster and less error prone.

In fact, all of your subprocess usage could be replaced with real Python code, and it would improve it substantially (and much of the Python code simplifies too):

def find_kmers( arguments=sys.argv[1:] ):
    required_opts = ['-f','-c','-k']

    opts, args = getopt.getopt(arguments,'f:k:c:')

    opt_dic = dict(opts)

    for opt in required_opts:
        if opt not in opt_dic:
            return "incorrect arguments, please format as: python_script.py -f <filename> -k <kmer> -c <chromosome_name>"

    def rev_comp(sequence):
        reversed_dic = {'A':'T','T':'A','C':'G','G':'C'}
        return ''.join(reversed_dic[_] for _ in sequence[::-1])

    kmer = opt_dic['-k']
    # Replaces grep with temp file with trivial Python equivalent
    with open('S288C_R64.fasta') as f:
        chromosomes = [line[1:].strip() for line in f if '>' in line]

    # No need for any loop when just checking for exact value
    if opt_dic['-c'] not in chromosomes:
        return 'chromosome not found in %s. \n chromosomes in file are:%s'%(opt_dic['-f'],', '.join(str(_) for _ in chromosomes))


    def get_sequence(file):
        sequence = ''
        for line in file:
            if line.startswith('>'): break
            sequence += line.strip()
        return sequence.upper()

    with open(opt_dic['-f']) as ofile:
        for line in ofile:
            if line.startswith('>'):
                if line[1:].strip() == opt_dic['-c']:
                    sequence = get_sequence(ofile)
                    break


    kmer_matches1 = re.finditer('(?=%s)'%opt_dic['-k'],sequence)
    kmer_matches2 = re.finditer('(?=%s)'%opt_dic['-k'],rev_comp(sequence))

    def print_statement(start,strand):
        return '%s\thw1_script\tkmer=%s\t%s\t%s\t.\t%s\t.\tID=S288C;Name=S288C\n'%(opt_dic['-c'],opt_dic['-k'],start,start+len(opt_dic['-k'])-1,strand)

    pos_strand = collections.deque()
    neg_strand = collections.deque()
    for match1,match2 in itertools.izip(kmer_matches1,kmer_matches2):
        pos_strand.append(match1.start()+1)
        neg_strand.append(match2.start()+1)

    with open('answer.gff3','w') as wfile:
        while pos_strand and neg_strand:
            if pos_strand[0]<neg_strand[0]:
                start = pos_strand.popleft()
                wfile.write(print_statement(start,'+'))
            else:
                start = neg_strand.popleft()
                wfile.write(print_statement(start,'-'))

        for start in pos_strand:
            wfile.write(print_statement(start,'+'))
        for start in neg_strand:
            wfile.write(print_statement(start,'-'))

    return 'percent-GC = %s'%str(sum(sequence.count(gc) for gc in ["G","C"])/float(len(sequence)))
ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
  • I originally was using grep because I thought it was faster than reading the whole file through python. Am I wrong about that? – lstbl Jan 13 '16 at 23:07
  • 1
    @lstbl: For truly huge files, where the filtering to be done eliminates the majority of the input, it might save a tiny bit, but it's unlikely to matter much; the biggest bottleneck to simple filters like this is often I/O, not processing speed. Even so, wrapping it in `bash` is pointless and adds additional overhead (reducing what little gain you might get). Jumping straight to `grep` without testing means you've signed up for more complicated and brittle code. If you don't have compelling performance testing showing a need, simple and robust should always be your first choice. – ShadowRanger Jan 14 '16 at 00:40
  • 2
    @lstbl: If `grep` really does provide a benefit, you still want to avoid the temporary file. Using `filteredlines = subprocess.check_output(['fgrep', '>', 'S288C_R64.fasta']).splitlines()` would get you the same output lines to pass into your list comprehension, and would do it as a single program launch, no shell wrapping, no writing to disk, reading back in, and deleting a file (`check_output` slurps from the subprocess's `stdout` directly, replacing relatively slow disk I/O with direct pipe communication using in memory buffers). – ShadowRanger Jan 14 '16 at 00:45