1

I have a script that searches through a csv file and a fasta file and any ID's that are in both files will be used to extract the fasta sequence from fasta file. It looks like this:

import os
import shlex
import subprocess
import argparse
import glob
import pysam

from pysam import FastaFile

#setup options for this program
parser = argparse.ArgumentParser()
group = parser.add_argument_group('Options for annotation.py')
group.add_argument(
    '-f', '--fasta', help='FASTA file (RAST). all_fprau.fasta', required=True)
group.add_argument(
    '-c', '--csv', help='CSV file (parsed orthomcl groups). parsed_groups.csv',
required=True)
args = parser.parse_args()


fasta = FastaFile(args.fasta)

with open(args.csv) as infile:
    infile.readline()  ## skip header

    for index, line in enumerate(infile):
        strain, matches = line.strip().split("\t")
        if not len(matches):
            print('Warning: Skipping strain {}. No     matches'.format(strain))
        continue
    matches = set(matches.split(','))  # remove duplicates

    key = '{}.faa'.format(strain)
    with open(key, 'w') as out_file:
        for match in matches:
            name = 'fig|{}'.format(match)
            try:
                sequence = fasta[name]
            except KeyError:
                print('Sequence absent in FASTA file {0}: {1}'.format(args.fasta, name))
                continue
            out_file.write(">{0}\n{1}\n".format(name, sequence))

I am having a problem running this script however. I get this error:

File "pysam/libcfaidx.pyx", line 123, in pysam.libcfaidx.FastaFile.__cinit__
File "pysam/libcfaidx.pyx", line 183, in pysam.libcfaidx.FastaFile._open
OSError: error when opening file `all_fprau.fasta`

The fasta file all_fprau.fasta contains a large number of sequences and I think that this may be the problem? When I cut down the size of the file the script works, however I need all the sequences to be available for the next step in this pipeline to work. I have tried to use:

fasta = glob.glob("/home/brian/my_orthomcl_dir/annotations/*.fasta") 

in place of

fasta = FastaFile(args.fasta)

in an attempt to search through the fasta files individually, but i get the error: TypeError: list indices must be integers or slices, not str, in relation to line 38 in the script above:

sequence = fasta[name]

Any help or suggestions would be very much appreciated!

FlyingTeller
  • 17,638
  • 3
  • 38
  • 53
BrianF
  • 11
  • 3
  • Have you tried opening the file manually with `open(args.fasta)`? – FlyingTeller Jun 27 '18 at 09:57
  • Ya, unfortunatly still get an error, TypeError: '_io.TextIOWrapper' object is not subscriptable, in relation to the line sequence = FastaFile(args.fasta) – BrianF Jun 27 '18 at 10:00
  • that makes sense doesn't it, beause you cannot acces a file you opened in that way with []. Do you also have the appropriate index file for your fasta file? – FlyingTeller Jun 27 '18 at 10:06
  • the index file usually gets created during the script running. Very strange that I works with smaller files, Thats whats getting me confused! – BrianF Jun 27 '18 at 10:37
  • How large is the file you are testing it with? – FlyingTeller Jun 27 '18 at 11:03
  • its only 30MB, but if I was to just copy and paste 100-200 sequences from thsi file to a new file the script will work. Unfortunatly I need all sequences to be analyzed. – BrianF Jun 27 '18 at 11:43
  • I have a minor fix, it's sort of papering over the cracks. Im just going to append to the output files rather then write to them. Seems to work. Theres also scripts available to split fasta files to equal smaller files. Thanks for your help anyway! – BrianF Jun 27 '18 at 11:52

1 Answers1

0

It's good to try to make your own solution to formatting and extracting bioinformatic data for the challenge, but generally it's been done before so spare yourself the headache. You could use Jim Kent's faSomeRecords (found here) or some confusing one-liner like: cut -c 2- EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA but it still requires samtools. Also, if you don't already know there is a kind of SO for bioinformatics but also SO bioinformatics.

Sank Finatra
  • 334
  • 2
  • 10