0

I am trying to make a script that reads a simple text file to identify lines to copy from a second file. Specifically, I am checking against a file in the FASTA format, which has sequence ID as a line starting with > followed by lines that are a sequence of nucleotides/aminoacids like this:

>OTU_1

ACTAAACCCATGTTTCCTCGGGG

GATAAGTAAATGAG

GATGA

>OTU_2

GAGATATAGCG

and so on. So, my 1st file that I am using to search the 2nd Fasta file is effectively as follows

>OTU_1

>OTU_5

>OTU_35

I have a script that can successfully do what I need it to do, which is go through the fasta and copy the sequence ID and associated sequence if it matches from the first file, except because of how I have the operators, if the 1st file has OTU_1, it takes from the fasta OTU_1, OTU_10, OTU_11, OTU_12, and so on.

The script I tried was

`with open("C:/Users/path/keyfile.txt") as f:
    key = f.read().splitlines()
searchfile = open("C:/Users/path/testOTUfasta.txt")
toggle = False
for i in searchfile:
    if (i[0] == ">" and toggle == False):
        if any(s in i for s in key):
            toggle = True
            print(i)
    elif (i[0] == ">" and toggle == True):
        if any(s in i for s in key):
            print(i)
        else:
            toggle == False
    elif toggle == True:
        print(i)`

which resulted in the nonspecific selective behavior I described above. I tried to do this based on another stack overflow answer. I also tried

`with open("C:/Users/path/keyfile.txt") as f:
    key = f.read().splitlines()
searchfile = open("C:/Users/path/testOTUfasta.txt")
toggle = False
for i in searchfile:
    if (i[0] == ">" and toggle == False):
        if i in key:
            toggle = True
            print(i)
    elif (i[0] == ">" and toggle == True):
        if i in key:
            print(i)
        else:
            toggle == False
    elif toggle == True:
        print(i)`

which instead of giving the above result gives nothing. I am largely confused as to 1. why my if 'i in key' doesn't work and 2. how to better use operators and such to be a bit more specific in the selection. Thank you for any help edited 6/26 to fix issues with how the file-examples displayed

shaymew
  • 9
  • 3

3 Answers3

0

When dealing with a structured data format, all the more a standardized one,

  • use a proper parser to convert file data into a data structure (to avoid reinventing the (square) wheel and ensure that you're reading the format correctly in all possible cases), and
  • work with that structure in your code.

For the FASTA format, I could readily find 2 parsers:

ivan_pozdeev
  • 33,874
  • 19
  • 107
  • 152
0

a quick google of fasta file format shows something a little different from your blockquote, testOTUfasta.txt should look something like:

>OTU_1
ACTAAACCCATGTTTCCTCGGGG
GATAAGTAAATGAG
GATGA

>OTU_2
GAGATATAGCG

>OTU_10
ACTAAACCCATGTTTCCTCGGGG
GATAAGTAAATGAG
GATGA

>OTU_20
GAGATATAGCG

and assuming keyfile.txt like this:

>OTU_1
>OTU_5
>OTU_20
>OTU_35

this setup builds a dictionary (because I am assuming you want to do something with the data other than just print a line) and searches it with the keys from the other file.

from pprint import pp

key_file, search_file = 'keyfile.txt', 'testOTUfasta.txt'

with open(search_file) as f1, open(key_file) as f2:
    #build dict with key as line starting with >, value is sequence
    searchfile = {
        k: ''.join(v.replace('\n', ''))
        for k, v in (i.split('\n', maxsplit=1)
        for i in f1.read().split('>') if i)
        }

    pp(searchfile)
    #{'OTU_1': 'ACTAAACCCATGTTTCCTCGGGGGATAAGTAAATGAGGATGA',
    # 'OTU_2': 'GAGATATAGCG',
    # 'OTU_10': 'ACTAAACCCATGTTTCCTCGGGGGATAAGTAAATGAGGATGA',
    # 'OTU_20': 'GAGATATAGCG'}

    #loop through key_file, seach searchfile dict
    for key in (line.strip('>\n') for line in f2):
        if key in searchfile:
            print(key, searchfile[key])
            #do real work here
        else:
            print(key, f'not found in {search_file}.txt')
    #OTU_1 ACTAAACCCATGTTTCCTCGGGGGATAAGTAAATGAGGATGA
    #OTU_5 not found in testOTUfasta.txt.txt
    #OTU_20 GAGATATAGCG
    #OTU_35 not found in testOTUfasta.txt.txt

if the fasta files are large-ish and the computer you are using cannot hold the dictionary, the order will have to be reversed.

richard
  • 144
  • 3
0

So partly my bad for using the print version I made of this, but the goal was ultimately to write a new sub-file from the input query. That said, I got it working, and I don't entirely get WHY it works now when it didn't before, but what worked for me is as follows:

import sys
with open(sys.argv[1], 'r') as f:
    key = f.read().splitlines()
with open(sys.argv[2], 'r') as m:
    search = m.read().splitlines()
outfile = open(sys.argv[3], 'w')
toggle = False
for i in search:
    if toggle == False:
        if i in key:
            toggle = True
    if toggle == True:
        if i[0] == ">":
            if i in key:
                outfile.write(i + '\n')
            else:
                toggle = False
        else:
            outfile.write(i + '\n')

I figured I should post what worked for me in case if anyone else encounters something like this.

shaymew
  • 9
  • 3