How to have an operator check if a line in a file being read matches a line in a list or another file being read in Python

Question

I am trying to make a script that reads a simple text file to identify lines to copy from a second file. Specifically, I am checking against a file in the FASTA format, which has sequence ID as a line starting with > followed by lines that are a sequence of nucleotides/aminoacids like this:

>OTU_1

ACTAAACCCATGTTTCCTCGGGG

GATAAGTAAATGAG

GATGA

>OTU_2

GAGATATAGCG

and so on. So, my 1st file that I am using to search the 2nd Fasta file is effectively as follows

>OTU_1

>OTU_5

>OTU_35

I have a script that can successfully do what I need it to do, which is go through the fasta and copy the sequence ID and associated sequence if it matches from the first file, except because of how I have the operators, if the 1st file has OTU_1, it takes from the fasta OTU_1, OTU_10, OTU_11, OTU_12, and so on.

The script I tried was

`with open("C:/Users/path/keyfile.txt") as f:
    key = f.read().splitlines()
searchfile = open("C:/Users/path/testOTUfasta.txt")
toggle = False
for i in searchfile:
    if (i[0] == ">" and toggle == False):
        if any(s in i for s in key):
            toggle = True
            print(i)
    elif (i[0] == ">" and toggle == True):
        if any(s in i for s in key):
            print(i)
        else:
            toggle == False
    elif toggle == True:
        print(i)`

which resulted in the nonspecific selective behavior I described above. I tried to do this based on another stack overflow answer. I also tried

`with open("C:/Users/path/keyfile.txt") as f:
    key = f.read().splitlines()
searchfile = open("C:/Users/path/testOTUfasta.txt")
toggle = False
for i in searchfile:
    if (i[0] == ">" and toggle == False):
        if i in key:
            toggle = True
            print(i)
    elif (i[0] == ">" and toggle == True):
        if i in key:
            print(i)
        else:
            toggle == False
    elif toggle == True:
        print(i)`

which instead of giving the above result gives nothing. I am largely confused as to 1. why my if 'i in key' doesn't work and 2. how to better use operators and such to be a bit more specific in the selection. Thank you for any help edited 6/26 to fix issues with how the file-examples displayed

ivan_pozdeev · Answer 1 · 2023-07-07T14:03:45.310

When dealing with a structured data format, all the more a standardized one,

use a proper parser to convert file data into a data structure (to avoid reinventing the (square) wheel and ensure that you're reading the format correctly in all possible cases), and
work with that structure in your code.

For the FASTA format, I could readily find 2 parsers:

SeqIO from the BioPython package (looks like the more mature one, can read other formats and is probably usable by other facilities in the package)
fasta standalone parser (likely lacks integration with other stuff but simpler, may be sufficient for your needs)

score 0 · Answer 2 · answered Jun 23 '23 at 19:51

a quick google of fasta file format shows something a little different from your blockquote, testOTUfasta.txt should look something like:

>OTU_1
ACTAAACCCATGTTTCCTCGGGG
GATAAGTAAATGAG
GATGA

>OTU_2
GAGATATAGCG

>OTU_10
ACTAAACCCATGTTTCCTCGGGG
GATAAGTAAATGAG
GATGA

>OTU_20
GAGATATAGCG

and assuming keyfile.txt like this:

>OTU_1
>OTU_5
>OTU_20
>OTU_35

this setup builds a dictionary (because I am assuming you want to do something with the data other than just print a line) and searches it with the keys from the other file.

from pprint import pp

key_file, search_file = 'keyfile.txt', 'testOTUfasta.txt'

with open(search_file) as f1, open(key_file) as f2:
    #build dict with key as line starting with >, value is sequence
    searchfile = {
        k: ''.join(v.replace('\n', ''))
        for k, v in (i.split('\n', maxsplit=1)
        for i in f1.read().split('>') if i)
        }

    pp(searchfile)
    #{'OTU_1': 'ACTAAACCCATGTTTCCTCGGGGGATAAGTAAATGAGGATGA',
    # 'OTU_2': 'GAGATATAGCG',
    # 'OTU_10': 'ACTAAACCCATGTTTCCTCGGGGGATAAGTAAATGAGGATGA',
    # 'OTU_20': 'GAGATATAGCG'}

    #loop through key_file, seach searchfile dict
    for key in (line.strip('>\n') for line in f2):
        if key in searchfile:
            print(key, searchfile[key])
            #do real work here
        else:
            print(key, f'not found in {search_file}.txt')
    #OTU_1 ACTAAACCCATGTTTCCTCGGGGGATAAGTAAATGAGGATGA
    #OTU_5 not found in testOTUfasta.txt.txt
    #OTU_20 GAGATATAGCG
    #OTU_35 not found in testOTUfasta.txt.txt

if the fasta files are large-ish and the computer you are using cannot hold the dictionary, the order will have to be reversed.

score 0 · Answer 3 · answered Jun 26 '23 at 16:30

So partly my bad for using the print version I made of this, but the goal was ultimately to write a new sub-file from the input query. That said, I got it working, and I don't entirely get WHY it works now when it didn't before, but what worked for me is as follows:

import sys
with open(sys.argv[1], 'r') as f:
    key = f.read().splitlines()
with open(sys.argv[2], 'r') as m:
    search = m.read().splitlines()
outfile = open(sys.argv[3], 'w')
toggle = False
for i in search:
    if toggle == False:
        if i in key:
            toggle = True
    if toggle == True:
        if i[0] == ">":
            if i in key:
                outfile.write(i + '\n')
            else:
                toggle = False
        else:
            outfile.write(i + '\n')

I figured I should post what worked for me in case if anyone else encounters something like this.

How to have an operator check if a line in a file being read matches a line in a list or another file being read in Python

3 Answers3