I need help making this program take a DNA sequence and break it into groups of three (ATGCGTGGC => ATG, CGT, GGC) to create codons. It then looks up each codon in the 'genecode' dictionary in my code. It seems to work fine until it reaches the end of the sequence, where the Amino_A lookup raises this KeyError:

Traceback (most recent call last):
  File "main.py", line 101, in <module>
    Amino_A=genecode[codon_1]
KeyError: ''

What am I doing wrong? I used list comprehensions to strip the newlines ('\n') by replacing them with an empty string (''), but I'm guessing that when it reaches the end of the file it leaves a '' there, and the above error pops up. What do I do? Thanks for any help!

genecode = {
  'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
  'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
  'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
  'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
  'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
  'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
  'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
  'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
  'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
  'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
  'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'Glu',
  'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
  'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
  'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
  'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
  'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}      

Gene1=open("HBB Norm.csv", "r")        
Gene2=open("HBB Pos (Sickle Cell).csv", "r")
##HBB Norm and HBB Pos (Sickle Cell) are just the names of the csv files I want to import data from 

Gene_1=Gene1.read()
Gene_11=[Gene_1.replace('\n','') for gene_1 in Gene_1]

Gene1.close()

Gene_2=Gene2.read() 
Gene_22=[Gene_2.replace('\n','') for gene_2 in Gene_2]

Gene2.close()

AA_diff=[] 

for i in range(len(Gene_11)):
  Gene_112=Gene_11[i]
  Gene_113=Gene_22[i]

for codon in range(0,len(Gene_11),3):  
    codon_1=Gene_112[codon:codon+3]    

    Amino_A=genecode[codon_1]
    codon_2=Gene_113[codon:codon+3]
    Amino_B=genecode[codon_2]
    if Amino_A!=Amino_B:      #Trying get a dash line btwn diff Amino Acids 
      AA_diff.append(Amino_B)
      print(Amino_A,'-',Amino_B)
nlsn243

1 Answer

There are quite a few fundamental flaws in your code, but one of the offending lines is this:

codon_1=Gene_112[codon:codon+3]

If you slice a sequence like this and the slice extends past its last index, Python does not raise an error; it returns whatever part of the slice actually exists, which may be nothing at all. For example:

>>> test = 'ABCD'
>>> test[2:3]
'C'
>>> test[4:]
''

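Something similar would happen with a list, but you would get an empty list instead of ''. On that note, '' is the empty string, not a space. A space (' ') is still a character; it just happens to be blank.

That appears to be what happens here: Gene_11 ends up with one element per character of the raw file, newlines included, so range(0, len(Gene_11), 3) can run past the end of the shorter, newline-free string in Gene_112. The final slice is then '', which is not a key in genecode.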
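To see how that turns into the KeyError from the question, here is a minimal sketch. It uses a made-up two-entry dictionary (toy_code) and a made-up six-base string in place of the real genecode table and file contents, so the names and data are assumptions for illustration only.

```python
# Toy lookup table (a stand-in for the full genecode dictionary).
toy_code = {'ATG': 'M', 'CGT': 'R'}

seq = 'ATGCGT'          # 6 bases, i.e. exactly two codons

# Slicing past the end never raises; it just returns less (or nothing).
in_range = seq[3:6]     # the second codon
past_end = seq[6:9]     # starts at the end of the string, so it is empty

try:
    toy_code[past_end]
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]   # the key that was not found

print(repr(in_range), repr(past_end), repr(error_key))
```

The slice itself is harmless; the error only appears once the empty string is used as a dictionary key.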

Some general improvements you can do that can help clear up these issues:

  • Use with open(...) to read your files instead of open(...) followed (hopefully) by close(). The with block closes the file for you, even if an error occurs:
with open('somefile.txt') as f:
    # Do file stuff with f
  • Use ''.join(some_str_with_newlines.splitlines()) to remove newlines from your string instead of wrapping str.replace in a list comprehension, which builds one copy of the whole cleaned string per character in the file. The join gives you a single long string; if you would rather have a list of lines, just call splitlines(), which returns a list of the separate lines.

  • Don't iterate over a list using indices; just iterate over the list itself (for thing in some_list:). If you want to iterate over something in evenly sized chunks (such as the chunks of 3 you need here), I highly recommend letting a helper do the chunking for you, like the code in this excellent answer (included below). It works on anything that supports len() and slicing, including strings.

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
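Putting the three suggestions together might look roughly like the sketch below. The helper name read_sequence, the file name demo_sequence.txt and its contents are all made up for the demonstration; the real CSV files are assumed to contain nothing but bases and line breaks.

```python
import os
import tempfile


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def read_sequence(path):
    """Return the file's contents as one string with line breaks removed."""
    with open(path) as f:
        return ''.join(f.read().splitlines())


# Demo with a throwaway file standing in for one of the CSV files.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_sequence.txt')
    with open(demo_path, 'w') as f:
        f.write('ATGCGT\nGGC\n')
    sequence = read_sequence(demo_path)

codons = list(chunks(sequence, 3))
print(sequence)
print(codons)
```

Note that chunks still yields a shorter final chunk when the length is not a multiple of n, so a sequence with leftover bases would need a guard before the dictionary lookup.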

One other comment: once this loop finishes, Gene_112 and Gene_113 hold only the values from its final iteration, i.e. the last items of Gene_11 and Gene_22:

for i in range(len(Gene_11)):
    Gene_112=Gene_11[i]
    Gene_113=Gene_22[i]

Maybe that was an indentation error?
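If the goal is to compare the two sequences codon by codon, one possible shape for that comparison is sketched below. It assumes both sequences are already newline-free strings; diff_amino_acids, toy_code and the two short sequences are hypothetical names and made-up data, not the real HBB files.

```python
def diff_amino_acids(seq_a, seq_b, table):
    """Return (amino_a, amino_b) pairs where the translated codons differ."""
    diffs = []
    usable = min(len(seq_a), len(seq_b))
    usable -= usable % 3          # drop any trailing partial codon
    for i in range(0, usable, 3):
        amino_a = table[seq_a[i:i + 3]]
        amino_b = table[seq_b[i:i + 3]]
        if amino_a != amino_b:
            diffs.append((amino_a, amino_b))
    return diffs


# Made-up data: a small subset of the codon table and two short sequences.
toy_code = {'ATG': 'M', 'GAG': 'E', 'GTG': 'V', 'CCT': 'P'}
normal = 'ATGCCTGAG'
variant = 'ATGCCTGTG'

result = diff_amino_acids(normal, variant, toy_code)
for amino_a, amino_b in result:
    print(amino_a, '-', amino_b)
```

Because the loop bound comes from the cleaned strings themselves and is rounded down to a multiple of 3, no slice can come back empty or partial.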

b_c