
I am new to Python, and to programming in general, and I am learning Python through rosalind.info, a website that aims to teach through problem solving.

Here is the problem: you're asked to calculate the percentage of guanine and cytosine in the DNA string given for each ID, then return the ID of the sample with the greatest percentage.

I'm working on the sample problem on the page and am experiencing some difficulty. I know my code is probably really inefficient and cumbersome but I take it that's to be expected for those who are new to programming.

Anyway, here is my code.

gc = open("rosalind_gcsamp.txt","r")
biz = gc.readlines()
i = 0
gcc = 0
d = {}
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
        del biz[i+2]

What I'm trying to accomplish here is, given input such as this:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG

Break what's given into a list based on the lines and concatenate the two lines of DNA like so:

['>Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', 'TCCCACTAATAATTCTGAGG\n']

And delete the entry two indices after the ID (the ID being the line that starts with >Rosalind). What I do with it later I still need to figure out.

However, I keep getting an index error and can't, for the life of me, figure out why. I'm sure it's a trivial reason, I just need some help.

I've even attempted the following to limited success:

for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
    elif biz[i].startswith("A" or "C" or "G" or "T") and biz[i+1].startswith(">"):
        del biz[i]

which still gives me an index error but at least gives me the biz value I want.

Thanks in advance.

CelineDion
  • 2
    One of your issues is this: http://stackoverflow.com/questions/15112125/how-do-i-test-one-variable-against-multiple-values – poke Apr 27 '15 at 16:35
  • 2
    This doesn't really answer your question, but consider using `len(thing)` instead of `thing.__len__()`. – Kevin Apr 27 '15 at 16:37
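To illustrate both comments, here is a small sketch. The expression `"A" or "C" or "G" or "T"` evaluates to just `"A"` before `startswith` ever sees it, whereas `startswith` also accepts a tuple of prefixes. The sample lines and the names `prefix`, `lines`, and `bases` are made up for illustration.

```python
# `or` returns its first truthy operand, so only "A" is ever tested.
prefix = "A" or "C" or "G" or "T"
print(prefix)  # A

lines = ["CCTG", "ATTC", ">Rosalind_6404"]  # made-up sample lines

# The original test only matches the line that starts with "A".
print([line.startswith("A" or "C" or "G" or "T") for line in lines])
# [False, True, False]

# startswith accepts a tuple of prefixes, which checks all four bases.
bases = ("A", "C", "G", "T")
print([line.startswith(bases) for line in lines])
# [True, True, False]

# len(lines) is the idiomatic spelling of lines.__len__().
print(len(lines) == lines.__len__())  # True
```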

2 Answers


You are looping over every index of biz, so in the last iterations biz[i+1] and biz[i+2] don't exist: there is no item after the last one.
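Note also that `del` shortens the list while the range of indices was fixed up front, so the loop can run past the end even earlier. One way to sidestep both problems is to build a new structure instead of editing biz in place. The following is a minimal sketch, not the asker's code: the helper name `parse_fasta` is hypothetical, and it assumes every sequence line belongs to the most recent line starting with `>`.

```python
def parse_fasta(lines):
    """Map each '>' header to its concatenated sequence lines.

    Hypothetical helper for illustration: it never looks ahead
    (no i+1 or i+2) and never deletes from what it iterates over.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line          # start a new record
            records[current] = ""
        elif current is not None:
            records[current] += line  # extend the current sequence
    return records


sample = [
    ">Rosalind_6404\n",
    "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\n",
    "TCCCACTAATAATTCTGAGG\n",
]
print(parse_fasta(sample))
```

Because the function only iterates over lines, it should work equally on the list returned by readlines() or on the open file object itself, however many lines each record spans.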

Klaus D.
  • So he'd want the range to be `xrange(len(biz)-2)` to allow for `biz[i+2]` – MasterOdin Apr 27 '15 at 16:37
  • Not really, that'd still cause problems because this loop should only run on certain lines (I presume that the file is a lot longer with multiple sets of this three line data set). Instead the loop should pass when it's not a line starting with '>' and it shouldn't remove any elements from the list during the loop so that the length is preserved. – SuperBiasedMan Apr 27 '15 at 16:47
  • @SuperBiasedMan is right, it will need to work on something that is greater than three lines, but MasterOdin's solution worked for just the sample. – CelineDion Apr 27 '15 at 16:53

It is very easy to do with itertools.groupby, using the lines that start with > both as the keys and as the delimiters:

from itertools import groupby
with open("rosalind_gcsamp.txt","r") as gc:
    # group elements using  lines that start with ">" as the delimiter
    groups = groupby(gc, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        # if k is False the line did not satisfy "not x.startswith('>')",
        # i.e. it is a header: use it as the key and call next on the
        # groupby object to get the sequence lines that follow it
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1]))
            d[key] = val

print(d)
{'>Rosalind_0808': 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT', '>Rosalind_5959': 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', '>Rosalind_6404': 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'}

If you need to preserve the order of the records, use a collections.OrderedDict in place of the plain dict d.
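From there, the remaining step of the problem (finding the ID with the highest share of G and C bases) could be sketched as follows. This assumes d maps each ID to its full sequence as above; the function name `gc_percent` and the two short sequences are hypothetical, made up for illustration.

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in seq (0.0 if empty)."""
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Made-up stand-in for the dictionary built above.
d = {">id_1": "GGCC", ">id_2": "ATGC"}

# max with a key function picks the ID whose sequence scores highest.
best = max(d, key=lambda k: gc_percent(d[k]))
print(best.lstrip(">"))     # id_1
print(gc_percent(d[best]))  # 100.0
```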

Padraic Cunningham