I have a list of lists:

sample = [['TTTT', 'CCCZ'], ['ATTA', 'CZZC']]
count = [[4,3],[4,2]]
correctionfactor  = [[1.33, 1.5],[1.33,2]]

I calculate the frequency of each character (pi), square it, and sum the squares; I then calculate het = 1 - sum.
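For a single sequence, the calculation described above can be sketched as follows. This is only an illustration: the helper name `het_for_sequence` is hypothetical, and it assumes the denominator is the full sequence length, whereas the question supplies a separate `count` list for that.

```python
from __future__ import division  # true division on Python 2 as well


def het_for_sequence(seq):
    """Return 1 - sum(pi^2) for one sequence (hypothetical helper)."""
    count_dict = {}
    for base in seq:
        # Tally how many times each character occurs
        count_dict[base] = count_dict.get(base, 0) + 1
    n = len(seq)  # assumption: every character counts towards the total
    # Sum of squared frequencies: pi^2 + pj^2 + ... + pn^2
    pf = sum((c / n) ** 2 for c in count_dict.values())
    return 1 - pf


print(het_for_sequence('ATTA'))  # A and T each have frequency 0.5, so het = 0.5
```

For 'ATTA' the frequencies are 2/4 for both A and T, giving 1 - (0.25 + 0.25) = 0.5.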

The desired output: [[1,2],[1,2]] # NOTE: These are NOT the real expected values. I just need the real values to be in this format.

The problem: I do not know how to pass the lists of lists (sample, count) into this loop to extract the values needed. I previously passed only a flat list (e.g. ['TACT','TTTT'..]) using this code.

  • I suspect that I need to add an outer for loop that iterates over each element in sample (i.e. over sample[0] = ['TTTT', 'CCCZ'] and sample[1] = ['ATTA', 'CZZC']). I am not sure how to incorporate that into the code.

Code

list_of_hets = []
for idx, element in enumerate(sample):
    count_dict = {}
    square_dict = {}
    for base in list(element):
        if base in count_dict:
            count_dict[base] += 1
        else:
            count_dict[base] = 1
    for allele in count_dict: #Calculate frequency of every character
        square_freq = (count_dict[allele] / count[idx])**2 #Square the frequencies
        square_dict[allele] = square_freq        
    pf = 0.0
    for i in square_dict:
        pf += square_dict[i]   # pf --> pi^2 + pj^2...pn^2 #Sum the frequencies
    het = 1-pf                    
    list_of_hets.append(het)
print list_of_hets

"Failed" OUTPUT:
line 70, in <module>
square_freq = (count_dict[allele] / count[idx])**2
TypeError: unsupported operand type(s) for /: 'int' and 'list'
biogeek
  • The error message tells you **exactly** what's wrong: `square_freq = (count_dict[allele] / counts[idx])**2` is raising `TypeError: unsupported operand type(s) for /: 'int' and 'list'`. You can't divide an `int` by a `list`. By the way, this doesn't match the code you wrote, which would probably raise another `TypeError` when you try to pass `counts[idx]` to `float`. – juanpa.arrivillaga Oct 11 '16 at 08:43
  • I am trying to use a zip command like `square_freq = [[n/d for n, d in zip(subq, subr)] for subq, subr in zip(count_dict[allele], counts)]`. But I'm still having errors. Any other suggestions? – biogeek Oct 11 '16 at 08:47
  • @PM2Ring I have corrected it. Thanks for pointing it out – biogeek Oct 11 '16 at 08:47
  • What are subq, subr??? – juanpa.arrivillaga Oct 11 '16 at 09:31
  • subq and subr are just any indices (like i and j) that act as index for count_dict[allele] and counts – biogeek Oct 11 '16 at 09:33
  • Ah, sorry, didn't see the rest of that line haha. But anyway, isn't `count_dict[allele]` an `int`? You can't zip that. – juanpa.arrivillaga Oct 11 '16 at 09:33
  • Also, I have edited the question to highlight the real problem (which I realised as I was troubleshooting) – biogeek Oct 11 '16 at 09:34
  • What should the keys be in `count_dict`? Currently, they are strings like 'CCCZ' and 'TTTT'. Is that correct, or should they be single base letters, like 'C' and 'T'? – PM 2Ring Oct 11 '16 at 09:35
  • If I used a single list, this would work. This (https://eval.in/658468) clarifies the points you raised and prints the values. – biogeek Oct 11 '16 at 09:54
  • @juanpa.arrivillaga Any thoughts? – biogeek Oct 11 '16 at 10:00

1 Answer

I'm not completely clear on how you want to handle the 'Z' items in your data, but this code replicates the output for the sample data in https://eval.in/658468

from __future__ import division

bases = set('ACGT')
#sample = [['TTTT', 'CCCZ'], ['ATTA', 'CZZC']]
sample = [['ATTA', 'TTGA'], ['TTCA', 'TTTA']]

list_of_hets = []
for element in sample:
    hets = []
    for seq in element:
        count_dict = {}
        for base in seq:
            if base in count_dict:
                count_dict[base] += 1
            else:
                count_dict[base] = 1
        print count_dict

        # Count only the valid bases (A, C, G, T) in this sequence
        count = sum(1 for u in seq if u in bases)
        # pf is the sum of squared frequencies: pi^2 + pj^2 + ... + pn^2
        pf = sum((n / count) ** 2 for n in count_dict.values())
        hets.append(1 - pf)
    list_of_hets.append(hets)

print list_of_hets

output

{'A': 2, 'T': 2}
{'A': 1, 'T': 2, 'G': 1}
{'A': 1, 'C': 1, 'T': 2}
{'A': 1, 'T': 3}
[[0.5, 0.625], [0.625, 0.375]]

This code could be simplified further by using a collections.Counter instead of the count_dict.
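As a rough sketch of that simplification, assuming the same sample data and the same handling of non-base symbols as the code above (they are excluded from the total but still contribute a squared term), it might look like this. The helper name `heterozygosity` is only illustrative.

```python
from __future__ import division  # true division on Python 2 as well
from collections import Counter

bases = set('ACGT')
sample = [['ATTA', 'TTGA'], ['TTCA', 'TTTA']]


def heterozygosity(seq):
    """Return 1 - sum(pi^2) for one sequence (illustrative helper)."""
    counts = Counter(seq)  # replaces the manual count_dict loop
    # Total number of valid bases, as in the original count calculation
    total = sum(n for base, n in counts.items() if base in bases)
    pf = sum((n / total) ** 2 for n in counts.values())
    return 1 - pf


# Nested comprehension keeps the list-of-lists shape of the input
list_of_hets = [[heterozygosity(seq) for seq in element] for element in sample]
print(list_of_hets)
```

For the sample data this prints [[0.5, 0.625], [0.625, 0.375]], the same result as the loop version.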

BTW, if the symbol that's not in 'ACGT' is always 'Z' then we can speed up the count calculation. Get rid of bases = set('ACGT') and change

count = sum(1 for u in seq if u in bases)

to

count = sum(1 for u in seq if u != 'Z')
PM 2Ring
  • My final output has to be in the form `[[0.5, 0.625],[0.625, 0.375]]`, because I need to be able to distinguish between the first element in set1 (['ATTA', 'TTGA']) and set2 (['TTCA', 'TTTA']) – biogeek Oct 11 '16 at 10:16
  • Also, don't worry about the "Zs" I have figured out a way of handling it :) – biogeek Oct 11 '16 at 10:24
  • @biogeek: That's easy enough to do. See the new version of my answer. – PM 2Ring Oct 11 '16 at 10:34
  • Also, I don't want to use a function that converts a list into a nested list "externally" (like this http://stackoverflow.com/a/6614975/6824986). This is just sample data, and I need to be able to make this "split" (list of lists) according to user-specified input (e.g. if it is [['AA','TT','GG'],['GG','CC','TC'],['AA','TT','GG']], the final output should have the form [[1,2,3],[1,2,3],[1,2,3]]) – biogeek Oct 11 '16 at 10:39
  • Thanks a LOT! I have been trying to figure this out forever now. – biogeek Oct 11 '16 at 10:42
  • @biogeek No worries! BTW, I've added an alternative way to do the `count` calculation to my answer. – PM 2Ring Oct 11 '16 at 11:09