
Problem:

Hi, I'm new to Python and looking for some help. I have multiple input lines. I'm looking for a way to take every word from multiple lists and add up its occurrences to get the total count for that word across the lists. I'd really appreciate any guidance.

Kate
  • Add your code as well. – H. U. Nov 03 '15 at 22:41
  • I don't understand your expected output – Two-Bit Alchemist Nov 03 '15 at 22:43
  • @Two-BitAlchemist basically counting the indexes of the read sequences in an output (e.g. GGGA shows up at index 0 in read 1, and index 3 in read 2)? – Nick T Nov 03 '15 at 22:50
  • Can a sequence show up multiple times in a read? Your read 2 shows ATTA twice. Given that, I'm not sure about the output again. – RobertB Nov 03 '15 at 22:52
  • It may be worth mentioning that dictionaries are inherently unordered and keys must be unique. See this page for more info: http://www.tutorialspoint.com/python/python_dictionary.htm – McGlothlin Nov 03 '15 at 22:52
  • Also, you're calling list and dict-specific methods on the same variable `kmer` – McGlothlin Nov 03 '15 at 22:53
  • Your output is still unclear. Give a real example with real input and the expected output. You don't need 50 columns, just 5, but make it exactly correct. – RobertB Nov 03 '15 at 23:08
  • Please add your explanations in comments to your Q. And see that you use proper Pythonic syntax while explaining. I mean, what is => ??? Do you need a count of the word given by key at the given index? Or total count, or just indices at which the word appears in multiple lists. I'll edit my A accordingly. – Dalen Nov 03 '15 at 23:25

3 Answers


Edit:

"""I need a count of the word given by key(s) in multiple lists at the given index. So what I need is
>>> count(["a", "b", "c"], ["c", "b", "a"])
{"a": [1,0,1], "b": [0,2,0], "c": [1,0,1]}
"""

I hope this will be right:

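def count(*args):
    # Assumes every list has the same length as the first one.
    l = len(args[0])
    end = {}
    for lst in args:
        # y is the index (column), x is the word found there.
        for y, x in enumerate(lst):
            # Give each new word a zero-filled list, one slot per index.
            end.setdefault(x, l * [0])
            end[x][y] += 1
    return end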

If you need to manipulate RNA/DNA sequence data, you can search for libraries on pypi.python.org. There are plenty of good ones.
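For what it's worth, here is a minimal sketch of the same idea with an explicit length check. The name count_by_index and the ValueError are my own additions, not part of the answer above, and it assumes all lists are meant to have the same length.

```python
def count_by_index(*lists):
    """Count, for each word, how often it appears at each index across lists."""
    if not lists:
        return {}
    width = len(lists[0])
    # Assumption: every list has the same length; fail loudly otherwise.
    if any(len(lst) != width for lst in lists):
        raise ValueError("all lists must have the same length")
    result = {}
    for lst in lists:
        for index, word in enumerate(lst):
            # Create a zero-filled row the first time a word is seen.
            result.setdefault(word, [0] * width)[index] += 1
    return result


print(count_by_index(["a", "b", "c"], ["c", "b", "a"]))
# {'a': [1, 0, 1], 'b': [0, 2, 0], 'c': [1, 0, 1]}
```

The check is there only because a longer later list would otherwise fail with a less helpful IndexError.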

Kate
Dalen
  • Kate, please just explain what 0, 1 and 2 are here. In my code they're indexes in each list, as the lists are given in *args. I can see that you're counting something, but how exactly is still a bit mysterious. If you have to count along multiple lists, you should perhaps consider switching to matrix arithmetic, i.e. linear algebra. – Dalen Nov 03 '15 at 23:39
  • @Kate: I edited my A. The result is what you describe: the cumulative count of words over indices, traversing the columns of same-sized lists containing the same words in different order. The code will break if a new word is introduced somewhere along the way. But the logic is here. Is it what you need? – Dalen Nov 04 '15 at 00:29
  • Oh, God, are you an electronic engineer, Kate? You need a cumulative count over all lists together, or a count only over columns. (Imagine putting your lists into a matrix.) If you need only the count over columns where words match, then all you need is here; just tweak one of my functions a bit. – Dalen Nov 04 '15 at 00:34
  • I edited it once more and applied it to your sequences (in range(0, 8), of course, not 50), and the results for GGGA and ATTA match your samples. The logic is easy. Set up a dictionary with an entry for each possible word, with the counts for all n indices set to 0 (a list per item). Then traverse all lists column-wise and count occurrences. y is the column number and x is the word being counted. Right? – Dalen Nov 04 '15 at 01:32
  • By *normal* I meant the loop that is classic i.e. not a list comprehension. When I am setting the dictionary I used list comprehension, but it creates a lot of later useless lists, therefore, needlessly using memory. Just change it to: for x in args: for y in x: end.setdefault(y, l*[0]); If you want to count how many GGGA's you have, just do: d = count(l1, l2...); sum(d["GGGA"]) – Dalen Nov 04 '15 at 01:43
  • Yes, all lists will be the same size, so I assumed and l is that length. In your case it'll be 50. args[0] is the first list you will pass to count(). args is tuple of all lists you passed in order you passed them in. If you want to pass a list of lists, you can do it with: count(*[l1, l2, ...]); (by putting the star in front of instance. It will be unpacked into arguments.) – Dalen Nov 04 '15 at 01:50
  • Please triple-check the output. I don't want to be responsible for some chimera or mutant. Centaurs and hippogriffs are still myths. :D We don't want them flying around, do we? I'll read it all again in the morning. My brain is now exhausted. – Dalen Nov 04 '15 at 01:56
  • Oh, no, you aren't. I did today. Played with threads where they shouldn't be used. I love such dangerous challenges, but there are a lot of details to be covered. Brainsquashing. :D I edited it. I pushed setdefault() to be used all in one. No need for double looping at all. You pushed the second loop inside, that's why you got an error. You had 4 nested loops, even using the same var names. Now it is OK. Just triple-check against any sphinx etc. :D If it is all well, you can accept the A, if you want. – Dalen Nov 04 '15 at 02:33
  • Thanks Dalen for all the help !! – Kate Nov 04 '15 at 02:58
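The two usage tips from the comments (passing a list of lists with a star, and summing a word's per-index counts to get its total) can be tried out with a short sketch. The reads below are made up for illustration, and count() is repeated only so the example runs on its own.

```python
def count(*args):
    # Same logic as the answer's count(), repeated so this example is self-contained.
    width = len(args[0])
    end = {}
    for lst in args:
        for y, x in enumerate(lst):
            end.setdefault(x, width * [0])
            end[x][y] += 1
    return end


# Hypothetical reads, invented for this example only.
reads = [
    ["GGGA", "ATTA", "GATT"],
    ["GATT", "ATTA", "GGGA"],
]

d = count(*reads)        # the star unpacks the outer list into separate arguments
print(d["ATTA"])         # [0, 2, 0]
print(sum(d["GGGA"]))    # 2
```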

Take a look at the dict.setdefault() method and the enumerate() function:

def count_items(data):
    count = {}
    for datum in data:
        count[datum] = count.setdefault(datum, 0) + 1
    return count


def collate(*data):
    # Merge several count dictionaries into one, summing the counts per key.
    collated = {}
    for datum in data:
        for k, v in datum.items():
            collated[k] = collated.get(k, 0) + v
    return collated


def key_position(sequence, key):
    sequence_map = [0 for _ in sequence]
    for i, item in enumerate(sequence):
        if key == item:
            sequence_map[i] += 1
    return sequence_map


data1 = ['a', 'b', 'c', 'd', 'a']
data2 = ['a', 'b', 'c', 'd', 'a']

counted1 = count_items(data1)
counted2 = count_items(data2)

collated = collate(counted1, counted2)

a_positions = key_position(data1, 'a')
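One possible way to combine these pieces into per-index totals is to build a position map for every unique key in every sequence and then add the maps up column by column. This is only a sketch: summed_positions is a name I made up, and it assumes all sequences have the same length.

```python
def key_position(sequence, key):
    """Return a 0/1 list marking where key occurs in sequence."""
    return [1 if item == key else 0 for item in sequence]


def summed_positions(*sequences):
    """For every unique key, sum its position maps across all sequences.

    Assumes all sequences have the same length; zip() would silently
    truncate to the shortest one otherwise.
    """
    keys = {item for seq in sequences for item in seq}
    return {
        key: [sum(col) for col in zip(*(key_position(seq, key) for seq in sequences))]
        for key in keys
    }


data1 = ['a', 'b', 'c', 'd', 'a']
data2 = ['a', 'b', 'c', 'd', 'a']
totals = summed_positions(data1, data2)
print(totals['a'])  # [2, 0, 0, 0, 2]
print(totals['b'])  # [0, 2, 0, 0, 0]
```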
Bobby Russell
  • Updated my example with an additional function to collate multiple lists – Bobby Russell Nov 03 '15 at 23:09
  • Kate, collate uses [*args](http://stackoverflow.com/questions/3394835/args-and-kwargs), so you can collate like `collate(data1, data2, data3, ..., datan)`. Not sure if that's actually answering your question... feel free to clarify. – Bobby Russell Nov 03 '15 at 23:22
  • Short of doing your work for you, all I can do is say that you need to generate key positions for each unique key in all of your sequences, then map the sum of those key positions. – Bobby Russell Nov 03 '15 at 23:35
  • Bobby, your code is a bit overkill. And I hope you're not proposing to use nested dictionaries or something like that. – Dalen Nov 03 '15 at 23:44
  • @Dalen no, I'm not proposing to use nested dictionaries at all, just suggesting that the OP look at `dict.setdefault()` and `enumerate()` as tools in her algorithm. – Bobby Russell Nov 03 '15 at 23:59
  • setdefault() is always a nice trick if you can push it in. I agree. And it is fast. You mentioned mapping, so I just thought what and where would you map? I don't like your solution with multiple functions snowballing with lists here and there. It uses interpreter stack too much and it is not easily understood. – Dalen Nov 04 '15 at 00:50

First, zip your reads:

Read_1 = ['GGGA', 'ATTA']
Read_2 = ['GATT', 'ATTA']
reads = zip(Read_1, Read_2)
# yields one tuple per index: ('GGGA', 'GATT'), ('ATTA', 'ATTA')

Then, count stuff:

from collections import Counter
counters = [Counter(read) for read in reads]

Then ask for the frequency of a given sequence:

print(list(cnt['ATTA'] for cnt in counters))
# [0, 2]
print(list(cnt['GGGA'] for cnt in counters))
# [1, 0]
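Put together, the three steps above might look like this as one runnable Python 3 script. The helper per_index_counts is a name of my own, added just to avoid repeating the comprehension.

```python
from collections import Counter

read_1 = ['GGGA', 'ATTA']
read_2 = ['GATT', 'ATTA']

# zip() pairs up the items at each index: one tuple per column.
# list() materializes it, since zip() is a one-shot iterator in Python 3.
columns = list(zip(read_1, read_2))
print(columns)  # [('GGGA', 'GATT'), ('ATTA', 'ATTA')]

# One Counter per column; a Counter returns 0 for keys it has not seen.
counters = [Counter(column) for column in columns]


def per_index_counts(word):
    """Return how often word occurs at each index across the reads."""
    return [counter[word] for counter in counters]


print(per_index_counts('ATTA'))  # [0, 2]
print(per_index_counts('GGGA'))  # [1, 0]
```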
njzk2