python position frequency dictionary of letters in words

Question

To efficiently get the frequencies of the letters of a given alphabet (here ABC) in a string code, as a dictionary, I can write a function like this (Python 3):

def freq(code):
   return{n: code.count(n)/float(len(code)) for n in 'ABC'}

Then

code='ABBBC'   
freq(code)

Gives me

{'A': 0.2, 'C': 0.2, 'B': 0.6}

But how can I get the frequencies for each position across a list of strings of unequal lengths? For instance, mcode=['AAB', 'AA', 'ABC', ''] should give me a nested structure such as a list of dicts, where each dict holds the frequencies at one position:

[{'A': 1.0, 'C': 0.0, 'B': 0.0}, 
 {'A': 0.66, 'C': 0.0, 'B': 0.33},
 {'A': 0.0, 'C': 0.5, 'B': 0.5}]

I cannot figure out how to compute the frequencies per position across all strings and wrap this in a list comprehension. Inspired by other SO posts on word counts, e.g. the well-discussed post "Python: count frequency of words in a list", I thought the Counter class from collections might help.

Understand it like this - write the mcode strings on separate lines:

AAB
AA
ABC

Then what I need is the column-wise frequencies (over the columns AAA, AAB, BC) of the alphabet ABC, as a list of dicts where each list element holds the frequencies of A, B and C in one column.

I don't quite understand your example. Should the output for first string 'AAB' be {'A': 0.66, 'C': 0.0, 'B': 0.33}? Also, is there always a max of 3 distinct letters in your strings (ABC)? — Allen Qin, Apr 29 '17 at 07:35
The frequencies at the first position are calculated on AAA, at position 2 on AAB, and at position 3 on BC. Does that make sense? Align the words on separate lines, then find the frequencies along the columns. — user3375672, Apr 29 '17 at 07:42
You could do `itertools.zip_longest(*mcode)` and loop with your `freq` over this. You have to change `len(code)` to reflect the correct length though. — Feodoran, Apr 29 '17 at 07:44
@Allen Do you understand it better now? See my update (I wrote that it was the frequency at each position). — user3375672, Apr 29 '17 at 07:55

Heiko Oberdiek · Answer 1 · 2017-04-29T08:01:48.110

Here is an example; the steps are briefly explained in the comments. Counter from the collections module is not used, because the mapping for each position must also contain the characters that do not occur at that position, and the order of the frequencies does not seem to matter.

def freq(*words):
    # All dictionaries contain all characters as keys, even
    # if a character is not present at a position.
    # Collect the characters, then sort them into a list.
    chars = set()
    for word in words:
        chars |= set(word)

    chars = sorted(chars)

    # Get the number of positions.
    max_position = max(len(word) for word in words)

    # Initialize the result list of dictionaries.
    result = [
        dict((char, 0) for char in chars)
        for position in range(max_position)
    ]

    # Count characters.
    for word in words:
        for position in range(len(word)):
            result[position][word[position]] += 1

    # Change to frequencies
    for position in range(max_position):
        count = sum(result[position].values())
        for char in chars:
            result[position][char] /= count  # float(count) for Python 2

    return result


# Testing
from pprint import pprint
mcode = ['AAB', 'AA', 'ABC', '']
pprint(freq(*mcode))

Result (Python 3):

[{'A': 1.0, 'B': 0.0, 'C': 0.0},
 {'A': 0.6666666666666666, 'B': 0.3333333333333333, 'C': 0.0},
 {'A': 0.0, 'B': 0.5, 'C': 0.5}]

In CPython 3.6, dictionaries preserve insertion order (a language guarantee from Python 3.7), so the keys appear in sorted order here; earlier versions can use OrderedDict from collections instead of dict.
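
For interpreters older than 3.6, the OrderedDict variant mentioned above could look like the following sketch. It only shows the initialization of one per-position mapping; the name position_counts is made up for this illustration and is not part of the answer's code.

```python
from collections import OrderedDict

mcode = ['AAB', 'AA', 'ABC', '']

# Same sorted character list as in freq() above.
chars = sorted(set(''.join(mcode)))

# An OrderedDict remembers insertion order on every Python version,
# so the keys come back in the sorted order they were inserted in.
position_counts = OrderedDict((char, 0) for char in chars)
print(list(position_counts))
```

Replacing dict(...) with OrderedDict(...) in the result initialization should be the only change needed, since the counting and division steps work on either type.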

score 1 · Accepted Answer · answered Apr 29 '17 at 08:04

1

A much shorter solution:

from itertools import zip_longest

def freq(code):
    l = len(code) - code.count(None)
    return {n: code.count(n)/l for n in 'ABC'}

mcode=['AAB', 'AA', 'ABC', '']
results = [ freq(code) for code in zip_longest(*mcode) ]
print(results)
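
To see why freq subtracts code.count(None), it may help to look at what zip_longest hands to it. This small sketch just prints the columns for the question's mcode; the name columns is introduced here for illustration.

```python
from itertools import zip_longest

mcode = ['AAB', 'AA', 'ABC', '']

# zip_longest transposes the words into columns and pads the
# shorter words with None (the default fillvalue).
columns = list(zip_longest(*mcode))
print(columns)
# [('A', 'A', 'A', None), ('A', 'A', 'B', None), ('B', None, 'C', None)]
```

Each column is a tuple, so code.count(...) and len(code) work as they do on a string, and the None padding has to be excluded from the denominator.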

answered Apr 29 '17 at 08:04

Feodoran


That's very nice - I always forget about itertools – user3375672 Apr 29 '17 at 08:11

Eric Duminil · Answer 3 · 2017-04-29T08:18:56.300

Your code isn't efficient at all:

You first need to define which letters you'd like to count.
You need to scan the string once for each distinct letter.

You could just use Counter:

import itertools
from collections import Counter
mcode=['AAB', 'AA', 'ABC', '']
all_letters = set(''.join(mcode))

def freq(code):
  code = [letter for letter in code if letter is not None]
  n = len(code)
  counter = Counter(code)
  return {letter: counter[letter]/n for letter in all_letters}

print([freq(x) for x in itertools.zip_longest(*mcode)])
# [{'A': 1.0, 'C': 0.0, 'B': 0.0}, {'A': 0.6666666666666666, 'C': 0.0, 'B': 0.3333333333333333}, {'A': 0.0, 'C': 0.5, 'B': 0.5}]

For Python 2, you could use itertools.izip_longest. The division would then also need float(n), because / truncates integers there.
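
If the zip_longest / izip_longest difference is a concern, one possible alternative is to skip the padding entirely and keep one Counter per position. This is only a sketch under that assumption, not part of the answer above; position_frequencies and the local names are hypothetical.

```python
from collections import Counter

def position_frequencies(words):
    # One Counter per column; the list grows as longer words are seen.
    counters = []
    for word in words:
        for position, letter in enumerate(word):
            if position == len(counters):
                counters.append(Counter())
            counters[position][letter] += 1

    # Every dict gets the same keys: all letters seen in any word.
    alphabet = sorted(set(''.join(words)))

    result = []
    for counter in counters:
        # float() keeps the division from truncating on Python 2.
        total = float(sum(counter.values()))
        result.append({letter: counter[letter] / total for letter in alphabet})
    return result

print(position_frequencies(['AAB', 'AA', 'ABC', '']))
```

Because no None padding is ever created, there is nothing to filter out, and each word is scanned exactly once.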
