Counting subsequent letters

Question

So I am trying to implement code that will count the next letter in a sentence, using python. so for instance,

"""So I am trying to implement code that will count the next letter in a sentence, using 
python"""

most common letters one after the other

for 's'
- 'o' :1
- 'e' :1
for 'o'
- ' ' :1
- 'd' :1
- 'u' :1
- 'n' :1

I think you get the idea

I already have written code for counting letters prior

def count_letters(word, char):
    count = 0
    for c in word:
        if char == c:
            count += 1
    return count

As you can see this just counts for letters, but not the next letter. can someone give me a hand on this one?

Possible duplicate of [Markov chain on letter scale and random text](https://stackoverflow.com/questions/8660015/markov-chain-on-letter-scale-and-random-text) — Andrey Tyukin, Mar 30 '18 at 19:24

Alex Hall · Accepted Answer · 2018-03-30T19:49:28.963

3

from collections import Counter, defaultdict

counts = defaultdict(Counter)

s = """So I am trying to implement code that will count the next letter in a sentence, using
python""".lower()

for c1, c2 in zip(s, s[1:]):
    counts[c1][c2] += 1

(apart from being simpler, this should be significantly faster than pault's answer by not iterating over the string for every letter)

Concepts to google that aren't named in the code:

for c1, c2 in ... (namely the fact that there are two variables): tuple unpacking
s[1:]: slicing. Basically this is a copy of the string after the first character.

edited Mar 30 '18 at 19:49

answered Mar 30 '18 at 19:40

Alex Hall

34,833
5
57
89

Looks good. I will add here some further reading: [Understanding python's slice notation](https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation) and [Zip lists in python](https://stackoverflow.com/questions/13704860/zip-lists-in-python) – pault Mar 30 '18 at 19:53
Oh, wow... The `defaultdict` is definitely something I should add to my toolbox. Thanks! – Andrey Tyukin Mar 30 '18 at 19:55

pault · Answer 2 · 2018-03-30T20:06:26.023

1

Here's a way using collections.Counter:

Suppose the string you provided was stored in a variable s.

First we iterate over the set of all lower case letters in s. We do this by making another string s_lower which will convert the string s to lowercase. We then wrap this with the set constructor to get unique values.

For each char, we iterate through the string and check to see if the previous letter is equal to char. If so, we store this in a list. Finally, we pass this list into the collections.Counter constructor which will count the occurrences.

Each counter is stored in a dictionary, counts, where the keys are the unique characters in the string.

from collections import Counter

counts = {}
s_lower = s.lower()
for char in set(s_lower):
    counts[char] = Counter(
        [c for i, c in enumerate(s_lower) if i > 0 and s_lower[i-1] == char]
    )

For your string, this has the following outputs:

>>> print(counts['s'])
#Counter({'i': 1, 'e': 1, 'o': 1})

>>> print(counts['o'])
#Counter({' ': 2, 'd': 1, 'n': 1, 'u': 1})

One caveat is that this method will iterate through the whole string for each unique character, which could potentially make it slow for large lists.

Here is an alternative approach using collections.Counter and collections.defaultdict that only loops through the string once:

from collections import defaultdict, Counter

def count_letters(s):
    s_lower = s.lower()
    counts = defaultdict(Counter)
    for i in range(len(s_lower) - 1):
        curr_char = s_lower[i]
        next_char = s_lower[i+1]
        counts[curr_char].update(next_char)
    return counts

counts = count_letters(s)

We loop over each character in the string (except the last) and on each iteration we update a counter using the next character.

edited Mar 30 '18 at 20:06

answered Mar 30 '18 at 19:35

pault

41,343
15
107
149

The first test case should be `Counter({'o': 1, 'e': 1, 'i': 1})`. You're missing `'o'`. – Alex Hall Mar 30 '18 at 19:41
1

Also `set(map(str.lower, s))` can just be `set(s.lower())` – Alex Hall Mar 30 '18 at 19:42
@AlexHall good catch. I forgot one `.lower()`. Edited now. Also good point about the `set(s.lower()`. – pault Mar 30 '18 at 19:43
So, it will run tens or even hundreds of times through the entire text? Once for every character? – Andrey Tyukin Mar 30 '18 at 19:48
@AndreyTyukin I didn't say it was the *best* solution. – pault Mar 30 '18 at 19:49
Call .lower once and store the result. – Alex Hall Mar 30 '18 at 19:50
@pault I didn't claim it was "the worst" either ;) My own solution uses unnecessary `O(n * log(n))`-sorting just so I don't have to reimplement a `groupby` that doesn't rely on the order, so it's not the most efficient solution either... – Andrey Tyukin Mar 30 '18 at 19:51
@AndreyTyukin added a second method that should be more efficient. Only once through the string. – pault Mar 30 '18 at 20:01

score 1 · Answer 3 · answered Mar 30 '18 at 19:44

Here is a relatively terse way to do it:

from itertools import groupby
from collections import Counter

def countTransitionFrequencies(text):
  prevNext = list(zip(text[:-1], text[1:]))
  prevNext.sort(key = lambda pn: pn[0])
  transitions = groupby(prevNext, lambda pn: pn[0])
  freqs = map(
    lambda kts: (kts[0], Counter(map(lambda kv: kv[1], kts[1]))), 
    transitions
  )
  return freqs

Explanation:

zip creates list of pairs with (previous, next) characters
The pairs are sorted and grouped by the previous character
The frequencies of the next characters (extracted from pairs by kv[1]) are then counted using Counter.

Sorting is not really necessary, but unfortunately, this is how the provided groupby works.

An example:

for k, v in countTransitionFrequencies("hello world"):
  print("%r -> %r" % (k, v))

This prints:

' ' -> Counter({'w': 1})
'e' -> Counter({'l': 1})
'h' -> Counter({'e': 1})
'l' -> Counter({'l': 1, 'o': 1, 'd': 1})
'o' -> Counter({' ': 1, 'r': 1})
'r' -> Counter({'l': 1})
'w' -> Counter({'o': 1})

Stoicas · Answer 4 · 2018-03-30T21:15:41.360

This should work, the only thing is it doesn't sort the values, but that can be solved by creating a new dictionary with list of tuples (char, occurrences) and using sorted function on tuple[1].

def countNext(word):
    d = {}
    word = word.lower()
    for i in range(len(word) - 1):
        c = word[i]
        cc = word[i+1]
        if(not c.isalpha() or not cc.isalpha()):
            continue
        if c in d:
            if cc in d[c]:
                d[c][cc] += 1
            else:
                d[c][cc] = 1
        else:
            d[c] = {}
            d[c][cc] = 1
    return d

Counting subsequent letters

4 Answers4