Python strings: quickly summarize the character count in order of appearance

Question

Let's say I have the following strings in Python3.x

string1 = 'AAAAABBBBCCCDD'
string2 = 'CCBADDDDDBACDC'
string3 = 'DABCBEDCCAEDBB'

I would like to create a summary "frequency string" that counts the number of characters in the string in the following format:

string1_freq = '5A4B3C2D'  ## 5 A's, followed by 4 B's, 3 C's, and 2D's
string2_freq = '2C1B1A5D1B1A1C1D1C' 
string3_freq = '1D1A1B1C1B1E1D2C1A1E1D2B'

My problem:

How would I quickly create such a summary string?

My idea would be: create an empty list to keep track of the count. Then create a for loop which checks the next character. If there's a match, increase the count by +1 and move to the next character. Otherwise, append to end of the string 'count' + 'character identity'.

That's very inefficient in Python. Is there a quicker way (maybe using the functions below)?

There are several ways to count the elements of a string in python. I like collections.Counter, e.g.

from collections import Counter
counter_str1 = Counter(string1)
print(counter_str1['A']) # 5
print(counter_str1['B']) # 4
print(counter_str1['C']) # 3
print(counter_str1['D']) # 2

There's also str.count(sub[, start[, end]

Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.

As an example:

print(string1.count('A'))  ## 5

score 2 · Answer 1 · answered Apr 07 '18 at 20:38

2

I would use itertools.groupby to group consecutive runs of the same letter. Then use a generator expression within join to create a string representation of the count and letter for each run.

from itertools import groupby
def summarize(s):
    return ''.join(str(sum(1 for _ in i[1])) + i[0] for i in groupby(s))

Examples

>>> summarize(string1)
'5A4B3C2D'
>>> summarize(string2)
'2C1B1A5D1B1A1C1D1C'
>>> summarize(string3)
'1D1A1B1C1B1E1D2C1A1E1D2B'

answered Apr 07 '18 at 20:38

Cory Kramer

114,268
16
167
218

Thanks for the answer. I'm sorry I'm lost, but could you provide some more details on how `summarize()` works? This is useful for learning. So `i[0]` in `i[0] for i in groupby(s))` is the letter of within `s`. `i[1]` is an `itertools._grouper ` object... – ShanZhengYang Apr 07 '18 at 20:51
1

Yes that's exactly right. `i[0]` is the letter, and `i[1]` is the group of consecutive letters, which we just count. The rest is just to turn everything into strings and concatenate them – Cory Kramer Apr 07 '18 at 21:01
1

More Pythonically (and more efficiently), unpack the two-tuple and use it by name instead of repeated indexing: `''.join(str(sum(1 for _ in grp)) + key for key, grp in groupby(s))`. For all but the largest groups, it's even faster to do `''.join(str(len(list(grp))) + key for key, grp in groupby(s))` – ShadowRanger Apr 07 '18 at 23:27
@ShadowRanger " For all but the largest groups, it's even faster". I would be interested if you could quantify this somewhat. At what point to do you notice a slowdown, or is it linear with the size of the group? – ShanZhengYang Apr 08 '18 at 00:01
@ShanZhengYang: `len(list(grp))` is basically always fastest unless it causes you to exceed RAM limits and end up paging out to disk (which seems highly unlikely in this case). There's more detail in [my old answer on this topic](https://stackoverflow.com/a/34404546/364696). – ShadowRanger Apr 08 '18 at 00:32
@ShadowRanger You're right. The `len(list(grp))` does should a performance (on strings approx. 20 characters in length) of around 15% – ShanZhengYang Apr 08 '18 at 19:24

score 2 · Accepted Answer · answered Apr 07 '18 at 22:06

2

The following code accomplishes the task without importing any modules.

def freq_map(s):
    num = 0         # number of adjacent, identical characters
    curr = s[0]     # current character being processed
    result = ''     # result of function

    for i in range(len(s)):
        if s[i] == curr:
            num += 1
        else:
            result += str(num) + curr
            curr = s[i]
            num = 1

    result += str(num) + curr

    return result

Note: Since you requested a solution based on performance, I suggest you use this code or a modified version of it.

I have executed rough performance test against the code provided by CoryKramer for reference. This code performed the same function in 58% of the time without using external modules. The snippet can be found here.

answered Apr 07 '18 at 22:06

Taylor Iserman

175
1
16

The performance tests surprise me, especially against a for loop. Do you know of any reason behind this? Is there some overhead to `itertools`? – ShanZhengYang Apr 07 '18 at 22:40
I would encourage you to perform similar tests to see for yourself and your implementation if the same is true. I am not certain why the implementation with `itertools` was more costly. My guess would be that it is a result of the wide functionality offered with the `itertools` module. – Taylor Iserman Apr 07 '18 at 22:51
It looks like this method is still the fastest. – ShanZhengYang Apr 08 '18 at 19:22

Python strings: quickly summarize the character count in order of appearance

2 Answers2