Building a number compression algorithm

Question

I am trying to solve this problem:

Consider a simple number compression algorithm that works as follows:

111155522500 -> [('1', 4), ('5', 3), ('2', 2), ('5', 1), ('O', 2)]

The algorithm goes from left to right through each digit and returns a list of two-element tuples. Each tuple consists of a digit and the number of repetitions of a given digit until the next, different digit in the number is encountered.

Implement a function called compress () that compresses number as described above.

Examples:

[IN]: compress(lll)
[OUT]: [('l', 3)]
[IN]: conpress(1000000)
[OUT]: [('1', 1), ('O', 6)]
[IN]: conpress(10005000)
[OUT]: [('1', 1), ('O', 3), ('5', 1), ('O', 3)]

My code:

#para funcionar, é e necessário acrescentar um caracter que não  existe ao final da string!
def compress(elemento):
    elemento = str(elemento)
    elemento = elemento +"§"# the code only works if I add some char not in the string
    saida = []
    #cont = 0
    contagem = []
    
    for index,ele in enumerate(elemento):
        ele = str(ele)
        #print(f"contagem: {contagem}")
        if len(contagem) == 0 or ele in contagem:
            contagem.append(ele)
       
        if ele not in contagem :
            saida.append((elemento[index-1],len(contagem)) )
            contagem = []
            contagem.append(ele)                      
        
    return saida



compress('1000500011')

OUT:

[('1', 1), ('0', 3), ('5', 1), ('0', 3), ('1', 2)]

The problem: The code only works if I add some char not in the string as "§" Could someone help to solve this problem using the itertools built-in module and the groupby and without this modules?

what's missing, to avoid the end-of-sequence special character, is `saida.append((elemento[index-1],len(contagem)) )` after your loop has completed — njzk2, Jul 05 '22 at 10:50
also, `contagem` is usually replaced with a current character pointer, and a current count of occurences. Otherwise `ele in contagem` can become expensive — njzk2, Jul 05 '22 at 10:51

score 2 · Answer 1 · answered Jul 05 '22 at 10:27

2

Pretty straight forward with itertools.

def compress(number):
    return [
        (c, sum(1 for _ in g))
        for c, g in itertools.groupby(str(number))
    ]

demo:

>>> compress(10005000)
[('1', 1), ('0', 3), ('5', 1), ('0', 3)]

answered Jul 05 '22 at 10:27

timgeb

76,762
20
123
145

thanks. Could you explain the code? – Ed S Jul 05 '22 at 10:39
1

@EdS sure, if you have a specific question? – timgeb Jul 05 '22 at 10:42
I did not understand the use of itertools! – Ed S Jul 05 '22 at 10:43
is there a specific part of the documentation that you don't understand? – njzk2 Jul 05 '22 at 10:49
2

@EdS https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby – timgeb Jul 05 '22 at 10:49

Daniel Hao · Answer 2 · 2022-07-05T11:03:15.973

The earlier post is a perfect example of groupby. But if you're open to more_itertools, here is another version: run_length. It basically doing something similar with groupby - compresses an iterable with run-length encoding. It yields groups of repeated items with the count of how many times they were repeated:

It even provides corresponding decode to decompress the compressed one. Please check it out. So there is no need to reinvent the wheels...

from more_itertools import run_length

num = 10005000

compressed = list(run_length.encode(str(num)))

Output:

[('1', 1), ('0', 3), ('5', 1), ('0', 3)]

So running it in IDE like this:
>>> num = 10005000
>>> compressed = list(run_length.encode(str(num)))
>>> compressed
[('1', 1), ('0', 3), ('5', 1), ('0', 3)]
>>> run_length.decode(compressed)
<itertools.chain object at 0x039931A8>
>>> list(run_length.decode(compressed))
['1', '0', '0', '0', '5', '0', '0', '0']
>>> num = [int(x) for x in run_length.decode(compressed)]
>>> num
[1, 0, 0, 0, 5, 0, 0, 0]

DarrylG · Accepted Answer · 2022-07-06T12:13:12.063

Corrections to your code with '> Edit---' comments to highlight changes.

Code

def compress(elemento):
    elemento = str(elemento)  # > Edit-----not necessary, since already a string
    
    # > Edit-----Remove adding last character
    #elemento = elemento +"§"# the code only works if I add some char not in the string
    
    saida = []
    #cont = 0
    contagem = []
    
    for index,ele in enumerate(elemento):

        #ele = str(ele)   # > Edit-----not necessary, since already a string
        #print(f"contagem: {contagem}")
        #f len(contagem) == 0 or ele in contagem:     # > Edit-----Inefficient to check entire list
        if len(contagem) == 0 or ele == contagem[-1]: #            More efficient just to compare last element in list
            contagem.append(ele)
       
        #if ele not in contagem :   # > Edit-----Inefficient to check entire list
        if ele != contagem[-1]:     #             More efficient just to compare last element in list
            #saida.append((elemento[index-1],len(contagem)) ) # EDIT---incorrect index
            saida.append((elemento[index-1],len(contagem)) )  # EDIT---index reference correct character
            contagem = []
            contagem.append(ele)   
          
    # > Edit-----Not having next if block caused your error
    #             Add last contagem  if it exists
    if contagem:
        #saida.append((elemento[index-1], len(contagem)) )
        saida.append((elemento[index-1], len(contagem)) )  # EDIT---incorrect index
        saida.append((elemento[index], len(contagem)) )    # EDIT---index references to correct character
    return saida

compress('1000500011')

Output

[('1', 1), ('0', 3), ('5', 1), ('0', 3), ('1', 2)]

Code Refactoring

Code can be simplified to the following:

def compress(elements):
    stack = []
    for ele in elements:
        if not stack or stack[-1][0] != ele:
            stack.append((ele, 1)) 
        else:
            stack[-1] = (stack[-1][0], stack[-1][1] + 1)
            
    return stack

thanks! There is a bug in my code corrected by you: for exemple: IN: compress('1000500016') _> OUT: [('1', 1), ('0', 3), ('5', 1), ('0', 3), ('1', 1), ('1', 1)]. Could you help? — Ed S, Jul 06 '22 at 11:35
@EdS -- made correction to your corrected code. To get the current character we need `elemento[index]`, not `elemento[index-1]`, This is because in `for index,ele in enumerate(elemento):` index goes from 0 (first chaaracter) to len(elemento)-1 (last character). Let me know if there are other questions. — DarrylG, Jul 06 '22 at 12:15

score 1 · Answer 4 · answered Jul 06 '22 at 23:15

An itertools approach:

from itertools import groupby

def compress(n):
   return [(grp_id, len(list(grp))) for grp_id, grp in groupby(str(n))]

n = 111155522500
print(compress(n))

Just for fun, a decompress function:

def decompress(n_compress):
    return int(''.join(map(str.__mul__, *zip(*n_compress))))

Building a number compression algorithm

4 Answers4