Rice coding in Cython

Question

Here is an implementation of well-known Rice coding (= Golomb code with M = 2^k http://en.wikipedia.org/wiki/Golomb_coding), widely used in compression algorithms, in Python.

Unfortunately it is rather slow. What could be the cause of this low speed ? (StringIO? the fact that data is written byte after byte?)

What would you recommand to use in order to speed it the encoding ? What trick would you use to speed it up with Cython ?

import struct
import StringIO

def put_bit(f, b):
    global buff, filled
    buff = buff | (b << (7-filled))
    if (filled == 7):
        f.write(struct.pack('B',buff))
        buff = 0
        filled = 0
    else:
        filled += 1

def rice_code(f, x, k):
    q = x / (1 << k)                       
    for i in range(q): 
        put_bit(f, 1)
    put_bit(f, 0)
    for i in range(k-1, -1, -1):
        put_bit(f, (x >> i) & 1)

def compress(L, k):
    f = StringIO.StringIO()
    global buff, filled
    buff = 0
    filled = 0
    for x in L:                # encode all numbers
        rice_code(f, x, k)
    for i in range(8-filled):  # write the last byte (if necessary pad with 1111...)  
        put_bit(f, 1)
    return f.getvalue()

if __name__ == '__main__':
    print struct.pack('BBB', 0b00010010, 0b00111001, 0b01111111)      #see http://fr.wikipedia.org/wiki/Codage_de_Rice#Exemples
    print compress([1,2,3,10],k = 3)

PS : Should this question be moved to https://codereview.stackexchange.com/ ?

I would recommend profiling it and seeing where the bottleneck(s) actually are. — groundlar, Mar 29 '14 at 22:51
@surfreak with which tool would you do this ? do you have a small example? — Basj, Mar 29 '14 at 23:06
There is already a `profile` module in Python, among other solutions. See the Python wiki [here](https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Profiling_Code) as well as [this question](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script). — groundlar, Mar 29 '14 at 23:15
Just for posteriety, this Python code is almost certainly ported from here: http://dmr.ath.cx/code/rice/ — Tim McNamara, Oct 05 '17 at 08:53

score 1 · Answer 1 · answered Mar 31 '14 at 01:59

I would use a C-style buffer instead of StringIO when building the compressed result and I would attempt to use only C-style temporaries in the encoding loop. I also noticed that you can pre-initialize your buffer to be filled with set bits ('1' bits), and this will make encoding values with a large quotient faster because you can simply skip over those bits in the output buffer. I rewrote the compress function with those things in mind, and measured the speed of the result, and it seems my version is more than ten times faster than your encoder, but the resulting code is less readable.

Here is my version:


cimport cpython.string
cimport libc.stdlib
cimport libc.string
import struct

cdef int BUFFER_SIZE = 4096

def compress(L, k):
    result = ''

    cdef unsigned cvalue
    cdef char *position
    cdef int bit, nbit
    cdef unsigned q, r
    cdef unsigned ck = k
    cdef unsigned mask = (1 << ck) - 1

    cdef char *buff = <char *>libc.stdlib.malloc(BUFFER_SIZE)
    if buff is NULL:
        raise MemoryError

    try:
        #  Initialize the buffer space is assumed to contain all set bits
        libc.string.memset(buff, 0xFF, BUFFER_SIZE)

        position = buff
        bit = 7

        for value in L:
            cvalue = value
            q = cvalue >> ck
            r = cvalue & mask

            #  Skip ahead some number of pre-set one bits for the quotient
            position += q / 8
            bit -= q % 8
            if bit < 0:
                bit += 8
                position += 1

                #  If we have gone off the end of the buffer, extract 
                #  the result and reset buffer pointers
                while position - buff >= BUFFER_SIZE:
                    block = cpython.string.PyString_FromStringAndSize(
                        buff, BUFFER_SIZE)
                    result = result + block

                    libc.string.memset(buff, 0xFF, BUFFER_SIZE)
                    position = position - BUFFER_SIZE

            #  Clear the final bit to indicate the end of the quotient
            position[0] = position[0] ^ (1 << bit)
            if bit > 0:
                bit = bit - 1
            else:
                position += 1
                bit = 7

                #  Check for buffer overflow
                if position - buff >= BUFFER_SIZE:
                    block = cpython.string.PyString_FromStringAndSize(
                        buff, BUFFER_SIZE)
                    result = result + block

                    libc.string.memset(buff, 0xFF, BUFFER_SIZE)
                    position = buff

            #  Encode the remainder bits one by one
            for nbit in xrange(k - 1, -1, -1):
                position[0] = (position[0] & ~(1 << bit)) | \
                              (((r >> nbit) & 1) << bit)

                if bit > 0:
                    bit = bit - 1
                else:
                    position += 1
                    bit = 7

                    #  Check for buffer overflow
                    if position - buff >= BUFFER_SIZE:
                        block = cpython.string.PyString_FromStringAndSize(
                            buff, BUFFER_SIZE)
                        result = result + block

                        libc.string.memset(buff, 0xFF, BUFFER_SIZE)
                        position = buff

        #  Advance if we have partially used the last byte
        if bit < 7:
            position = position + 1

        #  Extract the used portion of the buffer
        block = cpython.string.PyString_FromStringAndSize(
            buff, position - buff)
        result = result + block

        return result

    finally:
        libc.stdlib.free(buff)


def test():
    a = struct.pack('BBB', 0b00010010, 0b00111001, 0b01111111)      #see http://fr.wikipedia.org/wiki/Codage_de_Rice#Exemples
    b = compress([1,2,3,10],k = 3)

    assert a == b

FWIW, you can probably consider also using `cpython.array.array`s for this task, which are basically shallowly wrapped `malloc`s. See [here](http://stackoverflow.com/questions/18462785/what-is-the-recommended-way-of-allocating-memory-for-a-typed-memory-view/21054369#21054369) for some examples. — Veedrac, Apr 12 '14 at 13:53

Rice coding in Cython

1 Answers1