What is the best way to convert raw binary data to a custom base in python?

Question

I need to convert some data to base 29 before processing and I'm using this:

import string

def datatobase(data, base):
    digs = string.digits + string.lowercase + string.uppercase
    if base > len(digs):
        return None
    digits = []
    x = int(data.encode("hex"), 16)
    while x:
        digits.append(digs[x % base])
        x /= base
    digits.reverse()
    return ''.join(digits)

Problem is that this small code is slowing my program too much so what would you do to replace it?

A custom answer for base 29 only would be great too!

I think this is a duplicate of http://stackoverflow.com/questions/2267362/convert-integer-to-a-string-in-a-given-numeric-base-in-python. — bitsplit, Oct 15 '14 at 13:21
If you want to do this quickly, like in linear time, then I believe you need to change your target base to something that is a power of 2, like 32 or 16. — Anton, Oct 15 '14 at 14:18
Please show an example of the ```data``` you are trying to convert. — wwii, Oct 16 '14 at 03:38
I don't think this is a duplicate question, I've read the page bitsplit: is pointing before posting but here I have a real performance issue :) Antom wwii: The data I'm converting doesn't matter and I also made a previous version of my script using power of 2 (and yes this is obviously much faster). I really need using base 29 conversion now. — fbparis, Oct 16 '14 at 04:52

wwii · Answer 1 · 2014-10-16T12:04:54.610

0

Base 29, only, solution for an int argument.

Recursive:

s = '0123456789ABCDEFGHIJKLMNOPQRS'
def foo(n, s=s):
    if n < 29:
        return s[n]
    a, b = divmod(n, 29)
    return foo(a) + s[b]

Regular:

def foo(n, s=s):    
    x = ''
    while n >= 29:
        n, b = divmod(n, 29)
        x += s[b]
    x += s[n]
    return x[::-1]

edited Oct 16 '14 at 12:04

answered Oct 15 '14 at 12:21

wwii

23,232
7
37
77

Isn't this basically the same algorithm as OP's? – Anton Oct 15 '14 at 12:46
@Anton - basically the same, however, this answer includes some bugs that the OP's algorithm doesn't :) 1) base 29 symbols range from `0-9` and `A-S`, and 2) the maximum decimal value for a single "digit" in base 29 is 28, not 29. – mhawke Oct 15 '14 at 13:51
@Anton I took OP's problem as a performance issue - I tried to improve the performance but I didn't compare it. Bugs fixed, thnx – wwii Oct 15 '14 at 18:58

wenzul · Answer 2 · 2014-10-16T05:47:41.283

0

If you take care of runtime... This version is 2.8 times faster then yours and 7% faster then @wwii.

def bin2base29(n):
    s = '0123456789ABCDEFGHIJKLMNOPQRS'
    return s[n] if n < 29 else bin2base29(n / 29) + s[n % 29]

Here is my final iterative and fastest solution of my approaches adapted from @wwii.

def bin2base29(n):
    s = '0123456789ABCDEFGHIJKLMNOPQRS'
    x = ''
    while n > 0:
        x = s[n % 29] + x
        n /= 29
    return x

edited Oct 16 '14 at 05:47

answered Oct 15 '14 at 13:27

wenzul

3,948
2
21
33

`s` is too short - it is missing the `S`. Also, maximum `n` for single digit in base 29 is 28, not 29. – mhawke Oct 15 '14 at 13:56
@mhawke Thanks you are right. My examples did all work, tried it with 28,29 now... :) – wenzul Oct 15 '14 at 14:19
How did you measure your algorithm's performance? It's not 29 times faster than the OP's (besides, "29" seems a little coincidental!) – mhawke Oct 15 '14 at 14:28
@mhawke Thank you for your tenacity. I think I mixed the repeats in timeit or used to few repetitions. I correct it and it seems also more realistic to me now. Do you get similar numbers? – wenzul Oct 15 '14 at 14:39
Actually, I found the OP's to be the fastest. Your's, mine, and wwii's iterative one are very similar. – mhawke Oct 15 '14 at 14:47
@mhawke not for me. As a reference I use datatobase('2347878234', 62) and 100000 times. – wenzul Oct 15 '14 at 14:56
@wenzul - my edits seem to make them more competitive. – wwii Oct 16 '14 at 03:35
@wwii In the first try I thought mine is more then 7% faster, now yours is almost equal. It also depends on the length of the input. The longer it will be the faster the iteration version will be. You could simple improve yours by avoiding `divmod` or hold the reference in a local variable. There is also an error in your paste. I will correct it. Now you beat the recursive version. – wenzul Oct 16 '14 at 05:39
`hold the reference in a local variable` - please explain. Both versions appear competitive with yours despite using `divmod` - please explain why avoiding `divmod` would help. – wwii Oct 16 '14 at 12:13
@wwii If you store `dm = divmod` once in a local variable (in the iterative case) you can save the lookup for the built-in function each loop cycle. Because reading local variables is actually faster than lookups for global ones. And calling a function in a loop is also more expensive than using the arithmetic operations. You can read about that [here](https://www.python.org/doc/essays/list2str/). – wenzul Oct 16 '14 at 16:41
So, if the function definition includes a default argument for a resource, that resource is *placed* in the function's local scope when the function is initially *interpreted* and that should have a slight performance improvement? ... I get about 2% converting a number that takes twenty iterations. ```divmod``` iterative versions are still beating your non-```divmod``` version - maybe calling a ```built-in``` isn't very expensive. Mine has three operations in the loop and yours has five... – wwii Oct 17 '14 at 17:02
@wwii I didn't get your point. It not depends on a default argument. Fixed arguments will be passed to a function like others. Try `python -m timeit "n=10/2;r=10%2"`, `python -m timeit "n,r = divmod(10,2)"` and `python -m timeit -s "dm=divmod" "n,r = dm(10,2)"` as an approximate comparison. If you are inside a function the lookup for built-ins will take more time than already in the globals like here. _10000000 loops: 0.0486 / 0.178 / 0.162 usec per loop_. It greatly depends on the loop count and length of input respectively. If this won't satisfy please give your test cases. – wenzul Oct 17 '14 at 20:06
I meant: ```def foo(n, s=s, dm = divmod):``` – wwii Oct 18 '14 at 00:02
@wwii No, default arguments are also slower than just leave it out and define it local. I meant `def foo(n): s = ...; dm = divmod; while...`. There is an example [here](https://www.python.org/doc/essays/list2str/)... – wenzul Oct 18 '14 at 01:09
@wwii Confusing. Actually, for me bin2base29 with exactly your code paste is the fastest also. Windows Python 2.7.6 64 Bit. `v1 4.94157717086, v2 4.86761055184, v3 4.89439396469, bin2base29 4.75006779537` – wenzul Oct 18 '14 at 14:06

mhawke · Answer 3 · 2014-10-15T14:33:41.797

0

If you are not averse to using a third-party package, numpy.base_repr() is a very convenient way to do your conversions:

import os
import numpy

def datatobase(data, base):
    n = int(data.encode('hex'), 16)
    return numpy.base_repr(n, base)

>>> data = os.urandom(32)
>>> data
'\xfcBs\x82\xa8&\x18\xaaK\x8c$\x0fZ\x95\xc0aA%\x93\x91\xcc\x8a\xa8\xfdbk\xeb\x14\x15\x06\xbag'

>>> datatobase(data, 29)
'A8FB42CHLNEIOOE75AG773EKGBA69QP89PANAF8ROH2GA1LF3CC5H'
>>> datatobase(data, 16)
'FC427382A82618AA4B8C240F5A95C06141259391CC8AA8FD626BEB141506BA67'

You'd need to profile to see whether this provides adequate performance for your application.

Update

Profiling shows that numpy.base_repr() is slower than the OP's implementation. This is because the numpy implementation is basically the same algorithm implemented in Python, with the addition of optional zero padding.

edited Oct 15 '14 at 14:33

answered Oct 15 '14 at 13:46

mhawke

84,695
9
117
138

Thanks, I wanted to edit my question because this is what I'm doing now. Gonna profiling soon to see if it's really better or not. – fbparis Oct 16 '14 at 04:50
It's 40% worse than the iterative one. – wenzul Oct 16 '14 at 05:44
@wenzul: The timings are greatly affected by the length of the binary string. On short strings your iterative algorithm is much faster, on longer strings (e.g. 256 bytes) it's actually slower. I find the OP's algorithm to be the fastest in most cases. One other thing, probably you are not including the cost of the conversion from a string of binary data to an int - this is worth approx 7% on a short string? – mhawke Oct 16 '14 at 10:28

What is the best way to convert raw binary data to a custom base in python?

3 Answers3