6

I am trying to reproduce in Python two examples (originally written in Java) that I found in a book.

The two functions check if a string contains repeated characters. The first function uses an integer (checker) as a bit vector, while the second function simply uses a list of booleans. I was expecting to have a better performance using the function with bits, but actually it performs worse.

Why is that? Did I write something wrong while "translating" from Java to Python?

Note: for simplicity we only use lowercase letters (a to z), especially for the bit vector function.

import sys
import timeit

def is_unique_chars_bit(my_str):
    checker = 0
    for char in my_str:
        val = ord(char) - ord('a')
        if ((checker & (1 << val)) > 0):
            return False
        checker |= (1 << val)
    return True

def is_unique_chars_list(my_str):
    if len(my_str) > 128:
        # Supposing we use ASCII, which only has 128 chars
        return False
    char_list = [False] * 128
    for char in my_str:
        val = ord(char)
        if char_list[val]:
            return False
        char_list[val] = True
    return True

if __name__ == '__main__':
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    t_bit = timeit.Timer("is_unique_chars_bit('"+ alphabet +"')", "from __main__ import is_unique_chars_bit")
    t_list = timeit.Timer("is_unique_chars_list('"+ alphabet +"')", "from __main__ import is_unique_chars_list")
    print(t_bit.repeat(3, 200000))
    print(t_list.repeat(3, 200000))

Results:

[1.732477278999795, 1.7263494359995093, 1.7404333820004467]
[0.6785205180003686, 0.6759967380003218, 0.675434408000001]

The original Java functions are the following:

boolean isUniqueCharsBoolArray(String str) {
    if (str.length() > 128) return false;

    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}

boolean isUniqueCharsBits(String str) {
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) -'a';
        if ((checker & (1 << val)) > 0) {
            return false;
        }
        checker |= (1 << val);
    }
    return true;
}
Tamas Hegedus
  • 28,755
  • 12
  • 63
  • 97
Kurt Bourbaki
  • 11,984
  • 6
  • 35
  • 53

1 Answers1

5

That is because integers are immutable reference classes in python. That means integer operations are slow in general. (This is true even for python2 ints) Look at the following line:

checker |= (1 << val)

If we expand the assignment we get:

checker = checker | (1 << val)

This single line allocates two new integers in memory. One for 1 << val and one for the bitwise or.

On the other hand, assigning an array element does not need to allocate objects, and that's the reason it's faster.

If you are searching for the fastest way to determine if a string has duplicate chars, this function is shorter and faster than the previous two (taken from "check duplicates in list"):

def is_unique_chars_set(my_str):
    return len(my_str) != len(set(my_str))

Timeit shows a 3x speedup (the last line is the new one):

>python3 test.py
[2.398782472571393, 2.3595238689519626, 2.37358552995787]
[1.0055455609592512, 1.007462347465074, 1.012826469701654]
[0.32564058749026437, 0.3303359144351621, 0.31772896318809885]

Note: The results may vary heavily if you use another python runtime, such as IronPython

Community
  • 1
  • 1
Tamas Hegedus
  • 28,755
  • 12
  • 63
  • 97
  • Thank you for providing an additional method: I did not consider it since I was only translating the two Java methods above, but it could be useful for reference. You said that *"mutating an array does not allocate objects at all"*; could you please give some reference about this? Are you talking about arrays in general, or just this specific list (in which we only change True/False values)? Moreover: is this supposed to produce opposite performance in Java? – Kurt Bourbaki Jan 07 '16 at 13:04
  • @KurtBourbaki (In the meantime I was trying to launch a line-by-line profiler, but didn't succeed.) I was talking about lists in general, and I meant item assignment. (`[].append` can result in memory allocation, but that is hidden by the list object itself) – Tamas Hegedus Jan 07 '16 at 13:13
  • So this is faster because we assign `True` or `False` values instead of allocating new integers, right? Do you know something about Java? E.g.: would the bit vector be more efficient than the boolean array? – Kurt Bourbaki Jan 07 '16 at 13:48
  • @KurtBourbaki I don't think there is a huge difference in access time between java `BitSet` and `boolean[]` (see http://stackoverflow.com/questions/605226/boolean-vs-bitset-which-is-more-efficient, it says boolean[] is faster). The main advantage of the earlier is using less memory, and has some neat bitwise operations predefined. – Tamas Hegedus Jan 07 '16 at 13:54