1

I'm doing a few algorithm questions and ran into a strange issue in one of my test cases. The get_num_groups function returns an integer that the assert statements verify.

The first assert statement seems to pass intermittently after I introduced the line list(set(words)) in the function.

Code:

def are_similar(w1, w2):
    '''
        returns true if swapping the positions of exactly 2 characters
        in a word makes the given words equal
    '''
    if len(w1) != len(w2):
        return False

    mis_matches = 0
    locs = []
    n = len(w1)

    for i in range(n):
        if w1[i] != w2[i]:
            mis_matches += 1
            locs.append(i)

    if mis_matches != 2:
        return False

    # check if the 2 letter swap will result in equal words
    return w1[locs[0]] == w2[locs[1]] and w1[locs[1]] == w2[locs[0]]


def get_group(graph, n, word_i):
    # for a word at the word_i'th position, get all words connected to it

    group = set()
    stack = [word_i]

    while stack:
        i = stack.pop()

        # view all nodes connected directly/indirectly to i'th node
        for neigh in range(i, n):
            if graph[i][neigh] != 0:
                group.add(neigh)
                stack.append(neigh)
                graph[neigh][i] = 0
                graph[i][neigh] = 0

    return group


def get_num_groups(words):
    '''
        - creates a graph of words that are connected if similar.
        - does a dfs/bfs traversal to collect count of the disconnected sub graphs
    '''
    words = list(set(words)) # causes a strange intermittent bug. in first testcase
    n = len(words)
    # n X n adj matrix
    graph = [[0] * n for _ in range(n)]

    # create graph
    for i in range(n):
        graph[i][i] = 1 # word is connected to itself
        wi = words[i]
        for j in range(i+1, n):
            wj = words[j]
            s = are_similar(wi, wj)
            if s:
                graph[i][j] = 1
                graph[j][i] = 1

    num_groups = 0
    for i in range(n):
        g = get_group(graph, n, i)
        if g:
            num_groups += 1

    return num_groups


if __name__ == '__main__':
    # === Test ===
    wds = ["tars", "rats", "star", "arts"]
    num_ways = get_num_groups(wds)

    inp = (wds)
    res = (num_ways)
    expected = (2)
    assert res == expected, '[Fail] {} -> {}\nexpected:{}'.format(inp, res, expected)
    print('[+]')

    # === Test ===
    wds = []
    num_ways = get_num_groups(wds)

    inp = (wds)
    res = (num_ways)
    expected = (0)
    assert res == expected, '[Fail] {} -> {}\nexpected:{}'.format(inp, res, expected)
    print('[+]')

    # === Test ===
    wds = ["ss", "ss"]
    num_ways = get_num_groups(wds)

    inp = (wds)
    res = (num_ways)
    expected = (1)
    assert res == expected, '[Fail] {} -> {}\nexpected:{}'.format(inp, res, expected)
    print('[+]')

    # === Test ===
    wds = ["ssa", "ssb"]
    num_ways = get_num_groups(wds)

    inp = (wds)
    res = (num_ways)
    expected = (2)
    assert res == expected, '[Fail] {} -> {}\nexpected:{}'.format(inp, res, expected)
    print('[+]')

Unexpected output when run with command while true; do python3 file.py; done:

+]
[+]
[+]
[+]
Traceback (most recent call last):
  File "p.py", line 110, in <module>
    assert res == expected, '[Fail] {} -> {}\nexpected:{}'.format(inp, res, expected)
AssertionError: [Fail] ['tars', 'rats', 'star', 'arts'] -> 3
expected:2
Traceback (most recent call last):
  File "p.py", line 110, in <module>
    assert res == expected, '[Fail] {} -> {}\nexpected:{}'.format(inp, res, expected)
AssertionError: [Fail] ['tars', 'rats', 'star', 'arts'] -> 3
expected:2
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]
[+]

It is extremely unlikely, but, could there be a bug in python with the way list(set(<some array>)) is done?

krk
  • 19
  • 4
  • 2
    'It is extremely unlikely, but, could there be a bug in python with the way `list(set())` is done?" It is possible, since human beings wrote Python and human beings are fallible. It is much more likely that the bug is in your code rather than in the implementation of such fundamental Python built-ins. – John Coleman Dec 06 '19 at 02:01
  • Thanks @JohnColeman. I'm not using any threading or parallel computation. The input/expected output are static. Not sure why the behavior is not consistent between runs. – krk Dec 06 '19 at 21:39
  • sets don't have a deterministic order. I think that under the hood `set()` depends on hashing which in Python 3 is nondeterministic. The way you are calling the code probably involves reseeding the underlying random number generator. My guess is that there is a bug which only shows up in some order of iteration of the set, and this order differs between runs. It is an interesting question. – John Coleman Dec 06 '19 at 22:13
  • Is there a way to get a Python maintainers' attention to this issue? Their prospective would be very valuable. (ps. I assume I'm doing something wrong. Just don't know what it is) – krk Dec 07 '19 at 03:08
  • Printing the value of words before and after `words = list(set(words))` shows that `list` and `set` work as expected - the same words are returned, but the order may have changed. – snakecharmerb Dec 07 '19 at 08:03
  • 1) The Python maintainers are aware of this feature, which they do not regard as a bug. It was a design decision of theirs to change their hash function and not an oversight. 2) sets are not an ordered type. If your code was implicitly assuming something about the order of elements in `list(set(words))` then your code is buggy. If your goal is to remove duplicates but retain order of first appearance, don't use `list(set(words))` to do so. 3) Why loop your code at the command line? You could wrap the main function in a Python loop and run it all in one Python process. – John Coleman Dec 07 '19 at 12:41
  • If I am correct that the intention was to remove duplicates but otherwise retain order, see [python - How do you remove duplicates from a list whilst preserving order?](https://stackoverflow.com/q/480214/4996248) for some alternatives to `list(set(words))` – John Coleman Dec 07 '19 at 12:46
  • The program does not assume the order will be maintained. The high level idea is to build a graph of "connected" words (2 words are connected if `are_similar` returns true) and then return the number of disconnected subgraphs. Unless I'm missing something about implicitly assumed ordering, I don't see why the `list(set(arr))` option would change behaviour. ps. looping the the bash command to show the program behaviour is not consistent. The intended algorithm is only meant to execute once on each test. – krk Dec 09 '19 at 04:25

0 Answers0