Is it possible to check for source code equality of two functions by ignoring comments, formatting and name of the functions?

Question

I've been using my_function.__code__.co_code to check whether two functions have the same source code or not (as suggested e.g. here). However, it turns out that this doesn't do exactly what I was expecting. Let's take this example.

def f(a, b):
    return a
    return a


def g(a, b):
    return a


if __name__ == '__main__':
    print(f.__code__.co_code)
    print(g.__code__.co_code)
    print("f.__code__.co_code == g.__code__.co_code?",
          f.__code__.co_code == g.__code__.co_code)

The functions are logically equivalent, but they have different code, so I was expecting f.__code__.co_code == g.__code__.co_code to return False, but it returns True. This is probably because Python understands that f can be simplified to g.

However, would there be a way to stop Python from doing that, i.e. if the functions really contain different code (although they may be logically equivalent), f.__code__.co_code == g.__code__.co_code returns False? Or would there be any solution to my problem?

Essentially, I am looking for a function/property that would return a (preferably hashable) representation of a function, which, if compared to the representation of another function, would be different, unless the functions contain exactly the same code. It's fine that the functions have different names, i.e. if they have different names, but the same code, it should return True (the functions are equal).

f is g is not what I am looking for. It would return False for these functions

def f(a, b):
    return a


def g(a, b):
    return a

However, they contain the same code!

I would also like to note that I want spaces and comments to be ignored. So, the following two functions would be equal

def   f(  a , b):
    """Good job... hm, it was a good day..."""
    return a
    # NICE


def g(a,    b  ):
    #    HELLLOOOOOOO
    return         a

There isn't a bytecode to represent the `pass` keyword, since it only exists in the grammar to prevent function bodies from being empty. That isn't strictly necessary in the first place, but it was felt that requiring this would help avoid errors overall. So the functions not only "are logically equivalent", but cannot possibly be compiled to anything different, even without optimizations. — Karl Knechtel, Aug 03 '21 at 13:48
@KarlKnechtel Ok, but this problem happens with other functions. I will provide another example so that we don't discuss this issue. That's not really my issue. — nbro, Aug 03 '21 at 13:49
"Essentially, I am looking for a function that would return a preferably hashable representation of a function" What *problem are you hoping to solve* in this way? — Karl Knechtel, Aug 03 '21 at 13:49
@KarlKnechtel I cannot give details of the main problem I am trying to solve (as it's a research problem), but I need to be able to solve this intermediate problem, if possible. — nbro, Aug 03 '21 at 13:53
will `inspect.getsource` help? https://stackoverflow.com/questions/427453/how-can-i-get-the-source-code-of-a-python-function — Yulia V, Aug 03 '21 at 13:53
@YuliaV `inspect.getsource` returns you the source code, but I already have the source code (and I can even read that into a string by reading the module, i.e. what `inspect.getsource` does in the end)! That's not my problem. — nbro, Aug 03 '21 at 13:54
Can you please clarify what you mean by "code"? Do you mean bytecode, source code, or something else like the corresponding abstract syntax tree? Do you have access to the source code of all functions you want to check? — MisterMiyagi, Aug 03 '21 at 13:59
@MisterMiyagi I meant source code. I have access to the source code of the functions. See my last edit. — nbro, Aug 03 '21 at 14:01
Erm, okay. Are you absolutely sure? Comparing source code does imply being sensitive to things like whitespace, comments, alignment and such – at least two of which the last part excludes. The requirement seems closer to an AST. — MisterMiyagi, Aug 03 '21 at 14:03
@MisterMiyagi Well, essentially, the function that I am looking for would first get rid of all comments and format the functions in the same way, then compare the source code. `__code__.co_code` does that, but, again, it does even more, which I didn't want. — nbro, Aug 03 '21 at 14:04
@nbro You probably want the AST, which is a representation of the code after parsing, but before code generation. (The AST for `f` would include nodes for both `return` statements, but the code generator does indeed determine that the second is unreachable and doesn't generate any code for it.) — chepner, Aug 03 '21 at 14:12
@nbro Please take a moment to review some of the terms mentioned here, and specify *exactly* what you want. Be mindful of corner cases, e.g. is ``def f(x): return x and f(x-1)`` equal to ``def g(x): return x and g(x-1)`` or ``def g(x): return x and f(x-1)``? What you want to do is likely possible without much effort, but the approach will vary wildly depending on your exact requirements. — MisterMiyagi, Aug 03 '21 at 14:16
@chepner I've been using Python's `ast` module for a while. It would complicate my task. The only solution right now that would work is: 1. get the source code of the two functions, 2. remove all comments, 3. make all variables with different names at the same position with the same names, 4. reformat the function so that there aren't additional spaces, etc., 5. make the name of the two functions equal, compare the resulting strings. — nbro, Aug 03 '21 at 14:17
But this would mean that I need to do some more work to implement there. I was wondering if someone has already done that, i.e. there exists a function already implemented that would do this for me. — nbro, Aug 03 '21 at 14:17
@MisterMiyagi Those would be different functions. Different source code. If you call a different function, that means different source code. — nbro, Aug 03 '21 at 14:18
@nbro *Which* of those would be different functions? According to your current requirements, there should be at least one pair that is equal. — MisterMiyagi, Aug 03 '21 at 14:18
So you don't want to *ignore* the name of the functions, you want to *normalize* them? I.e. you are looking purely for the logical meaning of the function? — MisterMiyagi, Aug 03 '21 at 14:20
@MisterMiyagi I want to ignore the names of the functions, but, in your 3rd example, `f` calls `g`, which may be different from `f`. If `f` and`g` are equal (e.g. an alias), then yes you could rename `f` to `g` or viceversa. But in your examples you defined `g` twice. — nbro, Aug 03 '21 at 14:22
In the third example, ``g`` calls ``f``. The point is that the first ``g`` has the same *meaning* as ``f`` while the second has the same *content*, and your description currently doesn't say which of these the "ignore the name" rule cares about. — MisterMiyagi, Aug 03 '21 at 14:25
@MisterMiyagi The third does not really have the same content as it calls f, which is different from `g` there, unless you defined `f = g` as an alias. In that case, yes, they would be equal. But you can ignore that these things will happen (at least for my case). — nbro, Aug 03 '21 at 14:29
Side note although the OP probably won't care about this anymore, using AST will make `1+2*3` equivalent to `1+(2*3)`. — user202729, Jun 18 '22 at 10:29

chepner · Answer 1 · 2021-08-03T14:40:45.980

Code isn't generated directly from source code. The source code is first parsed to produce an abstract syntax tree (AST). The AST is used to generate the byte code.

>>> print(ast.dump(ast.parse("""def f(a,b):
...   return a
...   return a"""), indent=2)
... )
Module(
  body=[
    FunctionDef(
      name='f',
      args=arguments(
        posonlyargs=[],
        args=[
          arg(arg='a'),
          arg(arg='b')],
        kwonlyargs=[],
        kw_defaults=[],
        defaults=[]),
      body=[
        Return(
          value=Name(id='a', ctx=Load())),
        Return(
          value=Name(id='a', ctx=Load()))],
      decorator_list=[])],
  type_ignores=[])

You'll have to write your own code to do any node-by-node comparisons of two ASTs. The ast module provides a NodeVisitor class to assist in traversing a tree.

Taking a step back: you aren't interested in whether two pieces of code are identical, only if they are isomorphic under some structure-preserving transformation. Parsing is fine, as the transformation from source code to AST preserves the structure you want preserved. Code generation, however, does not preserve the structure when transferring the AST into a linear sequence of byte codes. (It only preserves the semantics.) For instance, it will do dead code elimination, constant folding, etc.

So what you want is a very specific transformation of one AST to another, normalized AST. Once you have normalized your two ASTs, you can compare them.

One transformation you expressed interest in was normalization of variable names (for example, def f(): pass and def g(): pass should be considered isomorphic). Something you may be interested in is applying de Bruijn numbering to the variables, which assigns a unique integer to each unique variable, regardless of the variables names.

There may be other similar transformations you are interested in. Unfortunately, you are going to have to write (or research) these yourself.

Is it possible to check for source code equality of two functions by ignoring comments, formatting and name of the functions?

1 Answers1