77

I've seen How can you find unused functions in Python code? but that's really old, and doesn't really answer my question.

I have a large Python project with multiple libraries that are shared by multiple entry-point scripts. This project has been accreting for many years with many authors, so there's a whole lot of dead code. You know the drill.

I know that finding all dead code is undecidable. All I need is a tool that will find all functions that are not called anywhere. We're not doing anything fancy like calling functions based on the string of the function name, so I'm not worried about anything pathological.

I just installed pylint, but it appears to be file-based: it doesn't pay much attention to inter-file dependencies, or even to dependencies between functions.

Clearly, I could grep for def in all of the files, get all of the function names from that, and then grep for each of those names. I'm just hoping that there's something a little smarter than that out there already.

ETA: Please note that I don't expect or want something perfect. I know my halting-problem proof as well as anyone (no, really, I taught theory of computation; I know when I'm looking at something that is recursively enumerable). Anything that tries to approximate it by actually running the code is going to take way too long. I just want something that syntactically goes through the code and says "This function is definitely used. This function MIGHT be used. This function is definitely NOT used: no one else even seems to know it exists!" And the first two categories aren't important.

Brian Postow
  • There already is one, that's been open for 2.5 years. Mainly, I think, because they know that in the hardest case, it is mathematically impossible. – Brian Postow Mar 01 '12 at 22:11
  • Grepping sounds like it might be enough for you. Or maybe using [`parser`](http://docs.python.org/library/parser.html) would give you finer control. – Peter Wood Mar 01 '12 at 23:20

7 Answers

45

You might want to try out vulture. It can't catch everything due to Python's dynamic nature, but it catches quite a bit, without needing the full test suite that coverage.py and similar tools require in order to work.
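For reference, a typical invocation looks something like the following (the exact flags and report format vary by vulture version, and the paths here are placeholders for your own package directories):

```shell
# Install vulture and point it at the code base; it prints one line per
# suspected-unused function, class, variable, or import, with the file
# and line number where it was defined. Paths are placeholders.
pip install vulture
vulture myproject/ scripts/
```

Since vulture's analysis is conservative, anything it reports is a strong candidate for the "definitely NOT used" category the question asks for.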

Keith Gaughan
  • This is perfect; glad a real answer was finally given. Vulture does the conservative dead-code analysis the original asker was looking for. – deontologician Sep 30 '13 at 20:19
  • vulture seems great, but it doesn't really work with django... and unfortunately the django-coverage plugin is so old it wants long-since defunct dependencies. :( – szeitlin Mar 09 '16 at 19:47
16

Try running Ned Batchelder's coverage.py.

Coverage.py is a tool for measuring code coverage of Python programs. It monitors your program, noting which parts of the code have been executed, then analyzes the source to identify code that could have been executed but was not.

Peter Wood
  • That would involve running the code in all possible configurations. I don't want to have to do that. I don't want a complete list of all dead code; I just want a fast and dirty approximation. Leaving some dead code around is fine. – Brian Postow Mar 01 '12 at 22:38
  • "Leaving some dead code around is fine" – I don't really think that is EVER fine. – madCode Aug 19 '13 at 16:29
  • I would argue that leaving dead code is not fine, but that avoiding it entirely is hardly achievable. – Patrick Bassut Jan 23 '17 at 19:35
9

It is very hard to determine which functions and methods are called without executing the code, even if the code doesn't do any fancy stuff. Plain function invocations are rather easy to detect, but method calls are really hard. Just a simple example:

class A(object):
    def f(self):
        pass

class B(A):
    def f(self):
        pass

a = []
a.append(A())  # the list holds instances of both classes
a.append(B())
a[1].f()       # resolves to B.f at runtime; a static analyser can't easily tell

Nothing fancy is going on here, but any script that tries to determine whether A.f() or B.f() is called will have a rather hard time doing so without actually executing the code.

While the above code doesn't do anything useful, it certainly uses patterns that appear in real code -- namely putting instances in containers. Real code will usually do even more complex things -- pickling and unpickling, hierarchical data structures, conditionals.

As stated before, just detecting plain function invocations of the form

function(...)

or

module.function(...)

will be rather easy. You can use the ast module to parse your source files. You will need to record all imports, and the names used to import other modules. You will also need to track top-level function definitions and the calls inside these functions. This will give you a dependency graph, and you can use NetworkX to detect the connected components of this graph.

While this might sound rather complex, it can probably be done in less than 100 lines of code. Unfortunately, almost all major Python projects use classes and methods, so it will be of little help.
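The plain-call detection described above can be sketched in a few lines with the `ast` module (illustrative only: it handles top-level `def`s and bare `name(...)` calls in a single source string, ignores imports and methods, and skips the NetworkX step, since a simple set difference suffices for direct calls):

```python
# Sketch: find top-level functions that are never called by name.
# Note that entry points (like main below) also show up as "uncalled".
import ast

source = """
def used():
    pass

def unused():
    pass

def main():
    used()
"""

tree = ast.parse(source)

# All top-level function definitions.
defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}

# All names appearing as the target of a plain call anywhere in the tree.
called = {
    node.func.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
}

print(sorted(defined - called))  # -> ['main', 'unused']
```

Extending this across files means recording imports and building the dependency graph described above; that is where most of the ~100 lines would go.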

Sven Marnach
  • The discussion of classes and heterogeneous lists is a good one. I think that currently I'm just going to say "function f gets used", so I'm going to assume that both are necessary. And really, if I have A and B which both have function f, and they AREN'T in the same heterogeneous list or used interchangeably, then I have a problem with function naming... – Brian Postow Mar 05 '12 at 16:18
  • Not sure if I'd really call this a problem with function naming. To give an example from the standard library: `dict.get()`, `queue.Queue.get()` and `pickle.Pickler.get()` are completely unrelated. Somehow the whole point of namespaces is to allow using the same name for different things. – Sven Marnach Mar 05 '12 at 16:32
  • Ok, fair enough. I guess I'm assuming that things with standard names like get, set, equals, init, etc, will all get used somewhere, so I'm not really worried about those. But yes, you are correct. – Brian Postow Mar 06 '12 at 16:48
6

Here's the solution I'm using at least tentatively:

grep -h 'def ' *.py > defs
# ...
# edit defs so that it contains just the function names
# ...
for f in $(cat defs); do
    echo "$f" >> defCounts
    cat *.py | grep -c "$f" >> defCounts
    echo >> defCounts
done

Then I look at the individual functions that have very few references (say, fewer than 3).

It's ugly, and it only gives me approximate answers, but I think it's good enough for a start. What are you-all's thoughts?
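For what it's worth, the same textual counting can be done in a few lines of Python, with word-boundary matching to avoid counting substring hits (a sketch only; the `sources` dict is a stand-in for reading your real files from disk):

```python
# Sketch: count whole-word occurrences of each defined function name
# across a set of Python sources. A name that appears only once (its own
# definition) is likely dead. Purely textual, like the shell version.
import re

sources = {
    "a.py": "def helper():\n    pass\n\ndef dead():\n    pass\n",
    "b.py": "from a import helper\nhelper()\n",
}

all_text = "\n".join(sources.values())

# Collect function names from lines of the form "def name(...":
names = re.findall(r"^\s*def\s+(\w+)", all_text, flags=re.MULTILINE)

# Count whole-word occurrences of each name anywhere in the sources.
counts = {n: len(re.findall(r"\b%s\b" % n, all_text)) for n in names}
suspects = [n for n, c in counts.items() if c < 2]
print(suspects)  # -> ['dead']
```

Like the shell version, this is only an approximation: it knows nothing about scopes or imports, but it is fast and errs on the side of reporting too little rather than too much.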

Sven Marnach
Brian Postow
  • I should mention that the for loop is in bash syntax. – Brian Postow Mar 02 '12 at 16:14
  • note: this code completely doesn't work for a very large number of reasons.. not the least of which is that it's got syntax errors and is grepping for the wrong thing (should be just the function name, etc.) ... don't run it. but it's still a decent idea... simple string counting, etc. – Erik Aronesty Sep 11 '19 at 21:19
  • I said it's only approximate.... – Brian Postow Sep 11 '19 at 21:21
  • Modified version of script. nothing manual - https://gist.github.com/peeyushsrj/87f90919e489994572eafaf75c10c9d5 – peeyushsrj Dec 19 '19 at 18:28
4

With the following line you can list all function definitions that are obviously not used as an attribute, a function call, a decorator, or a return value, so it is approximately what you are looking for. It is not perfect and it is slow, but I never got any false positives. (On Linux you may have to replace ack with ack-grep.)

for f in $(ack --python --ignore-dir tests -h --noheading \
        "def ([^_][^(]*).*\):\s*$" --output '$1' | sort | uniq); do
    c=$(ack --python -ch "^\s*(|[^#].*)(@|return\s+|\S*\.|.*=\s*|)"'(?<!def\s)'"$f\b")
    [ "$c" == 0 ] && (echo -n "$f: "; ack --python --noheading "$f\b")
done
diefans
1

If you have your code covered with a lot of tests (which is quite useful in general), run them with a code-coverage plugin and you can then see the unused code. :)

yedpodtrzitko
  • Except you might have tests for the dead code, i.e. it's not used anywhere in the rest of the system. – John La Rooy Mar 01 '12 at 22:27
  • I have very few tests within the code. Adding unit tests is something that I'm supposed to do after I figure out what actually needs to be tested... if I test all of the dead code, I'll be effectively bringing it back to life, which is not what I want. – Brian Postow Mar 01 '12 at 22:40
1

IMO that could be achieved pretty quickly with a simple pylint plugin that:

  • remembers each analysed function / method (/ class?) in a set S1
  • tracks each called function / method (/ class?) in a set S2
  • displays S1 - S2 in a report

Then you would have to run pylint over your whole code base to get something that makes sense. Of course, as said, the result would need to be checked, as there may have been inference failures or the like that introduce false positives. Anyway, that would probably greatly reduce the number of greps to be done.

I don't have much time to do it myself yet, but anyone attempting it can find help on the python-projects@logilab.org mailing list.

sthenault
  • FWIW I've written pylint plugins in the past (not to do this, but to do other things) and while not trivial (you write assertions on the examined code's AST) they were surprisingly easier to get working than I would have expected. – Jonathan Hartley Oct 23 '17 at 17:22