
Top Level Problem

Our team has inherited a very large and brittle Python 2 codebase (with C, C++, and a few other languages mixed in) that is very difficult and costly to update. It has tons of dependencies and very few tests. Adding behavioral improvements and converting to Python 3 have both turned out to be monumental tasks. Even when making small changes for a new release, we've had to revert many times because something broke.

It's a story of insufficient testing and the major technical debt that comes with it.

Still, the project is so big and so useful that it seems a no-brainer to update it rather than reinvent everything it does.

Sub Problem

How do we add the massive number of missing small tests? How can we automatically generate even simple input/output acceptance unit tests from the high-level user acceptance tests?

Attempted Solution

There are about 50 large, high-level behavioral tests that this codebase needs to pass. Unfortunately, running them all takes days, not seconds. These exercise all the code we care most about, but they are just too slow. (Also, a nerdy observation: about 80% of the same code is exercised in each one.) Is there a way to automatically generate the input/output unit tests by automatically examining the stack while these run?

In other words, I have high-level tests, but I would like to automatically create low-level unit and integration tests based on the execution of those high-level tests.
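
To illustrate the kind of recording I have in mind, here is a rough sketch using Python's built-in sys.setprofile hook to capture the arguments and return values of calls into one module while a high-level test runs (the module name mypackage.pricing and the output file are just placeholders):

```
import json
import sys

RECORDS = []
TARGET_MODULE = "mypackage.pricing"  # placeholder: the module we want unit tests for

def _profiler(frame, event, arg):
    # The profile hook fires on every call and return; on 'return', arg is the return value.
    if event != "return" or frame.f_globals.get("__name__") != TARGET_MODULE:
        return
    code = frame.f_code
    arg_names = code.co_varnames[:code.co_argcount]
    RECORDS.append({
        "function": code.co_name,
        # f_locals reflects values at return time, so rebound arguments
        # will not show their original values.
        "args": {name: repr(frame.f_locals.get(name)) for name in arg_names},
        "return": repr(arg),
    })

def record(run_high_level_test, out_path="recorded_calls.json"):
    """Run one slow high-level test under the profiler and dump what was observed."""
    sys.setprofile(_profiler)
    try:
        run_high_level_test()
    finally:
        sys.setprofile(None)
    with open(out_path, "w") as fh:
        json.dump(RECORDS, fh, indent=2)
```

Something along these lines would only be useful for functions whose arguments and return values are cheap to serialize and compare, which is part of what I'm unsure about.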

Mirroring the high-level tests with unit tests does exactly zero for added code coverage, but it does make the tests far faster and far less brittle, which will allow quick and confident refactoring of the individual pieces.

I'm very familiar with using TDD to prevent this kind of massive brittle blob in the first place (it actually speeds up development in many cases), but this is a unique beast of a problem to solve because the codebase already exists and "works" ;).

Any automated test tool tips? I've googled around a lot and found some things that may work for C, but I can't find anything for Python that generates pytest/unittest/nose tests. I don't care which Python test framework it uses (although I would prefer pytest). I must be searching for the wrong terms, because it seems unbelievable that a test generation tool doesn't exist for Python.

SwimBikeRun
  • Could you use a [proxy pattern](https://stackoverflow.com/questions/13756757/python-capture-method-call-and-parameters) to capture the calls to the parts of the Python code you want to test? There's a Slack channel on [python testing](http://pythontesting.net/slack/) which might be useful. – lloyd Apr 15 '19 at 00:43
  • I would start with an open source coverage tool and extend it to serialize all parameters as well as the return value of given methods. Then you can use a generic test class to call the methods with the previously serialized objects and compare the return value with your saved reference. It will not work for every method, but it could be a start. – Jens Dibbern Apr 18 '19 at 19:20
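
A minimal sketch of the proxy-pattern idea from the comments, with a hypothetical PriceCalculator standing in for whatever object sits at a boundary worth observing:

```
class RecordingProxy(object):
    """Wrap an object and record each method call with its arguments and result."""

    def __init__(self, target, log):
        self._target = target
        self._log = log  # a plain list shared with the test harness

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            result = attr(*args, **kwargs)
            self._log.append((name, args, kwargs, result))
            return result

        return wrapper

# Hypothetical usage inside a high-level test:
#   log = []
#   calculator = RecordingProxy(PriceCalculator(), log)
#   ... run the scenario ...
# 'log' then holds (method, args, kwargs, result) tuples to seed unit tests from.
```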

2 Answers


First, it is good that you already have some higher-level tests running. Parallelize their execution, run them on different hardware, and buy faster hardware if possible; as the refactoring task you are about to tackle seems huge, this will still be the cheapest way of doing it. Also consider breaking these higher-level tests down into smaller ones.

Second, as lloyd has mentioned, for the components you plan to refactor, identify each component's boundaries and, during execution of the higher-level tests, record the input and output values at those boundaries. With some scripting, you may be able to transform the recorded values into a starting point for unit test code. Only in rare cases will this immediately yield useful unit tests: normally you will need to do some non-trivial architectural analysis and probably some redesign:

  • What should the units under test be? Single methods, groups of methods, groups of classes? For example, setter methods cannot sensibly be tested without other methods, and to test any method at all a constructed object has to exist first, so some call to the constructor will be needed.
  • What are the component's boundaries? What are the depended-on components? With which of the depended-on components can you just live, and which would need to be mocked (see the small sketch after this list)? Many components can simply be used as they are - you would not mock math functions like sin or cos, for example.
  • What are the boundaries between unit tests, that is, at which points in the long-running tests would you consider a unit test to start and end? Which part of the recording counts as setup, which as execution, and which as verification?
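
For the mocking point above, here is a small sketch of what replacing a depended-on component in a unit test could look like; mypackage.report and its functions are hypothetical names, and on Python 2.7 you would use the backported mock package instead of unittest.mock:

```
# Sketch: stubbing out a slow or external depended-on component in a unit test.
# mypackage.report, build_and_send and send_email are hypothetical names.
from unittest import mock  # on Python 2.7: `import mock` from the backport package

from mypackage import report

def test_builds_summary_without_really_sending_mail():
    with mock.patch("mypackage.report.send_email") as fake_send:
        summary = report.build_and_send("2019-04")
        assert fake_send.call_count == 1   # the boundary was exercised exactly once
        assert "2019-04" in summary        # the unit's own result is still checked
```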

All these difficulties explain, to me, why generic tooling for this may be hard to find, and why you will probably be left writing purpose-built scripts for test code generation.
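
As a sketch of what such a purpose-built script might emit, or of a generic replay test, here is one possibility, assuming the recorded boundary values have been dumped to a JSON file of simple values; the file name, module names, and record layout are all assumptions:

```
# test_recorded.py sketch: replay recorded boundary values as parametrized unit tests.
# The file recorded_calls.json and its layout are assumptions; each record is expected
# to look like {"module": "...", "function": "...", "args": [...], "expected": ...}.
import importlib
import json

import pytest

with open("recorded_calls.json") as fh:
    CASES = json.load(fh)

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["function"])
def test_replays_recorded_behaviour(case):
    module = importlib.import_module(case["module"])
    func = getattr(module, case["function"])
    # Only meaningful for functions whose arguments and results survive a JSON
    # round trip; anything holding sockets, file handles, or rich objects still
    # needs hand-written tests and probably mocks.
    assert func(*case["args"]) == case["expected"]
```

Deciding which part of a recording counts as setup, execution, or verification still has to be done per the points above.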

Dirk Herrmann

I've gone with a very lazy, practical, imperfect solution, and it took me about 40 hours: 20 of those were spent wrapping my head around the C part well enough to write unit tests for it and fix it, which amounted to about 30 lines; the other 20 were spent fixing mostly trivial bytes/str issues that futurize couldn't possibly handle, and setting up CI.

  1. Run futurize
  2. Run the most desirable use case as an E2E test and fix the issues, adding new critical unit tests along the way
  3. Set up CI with tox on 2.7/3.x for these (a sketch of such a tox configuration is below)
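
A sketch of what that tox setup could look like; the environment names, dependencies, and test paths are placeholders for whatever the project actually needs:

```
# tox.ini sketch - run the suite on both interpreters during the transition.
# Environment names, dependencies, and test paths are placeholders; 'future'
# provides the runtime helpers that futurized code imports.
[tox]
envlist = py27, py37

[testenv]
deps =
    pytest
    future
commands =
    pytest tests/unit tests/e2e_critical
```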

The end result is an unchanged 2.7 codebase and a minimally working beta 3.7 codebase; the long tail of 3.7 support for secondary use cases will be tackled over time (see Dirk's answer for the long-term approach).

SwimBikeRun