
At my company, we have a growing set of integration tests using JUnit in a Java web application. Each test uses specific external XML files to populate the database with the data needed for the test. The problems are:

  1. When the model changes, we take a long time to correct all the XML files (we have hundreds of them, many with redundant data).
  2. The complexity of creating an XML file manually discourages programmers from exploring different scenarios.
  3. There is no link between the test data and the test (e.g. in the test I don't know the name of the User inserted by the XML). We could hard-code the information we need, but that would also increase the maintenance effort to keep the XML and the hard-coded data synchronized.

Facing this problem, I started thinking about using the system's own CRUD operations to generate the test data for each test. At the beginning of each test I would run some methods to persist the desired data. In my view, this would solve all three problems, since:

  1. Changes to the model require changing the CRUD anyway, so correcting the test data would take no extra time.
  2. It would be easier to build test data, because we would not have to worry about things like matching entity ids and foreign keys manually.
  3. I would have all the important data in variables, with the synchronization guaranteed by the IDE.

But I lack the experience and knowledge to start this approach. The questions are: Is this solution effective? Does this approach cause other problems? Where can I find this approach in the literature? Is there a better solution to the listed problems?
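A minimal sketch of the proposed approach, using a hypothetical in-memory `UserDao` standing in for the application's real CRUD layer (all class, method, and field names here are illustrative, and the JUnit wiring is omitted to keep the sketch self-contained):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical entity; not taken from the actual application model.
class User {
    final String name;
    User(String name) { this.name = name; }
}

// Stand-in for the application's real persistence layer.
class UserDao {
    private final List<User> table = new ArrayList<>();
    User save(User u) { table.add(u); return u; }
    User findByName(String name) {
        for (User u : table) if (u.name.equals(name)) return u;
        return null;
    }
}

class UserReportTest {
    UserDao dao = new UserDao();

    // Setup through the CRUD layer: the test holds a direct reference to
    // the inserted data, so there is no XML file to keep in sync.
    User setUpData() {
        return dao.save(new User("alice"));
    }

    boolean testUserIsFound() {
        User inserted = setUpData();
        // The "link between test data and test" is just a Java variable.
        return dao.findByName(inserted.name) == inserted;
    }
}
```

The point of the sketch is the third problem from the list: the test reads `inserted.name` instead of duplicating a string that also lives in an XML file.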

JabberwockyDecompiler
    Might be better question for programmers.stackexchange – NESPowerGlove Apr 20 '15 at 20:33
    I have an answer to point 3: create XML files which encapsulate both the test data and the expected outcome. I would also suggest taking a look at TestNG, which supports feeding multiple test cases into a single test method. – biziclop Apr 20 '15 at 20:33
  • Store the test data in the database and generate the XML from that. It's easier to modify the data in SQL and then create the XML. – Mike Apr 20 '15 at 20:37
  • Why aren't you working with objects/classes that generate the XML files? That way, any change in the model only requires updating the class, which fixes all tests with a single change. – Mzf Apr 20 '15 at 20:37
  • This is almost a good question for [Programmers.SE](http://programmers.stackexchange.com/) but is too broad. With some editing to hone its focus it would be a good candidate for migration. –  Apr 20 '15 at 20:38
  • Another thing is that if you have fragile tests, you're probably testing too much in one go. By the way, [this](http://xunitpatterns.com/Fragile%20Test.html) is a good starting point for diagnosing your test problems. – biziclop Apr 20 '15 at 20:39
  • @biziclop - That's less likely to be true for integration tests. – antiduh Apr 20 '15 at 20:43
  • @antiduh Integration tests can be too eager too. But there could be plenty of other reasons of course. – biziclop Apr 20 '15 at 20:51

2 Answers


It sounds like your existing system uses something like DBUnit: the tests start with a clean database, a setup step loads data from one or more XML files into the database, and then the test executes against that data.

Here are some of the benefits for this kind of approach:

  • If you do have a problem with the CRUD layer, it won't impact the data setup. When something goes wrong you get one test failure per error, not a failure in every test whose setup happens to use the broken code.

  • Each test can be very explicit about exactly what data is needed to run it. With a domain model, things like optional associations and lazy loading can make it uncertain which objects actually get loaded. (Here I'm thinking especially of Hibernate, where the consequences of a mapping are often complicated.) By contrast, if the data is set up declaratively, stating which rows go in which table, the starting state is explicit.
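For instance, a DBUnit-style flat XML dataset states the starting rows directly (the table and column names here are made up for illustration):

```xml
<!-- Each element names a table; attributes map to columns. -->
<dataset>
    <users  id="1"  name="alice" />
    <orders id="10" user_id="1" total="25.00" />
</dataset>
```

There is no mapping layer between this file and the database: what you see is the exact starting state of the test.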

Keeping tests simple, explicit, and minimally coupled to other parts means there's less to figure out and less to go wrong. If your tests get so complicated that a problem is more likely to be in the test than in the code under test, people will get discouraged from running and updating them.

With DBUnit you can write a script to automate creating your XML from the database contents, so you can recreate the state you need and save it as XML. There shouldn't be any need to generate your test data manually.

It is possible for test data to become fragmented and hard to update, especially if it has been created in an ad-hoc fashion with no thought for reuse. You might consider going back through the tests and breaking up test setup data into pieces that you can reuse.

The pain points you describe don't seem to me like they require extreme measures like redoing all your test setups. Even if you do, you'll still want to refactor your test data. Maybe use a smaller project as a proving ground for bigger changes, and make small incremental changes to most of the existing code.

Nathan Hughes
  • I don't get your second bullet point: Why would associating data with a single test be harder in Java than in XML? – meriton Apr 20 '15 at 21:19
  • @NathanHughes, thank you for answering. The main issue you cited, for me, was that a problem in the CRUD would affect many tests. To mitigate this, I was thinking of making every CRUD operation used to build data a test in itself, so each test would run a bunch of other "CRUD tests" in order to build its test data. I think this would give me precision in locating the cause of broken tests, and more coverage, although I know I'm breaking the rule of test independence. What do you think about this? And redoing all the test setups is too extreme, as you said; I was thinking of applying this to new tests. – André Queiroz Apr 22 '15 at 00:40
  • @André: sounds ok. Start small and see how it goes. – Nathan Hughes Apr 22 '15 at 13:18

The key to maintainability is keeping things DRY. Test data setup should not be redundant, and if your test technology offers no effective means of reuse, you are using the wrong technology.

Writing Java code for the test data setup gives you familiar and good tools to improve code reuse across tests. It also offers better refactoring support than XML, and makes the link between test data and test code explicit, because that's in the very same source file (or even the same method!). However, it does require tests to be written and maintained by programmers (not business analysts, managers, or testers that do not know Java).
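As an illustration of the kind of reuse this enables, a small test-data builder (the class and field names here are assumptions, not from the question) lets each test state only the attributes it cares about, while model changes are absorbed in one place:

```java
// Illustrative entity; not taken from the actual application model.
class Customer {
    final String name;
    final boolean active;
    Customer(String name, boolean active) { this.name = name; this.active = active; }
}

// A test-data builder: sensible defaults live here, so when the model
// changes, the fix happens once, in this class, instead of in hundreds
// of XML files.
class CustomerBuilder {
    private String name = "any-name";
    private boolean active = true;

    CustomerBuilder named(String name) { this.name = name; return this; }
    CustomerBuilder inactive() { this.active = false; return this; }

    Customer build() { return new Customer(name, active); }
}
```

A test that only cares about inactivity writes `new CustomerBuilder().inactive().build()` and ignores everything else, which keeps the test's intent visible.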

Therefore, if test data is mostly authored and maintained by programmers, I'd do so in Java, through the CRUD layer (or even a full-fledged domain layer) of the real application. If however most test data originates from some data export, or is authored by people who are not programmers, a purely data-driven approach can be a better fit. It is also possible to combine these approaches (i.e. choose the most appropriate strategy for each entity).

Personal experience: Our team used to do integration tests with DBUnit, but have switched to setting up the test data as part of the test code by using our real data access layer. In so doing, our tests became more intention revealing and easier to maintain. Test effort was reduced, but test coverage was improved, and more tests got written with less prodding. This was possible because the tests were entirely written and maintained by developers.

meriton