
When testing my Apache Spark application, I want to run some integration tests. For that reason I create a local Spark application (with Hive support enabled) in which the tests are executed.
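For illustration, the test session is created roughly like this (a minimal sketch; the app name and warehouse directory are just placeholders, not my actual setup):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of a local Spark session with Hive support for tests.
    // App name and warehouse location are placeholders.
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("integration-tests")
      .config("spark.sql.warehouse.dir", "target/spark-warehouse")
      .enableHiveSupport()
      .getOrCreate()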

How can I make sure that the Derby metastore is cleared after each test, so that the next test has a clean environment again?

What I don't want to do is restart the Spark application after each test.

Are there any best practices to achieve what I want?

Joha
  • Spark test jars can be used, details: https://spark-testing-java.readthedocs.io/en/release-1.0/Scala/context_creation/spark-test-jar.html – pasha701 Jul 22 '19 at 12:45

1 Answer


I think that introducing application-level logic just for the sake of integration testing somewhat breaks the concept of integration testing.

From my point of view, the correct approach is to restart the application for each test.

That said, I believe another option is to start/stop the SparkContext for each test. That should clean up any relevant state.
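A rough sketch of that option, assuming ScalaTest (the class name, test name and table name are illustrative): create a fresh session before each test and stop it afterwards.

    import org.apache.spark.sql.SparkSession
    import org.scalatest.BeforeAndAfterEach
    import org.scalatest.funsuite.AnyFunSuite

    // Sketch: one SparkSession per test, torn down after each test.
    class HiveIntegrationSpec extends AnyFunSuite with BeforeAndAfterEach {

      private var spark: SparkSession = _

      override def beforeEach(): Unit = {
        spark = SparkSession.builder()
          .master("local[*]")
          .appName("per-test-session")
          .enableHiveSupport()
          .getOrCreate()
      }

      override def afterEach(): Unit = {
        spark.stop()
        // Clear the cached sessions so the next getOrCreate() builds a new one.
        SparkSession.clearActiveSession()
        SparkSession.clearDefaultSession()
      }

      test("stores a table in Hive") {
        import spark.implicits._
        Seq(1, 2, 3).toDF("value").write.mode("overwrite").saveAsTable("numbers")
        assert(spark.table("numbers").count() == 3)
      }
    }

Note that stopping and restarting the session does not by itself remove the Derby metastore_db directory or the warehouse files on disk, which is the limitation raised in the comments below.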

UPDATE - answer to comments

  1. Maybe it's possible to do a cleanup by deleting tables/files? (See the sketch after this list.)
  2. I would ask a more general question: what do you want to test with your test? Software development defines unit testing and integration testing, and nothing in between. If you want to do something that is neither an integration test nor a unit test, then you're doing something wrong. Specifically, with your test you are trying to test something that is already tested.
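A sketch of such a cleanup, assuming a local setup, the default database, and that commons-io is on the classpath (all of which are assumptions, not part of the question):

    import java.nio.file.Paths
    import org.apache.commons.io.FileUtils
    import org.apache.spark.sql.SparkSession

    // Sketch: drop every table in the default database and wipe the warehouse dir.
    def cleanUp(spark: SparkSession): Unit = {
      // Drop the tables registered in the metastore's default database.
      spark.catalog.listTables("default").collect().foreach { t =>
        spark.sql(s"DROP TABLE IF EXISTS default.${t.name}")
      }

      // Delete whatever is left under the configured warehouse directory.
      val warehouse = Paths.get(
        spark.conf.get("spark.sql.warehouse.dir").stripPrefix("file:")).toFile
      if (warehouse.exists()) {
        FileUtils.cleanDirectory(warehouse)
      }
    }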

For the difference and general idea of unit and integration tests you can read here.

I suggest you rethink your testing and, depending on what you want to test, write either an integration or a unit test. For example:

  1. To test application logic - unit test
  2. To test that your application works in its environment - integration test. But here you shouldn't test WHAT is stored in Hive, only THAT the storage happened, because WHAT is stored should already be covered by a unit test.

So. The conclusion:

I believe you need integration tests to achieve your goals, and the best way to do that is to restart your application for each integration test. Because:

  1. In real life your application will be started and stopped
  2. In addition to your Spark state, you need to make sure that all the objects in your code are correctly deleted/reused. Singletons, persistent objects, configurations - they can all interfere with your tests
  3. Finally, the code that will perform the integration tests - where is the guarantee that it will not break production logic at some point?
Vladislav Varslavans
  • If you only restart the Spark session, tables in the metastore are still there. They are stored in a local Derby database called metastore_db. Also, the files of the tables will still be there in `${spark.sql.warehouse.dir}` – Joha Jul 22 '19 at 11:27
  • If you say it breaks the logic of integration testing, what would be the name of a test that tests a function that uses both Hive and Spark? – Joha Jul 22 '19 at 11:29