3

I have a script of about 50 lines that reads data from a database, loads it into a pandas dataframe and then performs numerous operations on the dataframe.

I was wondering how people generally test this type of code? I'm not talking about tools like assert_frame_equal, but rather principles people follow.

For instance, should I create 50 separate tests to test each operation performed, or should I try to break up the script into smaller parts?

If anybody knows of quality open source projects that I can use as inspiration, please let me know.

Shihe Zhang
Kritz
  • Straight from the horse's mouth: https://github.com/pandas-dev/pandas/tree/master/pandas/tests – cs95 Nov 21 '17 at 06:48

2 Answers

3

If you are just starting to write Python unit tests, this question is recommended.

Since the 50 lines are related to each other, you probably want a functional test.
Read up on the difference between unit, functional, acceptance, and integration testing.
If you know the SOLID principles of object-oriented design, you will see that the code needs refactoring.

On how to design a good test, see "What are the properties of good unit tests?"

Specific to pandas, use a small amount of data so the tests run quickly.
Make a dummy copy for testing rather than using the original data.
And check mainly the key features you care about.
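As a minimal sketch of the pandas advice above: the function `add_total` and the column names `price`, `quantity` and `total` are made up for illustration and are not from the original script. The test uses a tiny hand-written frame, works on a copy, and checks only the one feature that matters.

```python
import pandas as pd


def add_total(df):
    """Hypothetical processing step: add a 'total' column."""
    out = df.copy()  # work on a copy so the caller's data is untouched
    out["total"] = out["price"] * out["quantity"]
    return out


# A tiny hand-written frame stands in for the real data.
small = pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})
original = small.copy()

result = add_total(small)

# Check the key feature: the new column holds the expected values.
assert result["total"].tolist() == [6.0, 7.0]
# The input frame was not modified by the processing step.
pd.testing.assert_frame_equal(small, original)
```

Two rows are enough here because the test targets the arithmetic, not the volume of data.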

Shihe Zhang
2

I would suggest this approach:

  1. Split the script into a data retrieval part and a data processing part. It's better to test your data access/query code and your computations separately.
  2. Prepare a fixed dataset to use in the tests. It may be a subset of your production data or a special dataset that covers boundary conditions (NaNs, zeroes, negative values, etc.).
  3. Write test cases that check the results of your computations. You may check values directly, or compute some aggregates (COUNT, SUM) and compare them with expected values.
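The three steps can be sketched roughly as follows. Everything named here (the `orders` table, the `customer` and `amount` columns, the functions `load_orders` and `summarize`) is hypothetical, and an in-memory SQLite database stands in for whatever database the real script uses.

```python
import sqlite3

import numpy as np
import pandas as pd


def load_orders(conn):
    """Data retrieval: the only part that touches the database."""
    return pd.read_sql_query("SELECT customer, amount FROM orders", conn)


def summarize(df):
    """Data processing: a pure function from DataFrame to DataFrame."""
    cleaned = df.dropna(subset=["amount"])
    cleaned = cleaned[cleaned["amount"] > 0]
    return cleaned.groupby("customer", as_index=False)["amount"].sum()


def test_summarize_handles_boundary_values():
    # Fixed dataset covering a NaN, a zero and a negative value.
    fixed = pd.DataFrame({
        "customer": ["a", "a", "b", "b", "c"],
        "amount": [10.0, np.nan, 0.0, 5.0, -3.0],
    })
    result = summarize(fixed)
    # Aggregate checks: row count and overall sum.
    assert len(result) == 2
    assert result["amount"].sum() == 15.0
    # Direct check on the surviving customers.
    assert result["customer"].tolist() == ["a", "b"]


def test_load_orders_returns_expected_columns():
    # The query is tested separately against a throwaway database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.execute("INSERT INTO orders VALUES ('a', 10.0)")
    df = load_orders(conn)
    conn.close()
    assert list(df.columns) == ["customer", "amount"]
    assert len(df) == 1


test_summarize_handles_boundary_values()
test_load_orders_returns_expected_columns()
```

Because `summarize` never sees a connection, its test needs no database at all; under a runner such as pytest the two explicit calls at the bottom would be unnecessary.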

The number of checks depends on the data and the computations you do. In some cases it might be enough to check only the SUM() of all elements; in others you need to check every item.

I prefer to check only a few general conditions that would fail if something went wrong, rather than cover all possible cases.

  • Thanks. For the fixed dataset, should that be something like a CSV file that I load into the dataframe or what would you suggest? – Kritz Nov 21 '17 at 07:33
  • Sure, you can use anything. CSV is easy to use, and you can store the file next to your test cases. – Lazarev Ivan Nov 21 '17 at 07:36
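A rough sketch of the CSV fixture idea from the comments above. The file name `orders.csv` and its columns are invented for the example. In a real project the file would be committed next to the test module; here it is written to a temporary directory only so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Contents of the hypothetical fixture file; the second row has a missing amount.
CSV_TEXT = "customer,amount\na,10.0\na,\nb,5.0\n"


def load_fixture(path):
    """Load the fixed test dataset from a CSV file."""
    return pd.read_csv(path)


with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "orders.csv"
    fixture_path.write_text(CSV_TEXT)
    df = load_fixture(fixture_path)

# The empty field is read as NaN, so the boundary case survives the round trip.
assert df.shape == (3, 2)
assert df["amount"].isna().sum() == 1
assert df["amount"].sum() == 15.0
```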