
I often need to write a command line script that reads from a database, performs some analytics, and writes the results back to the database. My usual attempt to decouple things and create a separate data layer is to write scripts load.py, write.py, and do_analytics.py, where load and write handle the database interaction and do_analytics.py looks something like this:

import pickle

import load
import write

def batch_classify(model_filepath='my_model.pkl'):
    with open(model_filepath, 'rb') as infile:
        model = pickle.load(infile)

    data_loader = load.DataLoader()
    data_loader.load_data()
    data_loader.clean_data()
    data = data_loader.data
    # Maybe do some more manipulations here...
    output = model.transform(data)

    data_writer = write.DataWriter()
    data_writer.write_data(output)

if __name__ == "__main__":
    # maybe would have some command line options here to pass to batch_classify
    batch_classify()   
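
For context, load.py and write.py look roughly like this (heavily simplified; the real versions hold the actual connection and query details, so treat the shapes below as an approximation):

# load.py
class DataLoader(object):
    def __init__(self):
        self.data = None

    def load_data(self):
        self.data = ...  # query the database into memory

    def clean_data(self):
        self.data = ...  # fix types, drop bad rows, etc.

# write.py
class DataWriter(object):
    def write_data(self, output):
        ...  # write the results back to the database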

I would now like to test against some fixed dataset and make sure the classification results (output) are what I expect. I don't need to test the actual database connection right now, so based on some research I think I want to use mocking as in this post, but I'm not sure what level of object should be mocked, how to refactor so I can actually test once I have the mocked object, and whether this is even the best approach to begin with. When this has come up before, I have hacked together working solutions by using a small fixed test table in the actual database, but it's never elegant or clean code.

elz

1 Answer


In your case I would use four files.

database_provider.py

class DatabaseProvider(object):
    def get_data(self):
        return db.get()  # db = whatever client/connection you use; fetch your data

    def set_data(self, data):
        db.set(data)  # update your data
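
To make the db placeholder concrete: a provider backed by, say, sqlite3 could look like the sketch below. The table and column names are invented for illustration; the only point is that all DB access and row-to-object mapping lives here.

import sqlite3

class DatabaseProvider(object):
    def __init__(self, db_path="analytics.db"):
        self.db_path = db_path

    def get_data(self):
        # Map raw rows to plain Python objects so callers never see DB details.
        with sqlite3.connect(self.db_path) as conn:
            rows = conn.execute("SELECT id, value FROM observations").fetchall()
        return [{"id": row[0], "value": row[1]} for row in rows]

    def set_data(self, data):
        with sqlite3.connect(self.db_path) as conn:
            conn.executemany(
                "INSERT INTO results (id, value) VALUES (?, ?)",
                [(item["id"], item["value"]) for item in data],
            )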

analytic_manager.py

from database_provider import DatabaseProvider

class AnalyticManager(object):
    def __init__(self, arguments):
        self.arguments = arguments
        self.database_provider = DatabaseProvider()

    def process(self, arguments):
        # Get data from DB
        data = self.database_provider.get_data()

        # Do your logic here
        data = self.clean(data)
        data = self.transform(data)

        # Save in DB
        self.database_provider.set_data(data)

    def clean(self, data):
        # do your cleaning here
        cleaned_data = data  # placeholder for the real cleaning logic
        return cleaned_data

    def transform(self, data):
        # do your transform here
        transformed_data = data  # placeholder for the real transform logic
        return transformed_data

main.py

from analytic_manager import AnalyticManager

if __name__ == "__main__":
    arguments = whatever  # e.g. parsed command line options
    manager = AnalyticManager(arguments)
    manager.process(arguments)

test_analytic_manager.py

import unittest
from unittest.mock import Mock, patch

from analytic_manager import AnalyticManager


class TestAnalyticManager(unittest.TestCase):

    # Decorators are applied bottom-up, so mock_set_data is the first mock argument.
    @patch("database_provider.DatabaseProvider.get_data")
    @patch("database_provider.DatabaseProvider.set_data")
    def test_process_should_clean_and_transform_data(self, mock_set_data, mock_get_data):
        # Arrange
        arguments = whatever  # same placeholder as in main.py
        manager = AnalyticManager(arguments)

        mock_get_data.return_value = ["data from DB", "data2 from DB"]

        expected_data = ["cleaned and transformed data1", "cleaned and transformed data2"]

        # Act
        manager.process(arguments)

        # Assert
        mock_set_data.assert_called_once_with(expected_data)
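
With all four files in the same directory, for example, python -m unittest test_analytic_manager will run this without touching a real database.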

You can now mock your provider if you want. The most important thing is to keep all of your logic inside the manager, not the provider. The database provider should only talk to your DB and map the received data to your Python objects.

Separate Manager and Provider layers are what make mocking possible here. Keeping responsibilities separate will also save you from ending up with spaghetti code.
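
If you would rather replace the whole provider object instead of patching its individual methods, you can patch the class where AnalyticManager looks it up. A rough sketch, assuming analytic_manager.py imports it with from database_provider import DatabaseProvider:

import unittest
from unittest.mock import patch

from analytic_manager import AnalyticManager

class TestAnalyticManagerWholeProvider(unittest.TestCase):

    @patch("analytic_manager.DatabaseProvider")
    def test_process_uses_the_provider(self, mock_provider_cls):
        # The instance built inside AnalyticManager.__init__ is this mock.
        mock_provider = mock_provider_cls.return_value
        mock_provider.get_data.return_value = ["data from DB"]

        manager = AnalyticManager(arguments=None)
        manager.process(arguments=None)

        mock_provider.get_data.assert_called_once_with()
        mock_provider.set_data.assert_called_once()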

M07
  • Thanks, this looks nice. To clarify, would best practice be to have the option to pass the DatabaseProvider object - either real or mocked - to the AnalyticManager on instantiation? – elz Jun 01 '17 at 20:18
  • In Python you don't have to pass an option for mocking in your unit tests. See test_analytic_manager.py. You can use this class to test your logic before implementing the DB provider. – M07 Jun 01 '17 at 20:53
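
For completeness, the constructor-injection variant asked about in the first comment could look like the sketch below. It is an alternative to patching, not something the answer requires:

from unittest.mock import Mock

from database_provider import DatabaseProvider

class AnalyticManager(object):
    def __init__(self, database_provider=None):
        # Use the injected provider when one is supplied, otherwise the real one.
        self.database_provider = database_provider or DatabaseProvider()

    def process(self, arguments):
        data = self.database_provider.get_data()
        # ... clean / transform as in the answer ...
        self.database_provider.set_data(data)

# In a test no patching is needed; just hand in a Mock:
fake_provider = Mock()
fake_provider.get_data.return_value = ["data from DB"]

manager = AnalyticManager(database_provider=fake_provider)
manager.process(arguments=None)

fake_provider.set_data.assert_called_once()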