
I am writing unit tests for a data-processing project of mine. Some of its scripts take CSVs, concatenate them with Pandas, and then randomly sample the result to make train/dev/test sets for machine learning tasks.

My unit tests generate some random CSV data to test against. But how can I create reference data for what SHOULD be returned by the script I am trying to test?

# Example of my test setup:

import pytest


@pytest.fixture
def create_reference_input_data():
    # Create some random CSV strings and write them out as test input CSVs.
    ...


@pytest.fixture
def create_reference_output_data(create_reference_input_data):
    # Create some fake output data from the data made in create_reference_input_data().
    # This output should look like what I expect from the script I am testing,
    # and I will assert it against what the script actually produces.
    ...  # pseudocode: building these three frames is the open question
    return reference_train_df, reference_test_df, reference_dev_df


def test_collect_data(create_reference_output_data):
    # Run the script under test. It randomly samples the concatenated CSV data
    # (like the files made by the create_reference_input_data() fixture) to
    # produce train/test/dev split CSVs. input_path and output_path are placeholders.
    test_data = collect_data(input_path, output_path, test_split=0.10, dev_split=0.20)

    # Assert that the lines of text are the same in the reference and the test output.
    for file1_row, file2_row in zip(create_reference_output_data, test_data):
        assert file1_row == file2_row

I hope this pseudocode makes sense. I understand setting seeds and whatnot. But how can I manually create test data for what my script SHOULD produce, and assert that it matches what is actually produced when I call that script?
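
For concreteness, here is a minimal sketch of the kind of seeded split described above. The function name `split_frame`, its `seed` parameter, and the toy columns are assumptions made for illustration; this is not the actual `collect_data` script. It only shows that a fixed `random_state` makes the sampling repeatable, so the same call can be run twice and compared.

```python
import pandas as pd


def split_frame(df, test_split=0.10, dev_split=0.20, seed=0):
    """Hypothetical stand-in for the sampling step; not the real collect_data."""
    # Sample the test rows first, with a fixed seed so the draw is repeatable.
    test_df = df.sample(frac=test_split, random_state=seed)
    remaining = df.drop(test_df.index)
    # dev_split is a fraction of the full frame, so rescale it for the remainder.
    dev_frac = dev_split / (1.0 - test_split)
    dev_df = remaining.sample(frac=dev_frac, random_state=seed)
    train_df = remaining.drop(dev_df.index)
    return train_df, test_df, dev_df


# Toy input standing in for the concatenated CSVs (assumed columns).
df = pd.DataFrame(
    {"text": [f"row {i}" for i in range(100)], "label": [i % 2 for i in range(100)]}
)

train_df, test_df, dev_df = split_frame(df, seed=42)
again = split_frame(df, seed=42)  # same seed, so the same three frames
```

With a split like this, the properties that can be checked without hand-built reference files are that the three parts have the expected sizes, do not overlap, and together contain every input row.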

Coldchain9
  • What you have is a multipart question. Have you looked at creating the test data using `Faker` (https://faker.readthedocs.io/en/master/pytest-fixtures.html)? Have you checked how to write to a tmpdir (https://docs.pytest.org/en/stable/tmpdir.html)? – thoroc Aug 27 '20 at 11:47
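
The temporary-directory idea in the comment above could be sketched roughly as follows. The helper `write_random_csvs`, its parameters, and the `id`/`text`/`label` columns are hypothetical, and the standard-library `random` module stands in for `Faker` here to keep the sketch dependency-free.

```python
import csv
import random
from pathlib import Path


def write_random_csvs(directory, n_files=3, rows_per_file=5, seed=0):
    """Write small CSVs of made-up rows into `directory` and return their paths."""
    rng = random.Random(seed)  # local RNG so the files are identical on every run
    paths = []
    for file_idx in range(n_files):
        path = Path(directory) / f"input_{file_idx}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["id", "text", "label"])  # assumed column names
            for row_idx in range(rows_per_file):
                row_id = file_idx * rows_per_file + row_idx  # unique across files
                writer.writerow([row_id, f"text {rng.randint(0, 9999)}", rng.choice([0, 1])])
        paths.append(path)
    return paths
```

Inside a pytest fixture, `directory` could be pytest's built-in `tmp_path` fixture, so each test gets its own throwaway folder of input CSVs.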
