I am writing unit tests for a project of mine that processes data. Some of my scripts take CSVs, concatenate them with pandas, and then randomly sample the rows to make train/dev/test sets for machine learning tasks.
My tests generate some random input CSVs to test with. But how can I create reference data for what SHOULD be returned by the script I am trying to test?
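For context, the input data I generate is along these lines (a simplified sketch; the column names and values are made up):

```python
import pandas as pd
from io import StringIO

# Two fake CSVs of the kind my fixture would generate (contents are made up).
csv_a = "id,label\n1,cat\n2,dog\n"
csv_b = "id,label\n3,bird\n4,fish\n"

# My script concatenates CSVs like these before sampling splits from them.
frames = [pd.read_csv(StringIO(s)) for s in (csv_a, csv_b)]
combined = pd.concat(frames, ignore_index=True)

assert len(combined) == 4
```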
# Example of my test setup:

@pytest.fixture
def create_reference_input_data():
    # Create some random CSV strings and write them out as test input CSVs.
    ...

@pytest.fixture
def create_reference_output_data():
    # Build fake output data from the data created in create_reference_input_data().
    # This is what I expect the script under test to produce; I will assert
    # the script's actual output against it.
    return reference_train_df, reference_test_df, reference_dev_df

def test_collect_data(create_reference_output_data):
    # Run the script under test. It concatenates CSVs like those made in the
    # create_reference_input_data() fixture and randomly samples them into
    # train/test/dev split CSV data.
    test_data = collect_data(input_path, output_path, test_split=0.10, dev_split=0.20)
    for reference_row, test_row in zip(create_reference_output_data, test_data):
        assert reference_row == test_row  # lines of text match in reference and actual output
I hope this pseudocode makes sense. I understand setting seeds and so on, but how can I manually create test data for what my script SHOULD produce, and assert that it matches what is actually produced when I call the script?
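For reference, this is roughly the seed-based approach I had in mind: if the split logic takes a fixed seed, the same inputs always yield the same splits, so a reference can be generated once and reused. Here `split_frames` is a hypothetical stand-in for what I imagine `collect_data` does internally, not its actual code:

```python
import pandas as pd

def split_frames(df, test_split=0.10, dev_split=0.20, seed=42):
    # Shuffle deterministically with a fixed seed, then slice into splits.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_split)
    n_dev = int(len(shuffled) * dev_split)
    test = shuffled.iloc[:n_test]
    dev = shuffled.iloc[n_test:n_test + n_dev]
    train = shuffled.iloc[n_test + n_dev:]
    return train, dev, test

# Concatenated toy input, like my script would build from several CSVs.
df = pd.concat(
    [pd.DataFrame({"x": range(5)}), pd.DataFrame({"x": range(5, 10)})],
    ignore_index=True,
)

train1, dev1, test1 = split_frames(df)
train2, dev2, test2 = split_frames(df)  # same seed, so identical splits

assert train1.equals(train2) and dev1.equals(dev2) and test1.equals(test2)
```

With this, would the idiomatic test be to hard-code the expected rows for one known seed, or just to assert properties (sizes, no overlap, determinism) as above?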