My data runs through a pipeline, and I want to verify that the input matches the output, i.e. that nothing in the pipeline is changing the data. To do this I am using Dask to compare the dataframes, as the source contains over 2 million rows.
The code I am using to compare the dataframes is this:
import dask.dataframe as dd

# Read both tables lazily; each becomes a Dask dataframe with 1000 partitions
input_df = dd.read_sql_table(input_table, con=input_engine.url, index_col=input_index_col_name, npartitions=1000)
output_df = dd.read_sql_table(output_table, con=output_engine.url, index_col=output_index_col_name, npartitions=1000)

# Compare partition by partition, filling NaNs so missing values compare equal
for part_one, part_two in zip(input_df.partitions, output_df.partitions):
    input_part_df = part_one.fillna("NULL")
    output_part_df = part_two.fillna("NULL")
    print(input_part_df.eq(output_part_df).all().compute())
I am wondering if there is a way to process the input_df.fillna and output_df.fillna calls in tandem so that it is faster. I have to fill the NaNs first because NaN never compares equal to NaN, so otherwise rows that do match would not come back as true.
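For reference, this is the kind of single-graph version I have been considering. It is only a sketch and assumes both tables share the same index, divisions, and column layout (variable names are the same as above):

# Build one lazy graph: both fillna calls and the comparison are scheduled
# together, so the Dask scheduler can run partitions in parallel rather than
# computing one partition at a time in a Python loop
matches = input_df.fillna("NULL").eq(output_df.fillna("NULL")).all()
print(matches.compute())  # one boolean per column

I am not sure whether this kind of whole-frame comparison is the intended way to get the two fillna calls running concurrently, or whether the alignment step makes it just as slow.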
I looked into using dask.delayed, but I don't think it applies here, since the data has to be pulled in for the dataframes before the comparison can happen. As you can imagine, the whole thing runs rather slowly given the number of records it is pulling.
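From what I can tell, another option might be to keep the per-partition loop but collect all the lazy results and trigger them with a single dask.compute call, so the scheduler can overlap the reads and comparisons instead of blocking on each partition. Roughly like this (again a sketch, assuming the partitions line up pairwise):

import dask

# Build every per-partition comparison lazily, without computing yet
comparisons = [
    p_in.fillna("NULL").eq(p_out.fillna("NULL")).all()
    for p_in, p_out in zip(input_df.partitions, output_df.partitions)
]

# One scheduler pass over all partitions, executed in parallel
results = dask.compute(*comparisons)
for res in results:
    print(res)

Is something like this the right direction, or is there a better pattern for this kind of full-table comparison?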