I need an algorithm or method for computing a checksum of a SQL column that is easy to replicate in Python given a CSV. I want to verify that the CSV column and the SQL column match.
I have a scheme that sums the BINARY_CHECKSUM of each row in the column, on both the SQL side and the Python side, into two overall column sums, but I'm worried about collisions and want to know whether there's a faster or better way.
I need a function, where c is a full SQL column, such that

    python_function(c) == pyquery('EXEC sql_function("some_table", c)')

where python_function(c) and sql_function(c) return something like a hash or checksum. One function doesn't need to cover every possible type of c, although that's a plus; a scheme specific to varchars, ints, bytes, etc. would be fine.
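For example, for a varchar column I could imagine something like the following (a rough, untested sketch; some_db, some_table, and some_varchar_col are placeholders, and it assumes SQL Server's HASHBYTES, no NULLs, and ASCII-only data, since HASHBYTES on a varchar hashes bytes in the column's code page and the Python side would have to encode the same way):

    import hashlib

    # SQL side: sum the first 8 bytes of each row's MD5 hash as a signed
    # big-endian BIGINT; DECIMAL(38, 0) keeps the 50M-row sum from overflowing
    SQL_HASH_SUM = """
        SELECT SUM(CAST(CONVERT(BIGINT,
                   SUBSTRING(HASHBYTES('MD5', some_varchar_col), 1, 8))
                   AS DECIMAL(38, 0)))
        FROM some_table
    """

    def md5_prefix_as_int(value):
        # First 8 bytes of the MD5 digest, read big-endian and signed,
        # to mirror CONVERT(BIGINT, SUBSTRING(hash, 1, 8)) above
        digest = hashlib.md5(value.encode("ascii")).digest()
        return int.from_bytes(digest[:8], byteorder="big", signed=True)

    def python_hash_sum(values):
        # Python ints are arbitrary precision, so this sum can't overflow
        return sum(md5_prefix_as_int(v) for v in values)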
The CSVs will be large: almost 50 million rows and 66 columns (varchar, int, bit, smallint, decimal, and numeric). The CSV comes from an external source, and I need to verify that it matches the data in my database. The check doesn't need to be 100% accurate; missing fewer than 100,000 differing rows is fine.
As an example, here's a high-level implementation of my current solution in Python pseudocode:
    def is_likely_equal(csv_filename, column_name):
        # I know the column won't fit in memory; this is just an example
        column_data = get_column_data(csv_filename, column_name)
        python_b_sum = get_b_sum(column_data)
        # CAST to BIGINT so the SQL-side sum doesn't overflow INT over ~50M rows
        sql_b_sum = some_db.execute(
            f"SELECT SUM(CAST(BINARY_CHECKSUM({column_name}) AS BIGINT)) FROM some_table"
        )
        return python_b_sum == sql_b_sum

    def get_b_sum(column_data):
        b_sum = 0
        for entry in column_data:
            b_sum += b_checksum(entry)
        return b_sum
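Since the column won't fit in memory, get_column_data can stream the CSV instead of materializing it; here's a minimal sketch using the standard csv module (it assumes the CSV has a header row, and b_checksum stands for whatever Python port of the row checksum I end up with):

    import csv

    def get_column_data(csv_filename, column_name):
        # Yield one value at a time so 50M rows never sit in memory
        with open(csv_filename, newline="") as f:
            for row in csv.DictReader(f):
                yield row[column_name]

    def get_b_sum(column_data):
        # Works unchanged on the generator above
        return sum(b_checksum(entry) for entry in column_data)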