Checksum algorithm for column that can be verified outside of SQL using python and a csv

Question

I need an algorithm or method of computing a checksum of a sql column that is easy to replicate in python given a csv.

I want to verify that the csv column and the sql column match

I have a scheme summing the binary_checksums of each row in the column on the sql side and python side into two overall column sums but I'm worried about collisions and I want to know if there's a faster or better way.

I need a function where c is a full sql column such that

python_function(c) == pyquery("EXEC sql_function("some_table",c))

where python_function(c) and sql_column(c) return something like a hash or checksum

one function doesn't need to encompass all possible types of c either. although it's a plus if it does. you could give a scheme specific to varchars or ints or bytes, etc.

The csv's will be large almost 50 million rows 66 columns (varchar, ints, bits,smallints, decimal, numeric).

The csv comes from an external source and I need to verify that it matches the data in my database.

doesn't need to be 100% accurate missing less than 100,000 rows of difference is fine

As an example here's a high level implementation of my solution in python pseudocode.

def is_likely_equal(csv_filename, column_name):
    column_data = get_column_data(csv_filename,column)
    # I know it won't fit in memory this is an example

    python_b_sum = get_b_sum(column_data) 
    sql_b_sum = some_db.execute("SELECT SUM(BINARY_CHECKSUM(column_name)) FROM table")

    if python_b_sum == sql_b_sum:
         return True
    else:
         return False
   
def get_b_sum(column_data):
    b_sum = 0
    for entry in column_data:
        b_sum += b_checksum(entry)
    return b_sum

So I can compare the checksum of what I generate with python to the checksum of what I generate with SQL. And If they match I know both columns are the same. I'll update the question — DNS_Jeezus, Jun 30 '20 at 22:19
All checksums have the possibility of collision. Some more than others, but it's inescapable. — AlwaysLearning, Jun 30 '20 at 22:21
@AlwaysLearning I know. this is for data qualtiy purposes. reducing the chances of an undetected fault to the chances of a checksum collision is fine for me. — DNS_Jeezus, Jun 30 '20 at 22:24
@PM77-1 a checksum collision? because it'll increase the chances of a false negative. and sql_server's binary checksum is known to be farfarfar worse than something like md5 or sha-1. but I'm considering binary_checksum because it's fast and easy to work with. I'm asking this question for alternative ideas that might be better than what I'm considering. — DNS_Jeezus, Jun 30 '20 at 22:29
I think you are confusing [checksums and hashes](https://security.stackexchange.com/questions/194600/checksum-vs-hash-differences-and-similarities) ..... They serve different purposes. The function you are looking for in SQL Server is [`HASHBYTES`](https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15) — Alex, Jul 01 '20 at 04:25
As far as performance goes, it all depends on the size of data in said column. If your data is an integer (or short string), than computing and outputting hash values will be much more expensive than simply comparing integers (or short strings). In order to be more specific you need to provide more specifics yourself. Are you creating a CSV with SQL Server or importing it? Do you save to a table etc. , remember we know nothing about your intentions and current system set-up. How many columns, type and size of data etc. — Alex, Jul 01 '20 at 04:32
@Alex I know the difference. they are both different classes of mapping functions. they do the same thing they just guarantee different qualities about the output. — DNS_Jeezus, Jul 01 '20 at 15:33
It may or may not be that simple.... You may find the technical info in this question useful (make sure to read all comments as well): https://stackoverflow.com/questions/10559203/sha-256-hashes-different-between-c-sharp-and-javascript — Alex, Jul 01 '20 at 18:39

Checksum algorithm for column that can be verified outside of SQL using python and a csv

0 Answers0