Python checksum calculation ideas

Question

I have a table in an excel sheet containing strings, with let's say 3 columns and 2 rows. E.g.:

"one" , "two", "three"
"four", "five", "six"

I want to come up with an algorithm that can calculate a checksum for the whole table, which would be different if a new row is added (obvious) or even if one string switch places with another one from another row (less obvious).

So for the following table, we would have a different checksum than the previous one:

"one" , "five", "three"
"four", "two", "six"

I know there are a lot of possible ways to accomplish this, but is there a clean & robust way to do it in python? I had thought about using the position of the string in the table as a factor in the calculation, but I don't know whether it's a good idea or not as checksum calculation is one of my weaker skills.

I feel I should mention that in that same excel file, I have other data that can change without modifying the checksum though. The checksum should only verify the above-mentioned table's data, and must be a numeric value (because I need to send it over a CAN bus afterwards).

Does anyone have any suggestions as to how I could accomplish this? Please be precise and give examples.

EDIT: So I came up with an algorithm but yet completely unsure about its robustness. My idea was to first calculate a checksum for each cell of the first column and multiply it by the row index, then do the same for the seconds column and the third (if we use the same table as in the description). Then finally, sum the resulting checksums by multiplying each one of them by their column index. So something like:

total_checksum = (checksum('one') * 1 + checksum('four') * 2) * 1 + (checksum('two') * 1 + checksum('five') * 2) * 2 + (checksum('three') * 1 + checksum('six') * 2) * 3

I willingly posted this calculation like this without any fancy code just to explain my point. Now for the code (not pretty at the moment sorry for this), we can assume that the table contents have been parsed into a list of dictionary with key=column and item=value. For the table from the example we have:

tab = [{"A": "one", "B": "two", "C": "three"},
       {"A": "four", "B": "five", "C": "six"}]

print str(calculate_checksum(tab))


def calculate_checksum(table):
    """Calculate checksum of a dictionary (represents the excel table)"""

    def checksum(st):
        """Convert a string into tables of ascii characters and reduce it"""
        return reduce(lambda x, y: x + y, map(ord, st))

    total_cks = column_count = 0

    # list the data of each column:
    for column in ['A', 'B', 'C']:
        data_list = [data[column] for data in tables]
        chk = 0
        for row_index in range(len(data_list)):
            chk += checksum(data_list[row_index]) * (row_index + 1) # multiply each cell's checksum by its row index

        column_count += 1
        total_cks += chk * column_count # multiply each column checksum by its column index

    return total_cks

With this algorithm, the original table's checksum is 7254.When I switch "two" and "five", I get 7094 so it works. When I switch "one" and "two" I get 7230 that's covered too.

what do you mean by *table*? Is it a csv file? pandas dataframe? Excel sheet? Just load it in and hash it. — Ma0, Jan 25 '18 at 09:55
@Ev.Kounis Sorry I myself haven't been precise. It's an excel sheet. I'll update the description thanks. What do you mean by "hash" it how do you do that? — Soyding Mete, Jan 25 '18 at 09:56
I am trying to understand . Do you want to check if a new row that is added is one among the list of rows that is already present ? — , Jan 25 '18 at 10:01
Can't you simply check and compare the hash of the excel file? [See example](https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file). — Prahlad Yeri, Jan 25 '18 at 10:03
@KrishnaSangeethKS No, what I want to do is to be able to verify with a checksum that 2 tables are not similar even if one of the item just switched rows. — Soyding Mete, Jan 25 '18 at 10:04
@PrahladYeri No because there is other data in that same excel file that can be changed that shouldn't impact the checksum of the mentioned table... — Soyding Mete, Jan 25 '18 at 10:05
What have you tried before? Do you have any research you can show us? — Horkrine, Jan 25 '18 at 11:09
Yes @Horkrine. I'll post an edit to explain the détails and looking forward to hear your remarks. — Soyding Mete, Jan 25 '18 at 15:03

score 0 · Answer 1 · answered Jan 25 '18 at 10:17

0

You wouln't need a checksum for this.

import pandas as pd
df1 = pd.read_excel("...")
df2 = pd.read_excel("...")
exact_match = df1.equals(df2)

exact_match would return True if values are matching exactly and False otherwise.

answered Jan 25 '18 at 10:17

That would work if the excel file only contained the table I'm interested in. But my excel file contains other sheets that should be able to be edited without having the checksum of the mentioned table modified. Another thing is that I actually need a checksum value, because I want to transmit this value on the CAN to an ECU which will compare it to its own checksum... I spared you guys these details to simplify the problematic. – Soyding Mete Jan 25 '18 at 10:31

Prahlad Yeri · Answer 2 · 2018-01-26T10:11:48.187

0

You can use the openpyxl python library to achieve this. All you have to do is loop through the sheet and store the cumulative data in a string (or any format you prefer) for a later comparison:

from openpyxl import Workbook, load_workbook

wb = load_workbook(filename = 'latest.xlsx')
sheet = wb["some_sheet"]
sheet_data = ""
i = 1
while sheet['a'+str(i)].value != None:
    sheet_data = sheet['a'+str(i)].value + "::" + sheet['b'+str(i)].value + "::" + sheet['c'+str(i)].value +  "\n"
    i += 1

if sheet_data != old_sheet_data: #store the old_sheet_data somewhere
    pass #not matched

You can install openpyx library using the pip3 or pip package managers.

edited Jan 26 '18 at 10:11

answered Jan 25 '18 at 13:30

Prahlad Yeri

3,567
4
25
55

We're coming closer thank you @Prahlad. This can indeed be used but i still need a checksum out of that. It really needs to be a numeric value. – Soyding Mete Jan 25 '18 at 14:56
Sure, once you got the string variable (`sheet_data`), just use `hashlib` library to calculate a checksum of it like in this example: `hashlib.sha256("a".encode('utf-8')).hexdigest()` – Prahlad Yeri Jan 26 '18 at 00:18
Refer to this other answer for reference: https://stackoverflow.com/questions/27519306/hashlib-md5-typeerror-unicode-objects-must-be-encoded-before-hashing – Prahlad Yeri Jan 26 '18 at 00:18
Your answer led me in the right direction, thank you @Prahlad. I actually need a 32 bits checksum to be able to send it over a CAN bus. By looking at the link you provided and extending the research, I found out that the zlib library contains what I need: adler32 algorithm, which generates a 32 bits checksum out of a string. So I've tried adding up all the strings I got from the table as you suggested, and using adler32 for the checksum: it works. My only doubt now is whether it's robust enough to detect even small changes all the time: it seems that its not the most reliable algorithm. – Soyding Mete Jan 26 '18 at 09:51
I should add that this method is okay when you're sure that adding up the strings won't be a problem. Meaning that with that method, checksum will not change if in the first row we have "on", "etwo", "three", instead of "one", "two", "three", because when you add them all up you get the same final string. – Soyding Mete Jan 26 '18 at 09:58
Can't you take a 256 bit hash, split it into chunks, and then send each individual chunk over the CAN bus? That way, you can use a more reliable hashing algorithm. – Prahlad Yeri Jan 26 '18 at 10:06
And yes, if you are worried about two cells changing in such a way that might not affect the checksum, please add a `concatenator` when building the `sheet_data` variable like this: `sheet_data = sheet['a'+str(i)].value + "::" + sheet['b'+str(i)].value + "::" + sheet['c'+str(i)].value + "\n"` – Prahlad Yeri Jan 26 '18 at 10:11
There would be too many chunks to send, it really needs to be limited to 32 bits. But a solution to the problem I mentioned above is to add a unique character between each string when adding them up. That way it works fine. EDIT: Sorry I didn't see your update, thank you. – Soyding Mete Jan 26 '18 at 10:11

score 0 · Answer 3 · answered Jan 26 '18 at 10:23

With the help of Prahlad Yeri, I've managed to achieve this. I've found out that what I needed was in the zlib library.

I add up all cell contents with ":" between each string (any other character considered safe can be used) and use the adler32 function of the zlib library to calculate a 32 bits checksum out of it.

import zlib

tab = [{"A": "one", "B": "two", "C": "three"},
   {"A": "four", "B": "five", "C": "six"}]

print str(calculate_checksum(tab))


def calculate_checksum(table):
    """Returns a 32bits checksum out of the concatenated cell contents, using Adler algorithm."""
    data = ""
    for row in table:
        for column in ["A", "B", "C"]:
            data += row[column] + ":" # Use ":" to separate each string

    return zlib.adler32(data)

Python checksum calculation ideas

3 Answers3