I have two fixed-width files like the ones below (the only difference is the date value starting at position 14).
sample_hash1.txt
GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018
sample_hash2.txt
GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018
Using pandas read_fwf I am reading these files and creating DataFrames, excluding the date value by loading only the first 13 characters. So my DataFrames look like this.
import pandas as pd
# read only the first 13 characters of each line (the date column is excluded)
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])
df1
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
df2
GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
Now I am trying to generate a hash value on each DataFrame, but the hashes are different, and I am not sure what is wrong here. Can someone throw some light on this please? I need to identify whether there is any change in the data in the file (excluding the date column).
print(hash(df1.values.tostring()))
-3571422965125408226
print(hash(df2.values.tostring()))
5039867957859242153
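As far as I can tell, the loaded data itself is identical in both frames, which is why I expected the hashes to match as well:

print(df1.equals(df2))  # prints True, both frames hold the same values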
I am loading these files (each around 2 GB in size) into a table. We receive full files from the source every time, and sometimes there is no change in the data (except the last date column). My idea is to reject such files: if I can generate a hash of the file and store it somewhere (in a table), then next time I can compare the new file's hash value with the stored hash. So I thought this was the right approach, but I am stuck with the hash generation.
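To make the intent concrete, this is roughly the workflow I have in mind. It is only a sketch: stable_hash is a placeholder for whatever hashing approach turns out to be correct (hashing the CSV text is just one possibility), and stored_hash stands in for the previous value I would fetch from the table.

import hashlib

def stable_hash(df):
    # Placeholder idea: serialize the frame to a canonical text form and hash that.
    return hashlib.sha256(df.to_csv(index=False).encode("utf-8")).hexdigest()

new_hash = stable_hash(df1)
stored_hash = ...  # previous hash value fetched from the table (placeholder)

if new_hash == stored_hash:
    print("No change in data (except the date column); reject the file")
else:
    print("Data changed; load the file and store the new hash")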
I checked this post, Most efficient property to hash for numpy array, but that is not what I am looking for.