
I have two fixed-width files like the ones below (the only change is the date value starting at position 14).

sample_hash1.txt

GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018

sample_hash2.txt

GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018

Using pandas `read_fwf` I am reading these files and creating DataFrames (excluding the date value by loading only the first 13 characters). So my dataframes look like this.

import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])

df1

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111

df2

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111

Now I am trying to generate a hash value on each dataframe, but the hashes are different. I am not sure what is wrong here. Can someone throw some light on this please? I have to identify if there is any change in the data in the file (excluding the date column).

print(hash(df1.values.tostring()))
-3571422965125408226

print(hash(df2.values.tostring()))
5039867957859242153

I am loading these files (each around 2 GB in size) into a table. We receive full files from the source every time, and sometimes there is no change in the data (except the last date column), so my idea is to reject such files. If I can generate a hash on the file and store it somewhere (in a table), next time I can compare the new file's hash value with the stored hash. So I thought this was the right approach, but I am stuck with the hash generation.
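A minimal sketch of the workflow I have in mind (the hashing here is just a placeholder over the whole file; the real `compute_hash` would have to exclude the date column, which is the part I am stuck on):

import hashlib

def compute_hash(path):
    # placeholder: hashes the whole file; the real version needs to
    # exclude the date column starting at position 14
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

stored_hash = compute_hash("sample_hash1.txt")  # persisted in a table in practice
new_hash = compute_hash("sample_hash2.txt")     # the next full file we receive
if new_hash == stored_hash:
    print("no change in data, reject the file")
else:
    print("data changed, load the file")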

I checked this post [Most efficient property to hash for numpy array](https://stackoverflow.com/questions/16589791/most-efficient-property-to-hash-for-numpy-array) but that is not what I am looking for.

goks
  • The hash will be different for different objects. The two dataframes are not the same. Try `df1.values.tostring() == df2.values.tostring()`; it should be false. If you want to have the same hash, you need to remove the date from the values before taking the hash. – TwistedSim Apr 17 '18 at 16:37
  • Yes, it is False. Is there any other way I can generate a unique code based on the data in the file (excluding some part of the data)? – goks Apr 17 '18 at 16:41
  • You can try: `hash(df1[:-1].values.tostring())` to remove the last column. – TwistedSim Apr 17 '18 at 16:54
  • Possible duplicate of [Most efficient property to hash for numpy array](https://stackoverflow.com/questions/16589791/most-efficient-property-to-hash-for-numpy-array) – jdehesa Apr 17 '18 at 17:08
  • @TwistedSim the last column is not in the dataframe anyway; I am loading the first 13 characters only – goks Apr 17 '18 at 17:14
  • Oh, you're right, sorry. Can you show the result of `df1.values.tostring()` and `df2.values.tostring()`? – TwistedSim Apr 17 '18 at 17:19
  • @jdehesa how come it is a duplicate? I am trying to generate a hash on a file excluding some of the data from the file – goks Apr 17 '18 at 17:20
  • @TwistedSim `df1.values.tostring()` gives b'\xb0=\xdb\xdbK\x7f\x00\x00\xf0=\xdb\xdbK\x7f\x00\x00' and `df2.values.tostring()` gives b'pf\xdb\xdbK\x7f\x00\x00\xb0f\xdb\xdbK\x7f\x00\x00' – goks Apr 17 '18 at 18:22
  • Try `df1.values.astype(str)`: you need to get a string representation of your data which is the same for both. `tostring()` on an object array serializes the object pointers, which will differ between the two dataframes (see the sketch below). – TwistedSim Apr 17 '18 at 18:27
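In other words: with `colspecs=[(0,13)]` the parsed values are Python strings, so `df.values` has dtype `object`, and `tostring()` (`tobytes()` in newer NumPy) serializes the memory addresses of those objects rather than their contents. A minimal sketch of the effect, with hypothetical data:

import numpy as np

# two distinct str objects with identical contents
s1 = "".join(["ABCDEF", "GHIJKL"])
s2 = "".join(["ABCDEF", "GHIJKL"])

a = np.array([s1], dtype=object)
b = np.array([s2], dtype=object)

# tobytes() on an object array dumps the PyObject pointers,
# so identical contents still produce different raw bytes
print(a.tobytes() == b.tobytes())                          # False

# converting to a fixed-width string dtype serializes the contents
print(a.astype(str).tobytes() == b.astype(str).tobytes())  # True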

3 Answers


You can now use `pd.util.hash_pandas_object`:

import hashlib
import pandas as pd

hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()

For a dataframe with 50 million rows, this method took me 10 seconds, versus over a minute for the `to_json()` method.
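Applied to the dataframes from the question, a quick sketch (re-reading with `header=None` so the first row is treated as data rather than as a header):

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)], header=None)
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)], header=None)

h1 = hashlib.sha1(pd.util.hash_pandas_object(df1).values).hexdigest()
h2 = hashlib.sha1(pd.util.hash_pandas_object(df2).values).hexdigest()

print(h1 == h2)  # True: the data, excluding the date column, is identical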

Roko Mijic
  • How would you return a single hash for the entire dataframe, though? – Tom Nov 01 '22 at 22:30
  • This answer worked well for me. `pd.util.hash_pandas_object(df)` on its own returns a Series (one hash per row of the dataframe), but the full answer, `hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest()`, produced a single hash for the dataframe. – Joseph Jan 19 '23 at 22:45
  • But this works only for the "content" of the dataframe and not its metadata, like column and row names. – buhtz May 30 '23 at 12:44

Use a string representation of the dataframe:

import hashlib

print(hashlib.sha256(df1.to_json().encode()).hexdigest())
print(hashlib.sha256(df2.to_json().encode()).hexdigest())

or

print(hashlib.sha256(df1.to_csv().encode()).hexdigest())
print(hashlib.sha256(df2.to_csv().encode()).hexdigest())
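Since each file is around 2 GB, a lighter alternative sketch is to skip pandas entirely and stream the file, hashing only the first 13 characters of each line (`keep=13` matches the fixed-width layout in the question; `file_hash_excluding_date` is a hypothetical helper):

import hashlib

def file_hash_excluding_date(path, keep=13):
    # SHA-256 over the first `keep` bytes of every line,
    # so the trailing date column never enters the hash
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for line in f:
            h.update(line[:keep])
    return h.hexdigest()

print(file_hash_excluding_date("sample_hash1.txt")
      == file_hash_excluding_date("sample_hash2.txt"))  # True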
python-starter
  • Awesome, this is working. But I think hash generation will be slow on big files? – goks Apr 17 '18 at 21:47
  • Do you know why running the example in different runs gives different hashes? Do you know how to get the same hash for the same dataframe across different runs of the code? – alessiosavi Jul 01 '20 at 08:13

The other answers here forget the column names (column index) of a dataframe. `pd.util.hash_pandas_object()` creates a series of hash values, one for each row of a dataframe, including its index (the row name). But the column names don't matter, as you can see here:

>>> from pandas import *
>>> from pandas import util
>>> util.hash_pandas_object(DataFrame({'A': [1,2,3], 'B': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64
>>> util.hash_pandas_object(DataFrame({'Foo': [1,2,3], 'Bar': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64

My solution:

import hashlib
import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# hash each row (including the row index)
row_hashes = pandas.util.hash_pandas_object(df)

# append the hashes of the column names
row_hashes = pandas.concat(
    [row_hashes, pandas.util.hash_pandas_object(df.columns)]
)

# hash the series of hashes into a single digest
digest = hashlib.sha1(row_hashes.values).hexdigest()

print(digest)
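As a quick check (hypothetical frames, following the example above), the same data under different column names should now produce different digests:

import hashlib
import pandas

def df_hash(df):
    h = pandas.util.hash_pandas_object(df)
    h = pandas.concat([h, pandas.util.hash_pandas_object(df.columns)])
    return hashlib.sha1(h.values).hexdigest()

df_a = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_b = pandas.DataFrame({'Foo': [1, 2, 3], 'Bar': [4, 5, 6]})

print(df_hash(df_a) == df_hash(df_b))  # False: column names now count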
buhtz