
My goal is to get a unique hash value for a DataFrame that I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was that I create the function

def _get_array_hash(arr):
    arr_hashable = arr.values                # underlying numpy array
    arr_hashable.flags.writeable = False     # make the buffer read-only, hence hashable
    hash_ = hash(arr_hashable.data)
    return hash_

which takes the underlying numpy array, sets it to an immutable state, and hashes its buffer.

INLINE UPD.

As of 08.11.2016, this version of the function no longer works. Instead, you should use

hash(df.values.tobytes())

See the comments on Most efficient property to hash for numpy array.

END OF INLINE UPD.
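For context, a minimal sketch of the updated approach. One caveat worth knowing: Python salts `hash()` of bytes objects per interpreter session (PYTHONHASHSEED), so for stability across launches a `hashlib` digest is the safer choice:

```python
import hashlib

import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1]})

# Stable within one interpreter session:
h1 = hash(df.values.tobytes())
h2 = hash(df.values.tobytes())
assert h1 == h2

# Stable across sessions too, since hashlib digests are not salted:
digest = hashlib.sha256(df.values.tobytes()).hexdigest()
```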

It works for a regular pandas DataFrame:

In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165 

But when I try to apply it to a DataFrame obtained from a .csv file, the hash changes between calls:

In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730

Can somebody explain to me how that's possible?

I can create a new DataFrame out of it, like

new_data = pd.DataFrame(data=data_from_file.values, 
            columns=data_from_file.columns, 
            index=data_from_file.index)

and it works again:

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a dataframe across application launches, in order to retrieve some value from a cache.

JJJ
mkurnikov
  • This might help: https://github.com/TomAugspurger/engarde/issues/3 – Jan Katins Jul 27 '15 at 15:45
  • I tried the approach of hashing the index, the columns, and the str(data_frame) value. It's slow, and it suffers from the same issues. – mkurnikov Jul 28 '15 at 14:48
  • I'm interested in doing this as well - can I ask why you included " arr_hashable.flags.writeable = False"? Would you expect the hash() function to possibly modify the array otherwise? – Max Power Nov 08 '16 at 02:05
  • @MaxPower it was a long time ago, so I don't remember exactly. But I think I was inspired by http://stackoverflow.com/questions/16589791/most-efficient-property-to-hash-for-numpy-array/16592241#16592241. It worked back then. Now it doesn't work, but you can use `hash(a.data.tobytes())` instead, and you don't need `flags.writeable = False` anymore. See the comments to the referred answer. – mkurnikov Nov 08 '16 at 11:21
  • Actually, you don't even need `.data`, just use `hash(a.tobytes())`, or `hash(df.values.tobytes())` if calling from a DataFrame. I've updated the original question. – mkurnikov Nov 08 '16 at 11:32

4 Answers


As of Pandas 0.20.1 (release notes), you can use pandas.util.hash_pandas_object (docs). It returns one hash value for each row of the dataframe (and works on Series etc. too).

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

If you want an overall hash, consider the following:

import hashlib
int(hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest(), 16)
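A quick sketch (with hypothetical sample data) checking that `hash_pandas_object` is deterministic across calls; unlike the built-in `hash()`, it uses a fixed hash key by default, so the values also survive interpreter restarts:

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

h1 = hash_pandas_object(df, index=True)
h2 = hash_pandas_object(df, index=True)
assert h1.equals(h2)  # per-row hashes are reproducible

# Collapse to one digest; sha256 over the raw bytes is order-sensitive,
# so reordering rows changes it (unlike .sum()):
digest = hashlib.sha256(h1.values).hexdigest()
```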
Joshua Shew
Jonathan Stray
  • Not 100% sure, but `hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()` will likely be less collision-prone than `.sum()`. – mathtick Jun 24 '18 at 19:01
  • @mathtick indeed, otherwise reordering rows gives the same hash. – Sergey Orshanskiy May 14 '19 at 04:51
  • The problem with `hash_pandas_object` is that it is not serializable, due to circular dependencies; see here: https://github.com/pandas-dev/pandas/issues/35097 and here: https://github.com/uqfoundation/dill/issues/374 – alessiosavi Jul 03 '20 at 09:49
  • If the column names are different, will they return different values? – Grant Culp May 06 '21 at 15:11
  • @GrantCulp `hash_pandas_object` does not hash the column names: the same data with different columns will result in the same hash. To avoid this you could hash `df.reset_index().T` instead of `df`, or add `df.columns.values.tobytes()` to the hash. – bckygldstn Apr 25 '22 at 17:13

Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).

import joblib
joblib.hash(df)
uut
  • This does not work for me! `(df1 == df2).all()` is True, but the hashes are different. – JulianWgs Feb 15 '20 at 12:50
  • @JulianWgs Do you have an example? For comparing two Series, if their name field is different, their hashes end up being different while their values are the same, but I couldn't replicate it for any DataFrame. – uut Feb 17 '20 at 22:21

I had a similar problem: check whether a dataframe has changed. I solved it by hashing the msgpack serialization string, which seems stable across different reloads of the same data.

import pandas as pd
import hashlib
DATA_FILE = 'data.json'

data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)

assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
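`to_msgpack` has since been removed from pandas, so on current versions the same check can be sketched with `to_csv` instead (using a hypothetical in-memory frame here in place of `data.json`); the CSV text includes column names and the index, so those affect the digest too:

```python
import hashlib

import pandas as pd

data1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
data2 = data1.copy()

def df_digest(df):
    # Serialize to CSV text and hash the bytes; stable across runs.
    return hashlib.md5(df.to_csv().encode('utf-8')).hexdigest()

assert df_digest(data1) == df_digest(data2)
```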
eMMe
  • I found `.to_msgpack()` stable in Python 3.6 but not in 3.5 (not sure why; it might have something to do with dictionaries being ordered in Python 3.6+). Just keep it simple and use `.to_csv().encode('utf-8')` instead. – ostrokach May 14 '17 at 03:37
  • Keeping ostrokach's comment (above) in mind: this solution has the out-of-the-box advantage of dealing with unhashable dataframe elements (in contrast with `pd.util.hash_pandas_object`). – keepAlive Feb 11 '19 at 08:07
  • As of, I think, pandas 1.0, `df.to_msgpack()` is deprecated. The recommended alternative to `.to_msgpack()` in the pandas documentation is pyarrow, but that opens up a whole new can of worms – Ben Lindsay Apr 15 '21 at 01:00
  • Also `data1.values.tobytes()` might return deterministic values for numeric dataframe contents, but if you have string values in your dataframe, you'll get a different bytestring for different python sessions. Might match within the same python session though – Ben Lindsay Apr 15 '21 at 21:07
  • Unfortunately `.values` fails to capture the uniqueness of a DataFrame, with `print(hashlib.md5(pd.DataFrame({'X': []}).values.tobytes()).hexdigest())` and `print(hashlib.md5(pd.DataFrame({'Y': []}).values.tobytes()).hexdigest())` producing the same hash. – Eytan Nov 22 '21 at 01:49
  • [`df.to_msgpack()` was deprecated in 2019](https://github.com/pandas-dev/pandas/issues/27722). – Janosh Oct 23 '22 at 16:35

This function seems to work fine:

from hashlib import sha256
def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
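A quick usage sketch. One caveat: `str(df.values)` goes through numpy's repr, which elides elements of large arrays with `...`, so two large frames differing only in the elided middle could collide; this is safest for small dataframes:

```python
from hashlib import sha256

import pandas as pd

def hash_df(df):  # the function above, repeated to keep the sketch self-contained
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()

df = pd.DataFrame({'A': [0, 1], 'B': ['x', 'y']})
assert hash_df(df) == hash_df(df.copy())  # deterministic, also across runs
```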
edmz
  • This does not really answer the question. If you have a different question, you can ask it by clicking [Ask Question](https://stackoverflow.com/questions/ask). To get notified when this question gets new answers, you can [follow this question](https://meta.stackexchange.com/q/345661). Once you have enough [reputation](https://stackoverflow.com/help/whats-reputation), you can also [add a bounty](https://stackoverflow.com/help/privileges/set-bounties) to draw more attention to this question. - [From Review](/review/late-answers/30291028) – Emi OB Nov 09 '21 at 16:08
  • Thank you for your advice I made a [separate question here](https://stackoverflow.com/questions/69890648/why-dont-we-get-the-same-hash-value-with-hashlib-sha256-when-passing-the-exact) – edmz Nov 10 '21 at 09:00