
I'm trying to make a function that can take an argument and return a unique, short expression of that data.

A hash.

There's a whole hashlib module for doing this, but hashlib only takes bytes (for example, encoded strings). I want to easily hash anything: lists, functions, classes, anything.

How can I either convert anything into a unique string representation so I can hash it, or better yet, directly hash anything?

I thought you might be able to get the bytes() representation of an object, but this needs a special encoding for whatever it's given, so I'm not sure if there's a solution there.
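
For a plain string the encoding step is small. A minimal sketch, assuming UTF-8 is an acceptable encoding (the variable names are illustrative only):

```python
import hashlib

text = "some data"

# hashlib works on bytes, so a str has to be encoded first.
data = text.encode("utf-8")
digest = hashlib.sha256(data).hexdigest()

# Passing the str directly raises TypeError in Python 3.
try:
    hashlib.sha256(text)
    raised = False
except TypeError:
    raised = True

print(digest)
print(raised)
```

The harder part, which the rest of this question is about, is getting from an arbitrary object to bytes in the first place.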

def hash_any(thing):
    # convert thing to a string of its unique byte data
    # return hashlib.sha256(byte_data_str)
    ...

How would you go about doing this?

Edit: I've found the correct vernacular to find what I'm looking for. This is what I mean:

Alternative to python hash function for arbitrary objects

What is the quickest way to hash a large arbitrary object?

Create Hash for Arbitrary Objects?

I'm sure one of these contains the solution I seek.

MetaStack
  • `str(variable)` works on any data type in Python – R10t-- Feb 06 '20 at 21:36
  • @R10t-- Yes it does, and that was my first thought. But, correct me if I'm wrong, doesn't `str()` often optimize for printing/viewing an object? For example, say the `thing` is a thousand-row DataFrame. If you `print(thing)`, it prints the headers of the DataFrame, then something like ten rows, then `...`, then the last ten rows. I assumed this is because `__str__` has been overridden in pandas DataFrames to provide this functionality. But I want to make sure I hash all the data, not just the viewable part, which is why I thought of getting the bytes of the whole thing. – MetaStack Feb 06 '20 at 21:40
  • If you want an exact string representation for *any* object, then for most custom objects you need to write your own. Even a simple item such as a float might need to be represented in hex. – Jongware Feb 06 '20 at 21:46
  • @Legit Stack Ahh, you may be right. I'm not sure about the `str()` implementation in pandas. It is very possible that pandas overrides `__str__` to pretty-print objects if they are too large. All Python primitives, though, print all of their data without truncating it. You could require the parameter to be a primitive and ensure users send in something like `dataframe.data` instead of the DataFrame itself – R10t-- Feb 06 '20 at 21:48
  • @usr2564301 what's weird to me is that the data representation is unique at some level of the hierarchy - why isn't that level of the hierarchy available to me? – MetaStack Feb 06 '20 at 21:48
  • _"the data representation is unique at some level of the hierarchy"_ But is that the sense of uniqueness you want? Specifically, should `my_hash([5])==my_hash([5])` be true or false? What about `my_hash(MyObject(5))==my_hash(MyObject(5))`? – ShapeOfMatter Feb 06 '20 at 22:44
  • @ShapeOfMatter Cool username, love it. Those should both be true. But if `str(MyObject('abc'))` returns `'a...c'` and `str(MyObject('axc'))` also returns `'a...c'`, then I'm in trouble: I can't rely on `str`. At some point, those two objects are different because they have different data (a different pattern). That's the layer I'm looking to reference: the literal pattern layer. I don't care what it looks like, and I'm not going to rely on it for anything except its ability to differentiate between different patterns of data. I just want different data to produce different hashes, that's all. – MetaStack Feb 06 '20 at 23:05
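
The collision described in the last comment can be reproduced with a small sketch. `MyObject` here is a hypothetical class whose `__str__` abbreviates its data, and `sha256_hex` is an illustrative helper. Serializing `vars(obj)` is just one possible way to reach the underlying data, and it assumes the instance attributes are JSON-serializable.

```python
import hashlib
import json


class MyObject:
    """Hypothetical class whose __str__ hides part of its data."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        # Show only the first and last character, like a truncated printout.
        return f"{self.data[0]}...{self.data[-1]}"


def sha256_hex(text):
    """Illustrative helper: SHA-256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a, b = MyObject("abc"), MyObject("axc")

# str() gives the same text for both objects, so the hashes collide.
same_via_str = sha256_hex(str(a)) == sha256_hex(str(b))

# Serializing the instance attributes keeps the full data.
same_via_vars = (
    sha256_hex(json.dumps(vars(a), sort_keys=True))
    == sha256_hex(json.dumps(vars(b), sort_keys=True))
)

print(same_via_str)   # True
print(same_via_vars)  # False
```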

1 Answer


This is working for now. It's not optimal or efficient, but it's fine for what I need.

import json


def string_this(thing):
    '''
    https://stackoverflow.com/questions/60103855/how-to-convert-anything-to-a-string-bytes-object-so-it-can-be-hashed
    Attempts to turn anything into a string that represents its underlying
    data as accurately as possible, so that it can be hashed.
    '''
    if isinstance(thing, str):
        return thing
    try:
        # objects like DataFrames that provide their own to_json method
        return thing.to_json()
    except Exception:
        try:
            # other things without a built-in to_json method;
            # json.dumps returns a string (json.dump would need a file object)
            return json.dumps(thing)
        except Exception:
            # hopefully it's a Python primitive type
            return str(thing)
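
To connect this back to the `hash_any` idea in the question, here is a rough sketch of one way to finish the job. It is not the code above: it swaps in `json.dumps` with `sort_keys=True` and falls back to `pickle.dumps`, and `to_bytes` is a hypothetical helper name. It assumes the object is either JSON-serializable or picklable; anything else (a lambda, for example) will raise from `pickle.dumps`.

```python
import hashlib
import json
import pickle


def to_bytes(thing):
    """Best-effort conversion of an object to bytes (illustrative helper)."""
    try:
        # sort_keys makes dicts with the same items serialize identically.
        return json.dumps(thing, sort_keys=True).encode("utf-8")
    except (TypeError, ValueError):
        # Not JSON-serializable (bytes, complex numbers, most class
        # instances, ...): fall back to pickle.
        return pickle.dumps(thing)


def hash_any(thing):
    """Return a SHA-256 hex digest of the serialized object."""
    return hashlib.sha256(to_bytes(thing)).hexdigest()


print(hash_any([1, 2, 3]) == hash_any([1, 2, 3]))
print(hash_any({"a": 1, "b": 2}) == hash_any({"b": 2, "a": 1}))
print(hash_any("abc") == hash_any("axc"))
```

A few caveats with this sketch: JSON does not distinguish tuples from lists, so `(1, 2)` and `[1, 2]` get the same digest. Pickle output is not guaranteed to be byte-identical across Python versions or pickle protocols, and containers without a fixed order (such as sets) may pickle differently between runs, so the digests are best treated as stable only within one environment.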
MetaStack