2

I want to take any object and get a guid that represents that object.

I know that entails a lot of things. I am looking for a good-enough solution for common applications.

My specific use case is for caching, I want to know that the object used to create the thing I am caching has already made one in the past. There would be 2 different types of objects. Each type contains only public properties, and may contain a list/ienumable.

Assuming the object could be serializable my first idea was to serialize it to json (via native jsonserlizer or newtonsoft) and then take the json string and convert that to a uuid version 5 as detailed in a gist here How can I generate a GUID for a string?

My second approach if it's not serializable ( for example contained a dictionary ) would be to use reflection on the public properties to generate a unique string of some sort and then convert that to uuid version 5.

Both approaches use uuid version 5 to take a string to guid. Is there a proven c# class that makes valid uuid 5 guids? The gist looks good but want to be sure.

I was thinking of making the c# namespace and type name be the namespace for the uuid 5. Is that a valid use of namespace ?

My first approach is good enough for my simple use case but I wanted to explore the second approach as it's more flexible.

If creating the guid couldn't guarantee reasonable uniqueness it should throw an error. Surely super complicated objects would fail. How might I know that is the case if using reflection?

I am looking for new approaches or concerns/implementations to the second approach.


Edit: The reason why I bounty/reopened this almost 3 years later is because I need this again (and for caching again); but also because of the introduction of the generic unmanaged constraint in c# 7.3. The blog post at http://devblogs.microsoft.com/premier-developer/dissecting-new-generics-constraints-in-c-7-3/ seems to suggest that if the object can obey the unmanaged spec you can find a suitable key for a key-value store. Am I misunderstanding something?

This is still limited because the object (generic) must obey the unmanaged type constraint which is very limiting (no strings, no arrays, etc), but its one step closer. I don't completely understand why the method of getting the memory stream and getting a sha1 hash cant be done on not unmanaged typed.

I understand that reference types are pointing to places in memory and its not as easy to get the memory that represents all whole object; but it feels doable. After all, objects eventually are made up a bunch of implementations of unmanaged types (string is an array chars, etc)

PS: The requirement of GUID is loose, any integer/string at or under 512 bits would suffice

ParoX
  • 5,685
  • 23
  • 81
  • 152
  • 1
    What you mean is called a *hash*. You can use MD5, for example. – Rob Sep 08 '16 at 06:24
  • Aren't you looking for `GetHashCode`? https://msdn.microsoft.com/en-us/library/system.object.gethashcode(v=vs.110).aspx – Tomas Smagurauskas Sep 08 '16 at 06:24
  • @tomas Hash codes are not guaranteed to be unique, just hopefully collision free. They also are specific to that object type – ParoX Sep 08 '16 at 06:29
  • @robert yes the uuid 5 uses sha1 to take a string to a guid. The issue is finding a string to represent that object – ParoX Sep 08 '16 at 06:31
  • 4
    @ParoX Well, talking about **any object** and getting a **unique** id for any object - any object is an **unlimited** bunch of objects that you are going to represent collision free with a guid which is limited to 16 bytes and therefore has **limited** unique values (by design). Do you see the core problem? – Sir Rufo Sep 08 '16 at 06:55
  • @sir rufo that is why I suggest a good enough for common application. The expectation that I won't hit a sha1 collision is well within reasonable for me – ParoX Sep 08 '16 at 06:58
  • What is the real use case for this generated id? Sync data? – Sir Rufo Sep 08 '16 at 07:05
  • As suggested by @TomasSmagurauskas, you can override `GetHashCode` method to generate unique GUID, considering the point suggested by @Sir Rufo – G J Aug 02 '19 at 03:50
  • @GauravSinghJantwal I am looking for ANY object to GUID translator. That means if I pass it `int[] = new []{1,2,3}` it should be able to get the same Id each time. With the default implementation of GetHashCode (Object.GetHashCode) it just gives the reference memory address so GetHashCode is not a good way to solve this problem (Microsoft even states GetHashCodes are not meant for identity but instead for hashtables) – ParoX Aug 02 '19 at 03:55
  • 3
    This sounds like an XY problem https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem How are you going to make the difference between `new object()` and `new object()` for example? How are you going to make the difference between two big XmlDocument beyond hashing the whole XML (performance anyone?)? There's no magic bullet that will work for *any* object. Note GetHashCode is not "for hashtables", it's for (in)equality comparison. If two objects GetHashCode() are different, then objects are different (reverse is not true). It's true that it doesn't solve your problem. – Simon Mourier Aug 02 '19 at 14:48
  • I suppose my question is, why does it have to be a GUID/UUID? Gathering the object's property values and feeding them into a MD5 or SHA hash should give you what you're looking for as @Robert suggested years ago. What part of your use case am I missing? – Tombatron Aug 02 '19 at 18:15
  • As denoted in my edited PS an explicit GUID is not needed, I probably meant "globally unique id" in the OP. Looping through the property values recursively and just keep on hashing them probably would fit my need at face value. If that is what @Robert suggest it was lost on me, because as far as I know there is no just "generate hash for object" function -- also what is the hash of a null value? If you have two objects `int[]{1,2,3}` and `long[]{1,2,3}` they would hash to the same thing because value wise they are the same, you have to somehow incorporate type signatures as well – ParoX Aug 02 '19 at 23:03
  • I'm not sure what the question is - the first example in your link basically tells you how to do it - the only missing step is maybe to convert the byte[] into a string (or write your own IEqualityComparer) for use in a Dictionary. If you don't want the unmanaged constraint, you must write your own serializer to convert the object to byte[] yourself. – MineR Aug 05 '19 at 08:19

3 Answers3

0

As others have said in comments, it sounds like GetHashCode might do the trick for you if you're willing to settle for int as your key. If not, there is a Guid constructor that takes byte[] of length 16. You could try something like the following

using System.Linq;
class Foo
{
    public int A { get; set; }
    public char B { get; set; }
    public string C { get; set; }
    public Guid GetGuid()
    {
        byte[] aBytes = BitConverter.GetBytes(A);
        byte[] bBytes = BitConverter.GetBytes(B);
        byte[] cBytes = BitConverter.GetBytes(C);
        byte[] padding = new byte[16];
        byte[] allBytes =
            aBytes
                .Concat(bBytes)
                .Concat(cBytes)
                .Concat(padding)
                .Take(16)
                .ToArray();
        return new Guid(allBytes);
    }
}
Curtis Lusmore
  • 1,822
  • 15
  • 16
  • This is specfic to this class. I suppose getting the bytes via reflection could be a thng. Also hashcode isn't guaranteed to be unique, and it is not global. Class A hashcode may be the same as Class B – ParoX Sep 08 '16 at 07:01
0

As said in the comments, there is no bullet entirely out of silver here, but a few that come quite close. Which of them to use depends on the types you want to use your class with and your context, e.g. when do you consider two objects to be equal. However, be aware that you will always face possible conflicts, a single GUID will not be sufficient to guarantee collision avoidance. All you can do is to decrease the probability of a collision.

In your case,

already made one in the past

sounds like you don't want to refer to reference equality but want to use a notion of value equality. The simplest way to do so is to trust that the classes implement equality using value equality because in that case, you would already be done using GetHashCode but that has a higher probability of collisions because it is only 32bit. Further, you would assume that whoever wrote the class did a good job, which is not always a good assumption to be made, particularly since people tend to blame you rather then themselves.

Otherwise, your best chances are serialization combined with a hashing algorithm of your choice. I would recommend MD5 because it is the fastest and produces the 128bit you need for a GUID. If you say your types consist of public properties only, I would suggest to use an XmlSerializer like so:

    private MD5 _md5 = new MD5CryptoServiceProvider();
    private Dictionary<Type, XmlSerializer> _serializers = new Dictionary<Type, XmlSerializer>();
    public Guid CreateID(object obj)
    {
      if (obj == null) return Guid.Empty;
      var type = obj.GetType();
      if (!_serializers.TryGetValue(type, out var serializer))
      {
        serializer = new XmlSerializer(type);
        _serializers.Add(type, serializer);
      }
      using (var stream = new MemoryStream())
      {
         serializer.Serialize(stream, obj);
         stream.Position = 0;
         return new Guid(_md5.ComputeHash(stream));
      }
    }

Just about all serializers have their drawbacks. XmlSerializer is not capable of serializing cyclic object graphs, DataContractSerializer requires your types to have dedicated attributes and also the old serializers based on the SerializableAttribute require that attribute to be set. You somehow have to make assumptions.

Georg
  • 5,626
  • 1
  • 23
  • 44
0

The problem of equality is a difficult one.
Here some thoughts on how you could solve your problem.

Hashing a serialized object
One method would be to serialize an object and then hash the result as proposed by Georg.
Using the md5 checksum gives you a strong checksum with the right input.
But getting it right is the problem.

You might have trouble using a common serialization framework, because:

  • They don't care whether a float is 1.0 or 1.000000000000001.
  • They might have a different understanding about what is equal than you / your employer.
  • They bloat the serialized text with unneeded symbols. (performance)
  • Just a little deviation in the serialized text causes a large deviation in the hashed GUID/UUID.

That's why, you should carefully test any serialization you do.
Otherwise you might get false possitives/negatives for objects (mostly false negatives).

Some points to think about:

  • Floats & Doubles:
    Always write them the same way, preferably with the same number of digits to prevent something like 1.000000000000001 vs 1.0 from interfering.
  • DateTime, TimeStamp, etc.:
    Apply a fixed format that wont change and is unambiguous.
  • Unordered collections:
    Sort the data before serializing it. The order must be unambiguous
  • Strings:
    Is the equality case-sensitive? If not make all the strings lower or upper case.
    If necessary, make them culture invariant.
  • More:
    For every type, think carefully what is equal and what is not. Think especially about edge cases. (float.NaN, -0 vs 0, null, etc.)

It's up to you whether you use an existing serializer or do it yourself.
Doing it yourself is more work and error prone, but you have full control over all aspects of equality and serialization.
Using an existing serializer is also error prone, because you need to test or prove whether the results are always like you want.


Introducing an unambiguous order and use a tree
If you have control over the source code, you can introduce a custom order function.
The order must take all properties, sub objects, lists, etc. into account. Then you can create a binary tree, and use the order to insert and lookup objects.

The same problems as mentioned by the first approach still apply, you need to make sure that equal values are detected as such. The big O performance is also worse than using hashing. But in most real live examples, the actual performance should be comparable or at least fast enough.

The good thing is, you can stop comparing two objects, as soon as you found a property or value that is not equal. Thus no need to always look at the whole object. A binary tree needs O(log2(n)) comparisons for a lookup, thus that would be quite fast.

The bad thing is, you need access to all actual objects, thus keep them in memory. A hashtable needs only O(1) comparisons for a lookup, thus would even be faster (theoretically at least).


Put them in a database
If you store all your objects in a database, then the database can do the lookup for you.
Databases are quite good in comparing objects and they have built in mechanisms to handle the equality/near equality problem.

I'm not a database expert, so for this option, someone else might have more insight on how good this solution is.

Chillersanim
  • 455
  • 3
  • 13