13

Why it is not a good practice to compare two objects by serializing them and then compare the strings like in the following example?

public class Obj
{
    public int Prop1 { get; set; }
    public string Prop2 { get; set; }
}

public class Comparator<T> : IEqualityComparer<T>
{
    public bool Equals(T x, T y)
    {
        return JsonConvert.SerializeObject(x) == JsonConvert.SerializeObject(y);
    }

    public int GetHashCode(T obj)
    {
        return JsonConvert.SerializeObject(obj).GetHashCode();
    }
}

Obj o1 = new Obj { Prop1 = 1, Prop2 = "1" };
Obj o2 = new Obj { Prop1 = 1, Prop2 = "2" };

bool result = new Comparator<Obj>().Equals(o1, o2);

I have tested it and it works, it is generic so it could stand for a great diversity of objects, but what I am asking is which are the downsides of this approach for comparing objects?

I have seen it has been suggested in this question and it received some upvotes but I can't figure it out why this is not considered the best way, if somebody wants to compare just the values of the properties of two objects?

EDIT : I am strictly talking about Json serialize, not XML.

I am asking this because I want to create a simple and generic Comparator for a Unit Test project, so the performance of comparison does not bother me so much, as I know this may be one of the biggest down-sides. Also the typeless problem can be handled using in case of Newtonsoft.Json the TypeNameHandling property set to All.

meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
  • 4
    Since you are already using Json.NET, it provides an API method [`JToken.DeepEquals()`](http://www.newtonsoft.com/json/help/html/DeepEquals.htm) for comparing two serialized objects. – dbc Jul 16 '16 at 13:29
  • 2
    you need to replace `return obj.GetHashCode()`, with a new hasher, the default hasher uses the same logic as Equals as you have changed the equals behaviour the hash is now incorrect, as many processes only call Equals after checking if Hashes are equal you may get very odd results as it stands – MikeT Jul 18 '16 at 15:13

10 Answers10

16

The primary problem is that it is inefficient

As an example imagine this Equals function

public bool Equals(T x, T y)
{
    return x.Prop1 == y.Prop1
        && x.Prop2 == y.Prop2
        && x.Prop3 == y.Prop3
        && x.Prop4 == y.Prop4
        && x.Prop5 == y.Prop5
        && x.Prop6 == y.Prop6;
}

if prop1 are not the same then the other 5 compares never need to be checked, if you did this with JSON you would have to convert the entire object into a JSON string then compare the string every time, this is on top of serialization being an expensive task all on its own.

Then the next problem is serialization is designed for communication e.g. from memory to a file, across a network, etc. If you have leveraged serialization for comparison you can degrade your ability to use it for it normal use, i.e. you can't ignore fields not required for transmission because ignoring them might break your comparer.

Next JSON in specific is Type-less which means than values than are not in anyway shape or form equal may be mistaken for being equal, and in the flipside values that are equal may not compare as equal due to formatting if they serialize to the same value, this is again unsafe and unstable

The only upside to this technique is that is requires little effort for the programmer to implement

MikeT
  • 5,398
  • 3
  • 27
  • 43
  • I know this only for Newtonsoft, but there surely must be for other serialize libraries: [TypeNameHandling](http://www.newtonsoft.com/json/help/html/serializetypenamehandling.htm) erases the typeless problem, so it will work fine for classes in same project. – meJustAndrew Jul 18 '16 at 15:46
  • 1
    @meJustAndrew If you don't use JSON and say go with XML then the type problem goes away but the string size goes way up not to mention that XML have several ways to define the same object said attributes in a different order that would cause other issue, if you don't care about using serialisation for transmission and efficiency isn't important then there really isn't any reason not to use a simple but sub optimal approach – MikeT Jul 18 '16 at 15:52
  • 3
    ps Writing in English opposed to American doesn't count as a Typo ;) – MikeT Jul 18 '16 at 15:55
  • Okay, but consider using Newtonsoft like this, the only downside is performance? – meJustAndrew Jul 18 '16 at 17:13
  • 1
    You can drive your car 60Mph in second gear because its easier than using the gear shift, as long as you don't care about efficiency and performance, that is your choice. but you wont get anyone to recommend it – MikeT Jul 20 '16 at 14:32
  • that is relatively correct, but if you would drive your car at 60Mph in second gear, it will finally break because the engine is overused, and you will run out of gas faster. These are the kind of arguments I am looking for, and if the performance is the only concern, then it's really good to compare objects like this, but I doubt so. – meJustAndrew Jul 20 '16 at 14:45
9

You probably going to keep adding a bounty to the question until somebody tells you that it is just fine to do this. So you got it, don't hesitate to take advantage of the NewtonSoft.Json library to keep the code simple. You just need some good arguments to defend your decision if your code is ever reviewed or if somebody else takes over the maintenance of the code.

Some of the objections they may raise, and their counter-arguments:

This is very inefficient code!

It certainly is, particularly GetHashCode() can make your code brutally slow if you ever use the object in a Dictionary or HashSet.

Best counter-argument is to note that efficiency is of little concern in a unit test. The most typical unit test takes longer to get started than to actually execute and whether it takes 1 millisecond or 1 second is not relevant. And a problem you are likely to discover very early.

You are unit-testing a library you did not write!

That is certainly a valid concern, you are in effect testing NewtonSoft.Json's ability to generate a consistent string representation of an object. There is cause to be alarmed about this, in particular floating point values (float and double) are never not a problem. There is also some evidence that the library author is unsure how to do it correctly.

Best counter-argument is that the library is widely used and well maintained, the author has released many updates over the years. Floating point consistency concerns can be reasoned away when you make sure that the exact same program with the exact same runtime environment generates both strings (i.e. don't store it) and you make sure the unit-test is built with optimization disabled.

You are not unit-testing the code that needs to be tested!

Yes, you would only write this code if the class itself provides no way to compare objects. In other words, does not itself override Equals/GetHashCode and does not expose a comparator. So testing for equality in your unit test exercise a feature that the to-be-tested code does not actually support. Something that a unit test should never do, you can't write a bug report when the test fails.

Counter argument is to reason that you need to test for equality to test another feature of the class, like the constructor or property setters. A simple comment in the code is enough to document this.

Community
  • 1
  • 1
Hans Passant
  • 922,412
  • 146
  • 1,693
  • 2,536
  • Reading this answer three years later and I feel the need to upvote it again. In the meantime, libraries that support this approach grew like [deep equal](https://www.nuget.org/packages/DeepEqual/), and I still believe that this is the right way to do it in many cases. – meJustAndrew Sep 30 '20 at 08:02
5

By serializing your objects to JSON, you are basically changing all of your objects to another data type and so everything that applies to your JSON library will have an impact on your results.

So if there is a tag like [ScriptIgnore] in one of the objects, your code will simply ignore it since it has been omitted from your data.

Also, the string results can be the same for objects that are not the same. like this example.

static void Main(string[] args)
{
    Xb x1 = new X1()
    {
        y1 = 1,
        y2 = 2
    };
    Xb x2 = new X2()
    {
        y1 = 1,
        y2= 2
    };
   bool result = new Comparator<Xb>().Equals(x1, x2);
}
}

class Xb
{
    public int y1 { get; set; }
}

class X1 : Xb
{
    public short y2 { get; set; }
}
class X2 : Xb
{
    public long y2 { get; set; }
}

So as you see x1 has a different type from x2 and even the data type of the y2 is different for those two, but the json results will be the same.

Other than that, since both x1 and x2 are from type Xb, I could call your comparator without any problems.

Nimantha
  • 6,405
  • 6
  • 28
  • 69
Ashkan S
  • 10,464
  • 6
  • 51
  • 80
  • as in the comment of the other answer : *I know this only for Newtonsoft, but there surely must be for other serialize libraries: [TypeNameHandling](http://www.newtonsoft.com/json/help/html/serializetypenamehandling.htm) erases the typeless problem, so it will work fine for classes in same project.* – meJustAndrew Jul 18 '16 at 15:47
4

I would like to correct the GetHashCode at the beginning.

public class Comparator<T> : IEqualityComparer<T>
{
    public bool Equals(T x, T y)
    {
        return JsonConvert.SerializeObject(x) == JsonConvert.SerializeObject(y);
    }
    public int GetHashCode(T obj)
    {
        return JsonConvert.SerializeObject(obj).GetHashCode();
    }
}

Okay, next, we discuss the problem of this method.


First, it just won't work for types with looped linkage.

If you have a property linkage as simple as A -> B -> A, it fails.

Unfortunately, this is very common in lists or map that interlink together.

Worst, there is hardly an efficient generic loop detection mechanism.


Second, comparison with serialization is just inefficient.

JSON needs reflection and lots of type judging before successfully compile its result.

Therefore, your comparer will become a serious bottleneck in any algorithm.

Usually, even if in thousands of records cases, JSON is considered slow enough.


Third, JSON has to go over every property.

It will become a disaster if your object links to any big object.

What if your object links to a big file?


As a result, C# simply leaves the implementation to user.

One has to know his class thoroughly before creating a comparator.

Comparison requires good loop detection, early termination and efficiency consideration.

A generic solution simply does not exist.

meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
Tommy
  • 3,044
  • 1
  • 14
  • 13
  • 1
    You have got your points, as for looped linkage, there exists *ReferenceLoopHandling* so it can be used with option *Ignore* just not to serialize an object with points to itself again, so in this case it will work, and what you secondly mention is that the serialize is slow, which is true. Why I am asking this question is because I want to make a custom comparator in the Unit Testing project, so it doesn't bother me that it will run with a second delay or not. It also saves me of days of programming to make custom comparators. I should have added this to the question too, just to clarify. – meJustAndrew Jul 21 '16 at 09:31
1

First, I notice that you say "serialize them and then compare the strings." In general, ordinary string comparison will not work for comparing XML or JSON strings, you have to be a little more sophisticated than that. As a counterexample to string comparison, consider the following XML strings:

<abc></abc>
<abc/>

They are clearly not string equal but they definitely "mean" the same thing. While this example might seem contrived, it turns out that there are quite a few cases where string comparison doesn't work. For example, whitespace and indentation are significant in string comparison but may not be significant in XML.

The situation isn't all that much better for JSON. You can do similar counterexamples for that.

{ abc : "def" }
{
   abc : "def"
}

Again, clearly these mean the same thing, but they're not string-equal.

Essentially, if you're doing string comparison you're trusting the serializer to always serialize a particular object in exactly the same way (without any added whitespace, etc), which ends up being remarkably fragile, especially given that most libraries do not, to my knowledge, provide any such guarantee. This is particularly problematic if you update the serialization libraries at some point and there's a subtle difference in how they do the serialization; in this case, if you try to compare a saved object that was serialized with the previous version of the library with one that was serialized with the current version then it wouldn't work.

Also, just as a quick note on your code itself, the "==" operator is not the proper way to compare objects. In general, "==" tests for reference equality, not object equality.

One more quick digression on hash algorithms: how reliable they are as a means of equality testing depends on how collision resistant they are. In other words, given two different, non-equal objects, what's the probability that they'll hash to the same value? Conversely, if two objects hash to the same value, what are the odds that they're actually equal? A lot of people take it for granted that their hash algorithms are 100% collision resistant (i.e. two objects will hash to the same value if, and only if, they're equal) but this isn't necessarily true. (A particularly well-known example of this is the MD5 cryptographic hash function, whose relatively poor collision resistance has rendered it unsuitable for further use). For a properly-implemented hash function, in most cases the probability that two objects that hash to the same value are actually equal is sufficiently high to be suitable as a means of equality testing but it's not guaranteed.

  • Consider Newtonsoft.Json as library, so the XML problem do not apply in this case. Also, as I have written the comparator in my question, it will be used only a single library to compare two objects, so the case when a library is outdated is not a problem. In this case, all of your answer does not apply to the question. – meJustAndrew Jul 19 '16 at 09:19
  • I disagree. As I mentioned in my post the problem applies to *both* XML *and* JSON. You can't just do naive string equality and expect it to work as you're expecting. To correctly compare XML or JSON strings, you have to consider their *meaning*, **not** just string equality. The only way your algorithm could *possibly* work in the general case is if you can *guarantee* that the serializer will *always* be *exactly* consistent about formatting, and the documentation provides no such assurance. – EJoshuaS - Stand with Ukraine Jul 19 '16 at 15:04
1

These are some of the downsides:

a) Performance will be increasingly bad the deeper your object tree is.

b) new Obj { Prop1 = 1 } Equals new Obj { Prop1 = "1" } Equals new Obj { Prop1 = 1.0 }

c) new Obj { Prop1 = 1.0, Prop2 = 2.0 } Not Equals new Obj { Prop2 = 2.0, Prop1 = 1.0 }

  • as in the comment of the other answer : *I know this only for Newtonsoft, but there surely must be for other serialize libraries: [TypeNameHandling](http://www.newtonsoft.com/json/help/html/serializetypenamehandling.htm) erases the typeless problem, so it will work fine for classes in same project*. In this case, only the performance is valid, but thank you! – meJustAndrew Jul 19 '16 at 09:15
1

Object comparison using serialize and then comparing the strings representations in not effective in the following cases:

When a property of type DateTime exists in the types that need to be compared

public class Obj
{
    public DateTime Date { get; set; }
}

Obj o1 = new Obj { Date = DateTime.Now };
Obj o2 = new Obj { Date = DateTime.Now };

bool result = new Comparator<Obj>().Equals(o1, o2);

It will result false even for objects very close created in time, unless they don't share the exactly same property.


For objects that have double or decimal values which need to be compared with an Epsilon to verify if they are eventually very close to each other

public class Obj
{
    public double Double { get; set; }
}

Obj o1 = new Obj { Double = 22222222222222.22222222222 };
Obj o2 = new Obj { Double = 22222222222222.22222222221 };

bool result = new Comparator<Obj>().Equals(o1, o2);

This will also return false even the double values are really close to each other, and in the programs which involves calculation, it will become a real problem, because of the loss of precision after multiple divide and multiply operations, and the serialize does not offer the flexibility to handle these cases.


Also considering the above cases, if one wants not to compare a property, it will face the problem of introducing a serialize attribute to the actual class, even if it is not necessary and it will lead to code pollution or problems went it will have to actually use serialization for that type.

Note: These are some of the actual problems of this approach, but I am looking forward to find others.

meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
1

For unit tests you don`t need write own comparer. :)

Just use modern frameworks. For example try FluentAssertions library

o1.ShouldBeEquivalentTo(o2);
Dmitrii Zyrianov
  • 2,208
  • 1
  • 21
  • 27
1

Serialization was made for storing an object or sending it over a pipe (network) that is outside of the current execution context. Not for doing something inside the execution context.

Some serialized values might not be considered equal, which in fact they are : decimal "1.0" and integer "1" for instance.

For sure you can just like you can eat with a shovel but you don't because you might break your tooth!

meJustAndrew
  • 6,011
  • 8
  • 50
  • 76
remi bourgarel
  • 9,231
  • 4
  • 40
  • 73
0

You can use System.Reflections namespace to get all the properties of the instance like in this answer. With Reflection you can compare not only public properties, or fields (like using Json Serialization), but also some private, protected, etc. to increase the speed of calculation. And of course, it's obvious that you don't have to compare all properties or fields of instance if two objects are different (excluding the example when only the last property or field of object differs).

Community
  • 1
  • 1
Taras Shevchuk
  • 326
  • 3
  • 14