10

I would like to see if DLLs (possibly compiled across different machines) are the same. To do that, I was loading each DLL and computing its MD5 hash, which failed for DLLs compiled across different machines (but from the same source). This seems to be due to other metadata which is added at compilation time (as someone mentioned here).
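For reference, this is roughly what I was doing (a minimal sketch; the path is just an example):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class DllHasher
{
    static string Md5OfFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            // Hash the raw bytes of the DLL as it sits on disk.
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        }
    }

    static void Main()
    {
        Console.WriteLine(Md5OfFile(@"C:\libs\MyLibrary.dll"));
    }
}
```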

I thought of reverse engineering the whole DLL and seeing if the code matches; however, I have two problems with this:

  • I can only find standalone tools which do this; I can't seem to find a C# library or something similar which does what I need.
  • I am not 100% sure that the decompiled source will be the same for the same source compiled across different machines.

Any hints, tips and pointers would be appreciated.

npinti
  • This will be an extremely difficult task to get working 100%. But good luck; looking forward to possible solutions ;p – leppie Aug 21 '12 at 07:20
  • Interesting problem. I don't _think_ the compiler is guaranteed to generate the same IL for the same code, even on the same machine, so maybe even Reflector won't give the same result. – Rawling Aug 21 '12 at 07:21
  • _"I would like to see if DLLs [...] are the same."_ - you could also just improve your release planning. Or, what problem are you trying to solve? – CodeCaster Aug 21 '12 at 07:22
  • The compiler is not guaranteed to produce the same result twice, even on the same machine from the same source. See http://blogs.msdn.com/b/ericlippert/archive/2012/05/31/past-performance-is-no-guarantee-of-future-results.aspx – Mike Zboray Aug 21 '12 at 07:25
  • @CodeCaster: I do agree with you, but the DLLs have been used for quite some time and I have been told that versioning is not something I can rely on. – npinti Aug 21 '12 at 07:31
  • @mikez: The most important part: "The C# compiler embeds a freshly generated GUID in every assembly, every time you run it, thereby ensuring that no two assemblies are ever bit-for-bit identical" – Daniel Hilgarth Aug 21 '12 at 07:32
  • @DanielHilgarth: Even ignoring that, the metadata will likely not match, more so in the presence of 'compiler generated code'. – leppie Aug 21 '12 at 07:34
  • @leppie: Yes, but you don't need to go any further. Because of this GUID, comparing MD5 hashes of two assemblies compiled from the same source code will always fail. – Daniel Hilgarth Aug 21 '12 at 07:36
  • @npinti: Your best solution would probably be NDepend, not a DLL, but a tool that does what you need. http://stackoverflow.com/questions/1280252/net-assembly-diff-compare-tool-whats-available – leppie Aug 21 '12 at 07:37
  • @DanielHilgarth: You can always make the GUID the same via Mono.Cecil, and then start comparing, but that is just the first of many many steps... – leppie Aug 21 '12 at 07:39
  • @DanielHilgarth the OP seemed at least aware of that issue. The post points out that there are many other 'undocumented' behaviors that can affect the compiler's output. Therefore even if one is able to work around the MVID issue the solution may work only for a limited set of compiler versions. The OP has not given us much additional context so we don't know if that would be acceptable. – Mike Zboray Aug 21 '12 at 07:47
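The per-compile GUID mentioned in these comments is the module version ID (MVID). A minimal sketch for inspecting it via reflection (the path is illustrative; two builds of the same source will print different MVIDs):

```csharp
using System;
using System.Reflection;

class MvidDump
{
    static void Main()
    {
        Assembly asm = Assembly.LoadFrom(@"C:\libs\MyLibrary.dll");
        // ManifestModule.ModuleVersionId is the GUID the compiler
        // regenerates on every build.
        Console.WriteLine(asm.ManifestModule.ModuleVersionId);
    }
}
```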

4 Answers

5

You might be right - it could be the metadata. I don't think that is the most likely explanation, though.

Another likely reason for the DLLs being different is that they were compiled against different versions of .NET, or possibly Mono.

There is no guarantee that decompiling the DLLs will yield identical code even if they were compiled from the same source. Indeed, given the optimizing nature of compilers, there is a tiny but real chance that slightly different sources compile to the same executable. For example, a loop may be unrolled - that is, turned into sequential, non-looping instructions - when this saves memory usage or CPU time.

If the programmer unrolls the loop manually and recompiles, that's an optimization the compiler had been doing anyway - presto, two different sources with identical output.
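For instance, here is a hypothetical pair of methods with the same observable behaviour, one written as a loop and one unrolled by hand, which an optimizer could compile to the same code:

```csharp
using System;

static class UnrollDemo
{
    // A small fixed-count loop...
    static int SumFirstFour(int[] a)
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += a[i];
        return sum;
    }

    // ...and its hand-unrolled equivalent.
    static int SumFirstFourUnrolled(int[] a)
    {
        return a[0] + a[1] + a[2] + a[3];
    }

    static void Main()
    {
        int[] data = { 1, 2, 3, 4 };
        Console.WriteLine(SumFirstFour(data) == SumFirstFourUnrolled(data)); // True
    }
}
```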

A better question would be what you're hoping to learn by comparing the two DLLs. If it's strictly for the sake of learning, that's awesome and to be commended; however, the amount of knowledge you'll need to make a meaningful study of this is quite high. You are likely to get better results by learning general, more widely applicable C#/.NET techniques.

Winfield Trail
  • This has nothing to do with optimization. It is how the C# compiler emits 'compiler generated code'. This is almost guaranteed to be unique, even if you compile the same code on the same PC a second apart. The problem is not the code layout or even the method bodies; it is what the metadata points to, and the order of the metadata, which can and probably will change between two compiles. – leppie Aug 21 '12 at 07:31
  • The problem we are facing is that we have different versions of the same DLLs (same name but different functionality, which changed as the DLL was improved) across different machines. We would like to create a list of DLLs and their versions, and that is why I was wondering if what I was after was possible. – npinti Aug 21 '12 at 07:35
  • @npinti Check my answer; I believe you should give them strong names. – Matías Fidemraizer Aug 21 '12 at 07:43
1

Sign the assembly using a strong name and you'll be able to tell with certainty whether two or more assemblies are the same one - or different - because the same assembly will have the same assembly version, the same public key token, and so on.

I doubt that two different developers would end up with the same private key if the source code and Visual Studio project aren't the same.
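A minimal sketch of comparing two assemblies by strong-name identity rather than by file hash (the paths are illustrative, and both files are assumed to be strong-named):

```csharp
using System;
using System.Linq;
using System.Reflection;

class StrongNameCompare
{
    static void Main()
    {
        // GetAssemblyName reads the identity without fully loading the assembly.
        AssemblyName a = AssemblyName.GetAssemblyName(@"C:\libs\A\MyLibrary.dll");
        AssemblyName b = AssemblyName.GetAssemblyName(@"C:\libs\B\MyLibrary.dll");

        bool sameIdentity =
            a.Name == b.Name &&
            a.Version == b.Version &&
            a.GetPublicKeyToken().SequenceEqual(b.GetPublicKeyToken());

        Console.WriteLine(sameIdentity ? "Same identity" : "Different");
    }
}
```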

Matías Fidemraizer
  • Even if the assembly is strong named and from the same source, the assembly hash will be different. – leppie Aug 21 '12 at 07:35
  • @leppie Yeah, I know that. But why check using hashes when you can use strong names? – Matías Fidemraizer Aug 21 '12 at 07:40
  • Are you suggesting using a different private key for every version/compile/build? That would be horribly painful IMO. – leppie Aug 21 '12 at 08:15
  • @leppie No, I'm suggesting using the same private key, but change assembly version. Maybe I'm wrong, but it seems that could be the right way of solving this issue. – Matías Fidemraizer Aug 21 '12 at 08:33
  • And what happens if 2 distinct teams use the same version number and same common private key? – leppie Aug 21 '12 at 08:43
  • @leppie No computer tool will replace human organization and knowledge. You know that. Tell me if it's reasonable to have the same version number for the same assembly across different trunks or branches or whatever. If 2 distinct teams use the same common private key, the "1.0" version of "SomeAssembly" COULDN'T BE THE SAME if they are developed from different code bases! Then the problem is a wrong versioning scheme. – Matías Fidemraizer Aug 21 '12 at 08:49
  • Didn't .NET 4.5 do exactly what you described? An in-place upgrade is not as far-fetched as you think. It counts for a lot in keeping APIs binary compatible. – leppie Aug 21 '12 at 09:45
1
  • Are those libraries yours? I mean, is it your code?
  • Are they installed by an installer of yours, or independently, and you just inspect them?

If you can in any way supervise their initial installation on the target machine, you can do some poor man's watermarking with plain old DLL resources.

Attach a binary resource with your own contents to each version of the DLL installed, and then inspect it later. It is much as if you embedded a public static class Something { public static readonly SomeData MyImportantInformation = ...; } in each codebase and read it at runtime, or as if you used [Attributes] carrying the data on some classes and read them through reflection - but using binary resources has two tiny advantages:

  • you can add/remove the resources from a DLL after it has been built (a bit like with the ILMerge tool)
  • you can read the resources from native code just as easily as from managed code, and to read them you can load the DLL in a very limited, resource-saving way

Mind that I mean 'the low-level resources', such as the Manifest (which usually sits in resource slot #0) or .exe/.dll icons.

On binary resources:

http://www.codeproject.com/Articles/4221/Adding-and-extracting-binary-resources

And on managed embedded resources, which are easier to use:

http://keithelder.net/2007/12/14/how-to-load-an-embedded-resource-from-a-dll/ https://stackoverflow.com/a/7978410/717732
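A minimal sketch of reading such a managed embedded resource back at runtime (the assembly path and resource name are hypothetical):

```csharp
using System;
using System.IO;
using System.Reflection;

class WatermarkReader
{
    static string ReadWatermark(string dllPath)
    {
        Assembly asm = Assembly.LoadFrom(dllPath);
        // Returns null if the assembly carries no such resource.
        using (Stream s = asm.GetManifestResourceStream("MyLibrary.watermark.txt"))
        {
            if (s == null) return null;
            using (var reader = new StreamReader(s))
                return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Console.WriteLine(ReadWatermark(@"C:\libs\MyLibrary.dll") ?? "(no watermark)");
    }
}
```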

You can add the resource adding/modifying step to your build scripts, to be sure that each published version has different/correct information added. Of course, if you control the build process, then you may just as well fire up the aforementioned ILMerge to put anything into any DLL. While most of that would work, I think in general it is overkill, and if done improperly it will break any security signatures, since it modifies the DLL after it has been signed. It has to be done before signing.

If you control the build process, you can just put the necessary versioning information in the code as class-static data, or simply as attributes at assembly level, or (...)
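For example, something along these lines (the version string is hypothetical; in practice the attribute line would live in the library's AssemblyInfo.cs and the reader in your inspection tool):

```csharp
using System;
using System.Reflection;

// In the DLL being built: stamp versioning info at assembly level.
[assembly: AssemblyInformationalVersion("1.4.2-build.2012-08-21")]

class VersionReader
{
    static void Main()
    {
        // In the inspecting tool: read the attribute back via reflection.
        Assembly asm = Assembly.LoadFrom(@"C:\libs\MyLibrary.dll");
        var attr = (AssemblyInformationalVersionAttribute)Attribute.GetCustomAttribute(
            asm, typeof(AssemblyInformationalVersionAttribute));
        Console.WriteLine(attr != null ? attr.InformationalVersion : "(no version info)");
    }
}
```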

Or why don't you just use version numbers to differentiate the versions? :) I.e. semantic versioning?

On the other hand, if you are working with DLLs that aren't yours and you have no control over their deployment, then you are on tough ground. As others said, compilers may apply many different tricks during compilation, but - please note - they are under both legal and logical constraints regarding what they can do to the compiled code.

Example of "logic" constraints:
- they may change the instructions, but may not change the overall meaning and (side)effects - they may change both the code and data layout/structure, but not in a way that would change the algorithms to handle them etc

Example of "legal" constraints:
- they are not allowed to remove any public symbol (public = visible by other code modules, that is, in .Net that covers: public and protected, and sometimes even internal and private) - they are not allowed to change the name of any public symbol - they are not allowed to change the signature of any public symbol etc

Now, if you limit yourself to only such information, you can gather/calculate hashes/signatures of the code in a way that has a chance of being compiler- and platform-independent. You will not get a definitive answer as to whether they are the same or not, but you will get a view of how probable it is that they are.

For the most trivial example: load the DLL via reflection and scan all classes for their public and non-public member names. Then either calculate a hash over that string set, or just use the whole string set; it would probably amount to kilobytes at most. If a large change is made to the code, it is almost certain that some fields/methods will have been added or removed. For smaller changes, you may also scan the signatures of the methods: add the parameter lists, parameter types and return types to the pool. A bit more work, and a higher probability of detecting a change.
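A rough sketch of that trivial variant - hashing the sorted set of type and member names. SHA-256 is an arbitrary choice here, and note that compiler-generated member names (e.g. "<>c__...") may still vary between compiler versions, so filtering them out would make this more robust:

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Security.Cryptography;
using System.Text;

class ShapeHash
{
    static string HashOfMemberNames(string dllPath)
    {
        Assembly asm = Assembly.LoadFrom(dllPath);

        // Collect "Type.Member" strings for every member, public and not.
        var names = asm.GetTypes()
            .SelectMany(t => t.GetMembers(BindingFlags.Public | BindingFlags.NonPublic |
                                          BindingFlags.Instance | BindingFlags.Static)
                              .Select(m => t.FullName + "." + m.Name))
            .OrderBy(n => n, StringComparer.Ordinal); // sorted => stable across builds

        byte[] bytes = Encoding.UTF8.GetBytes(string.Join("\n", names));
        using (var sha = SHA256.Create())
            return BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "");
    }

    static void Main()
    {
        Console.WriteLine(HashOfMemberNames(@"C:\libs\MyLibrary.dll"));
    }
}
```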

For non-trivial changes: you may try to scan the IL code of the methods and detect structures in it. Compilers may inline and sometimes remove methods/loops/etc., but the overall structure is preserved: specific blocks of code are executed n times here or there, branches are in their place but maybe with sides swapped, and so on. However, detecting the control structures is not easy, and comparing the code is even harder. For some code it may give you a definitive answer of "exactly the same", but many times you will get "not the same" even when they are. Some keywords on the subject are "duplicate code detector" and "plagiarism detector" - this is how the research on such things started :) See e.g. https://stackoverflow.com/questions/546487/tools-to-identify-code-duplications though I do not know whether the tools mentioned there scan the code or the raw bytes.
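As a starting point for the IL route, here is a sketch that merely extracts the raw IL bytes per method. The metadata tokens embedded in those bytes differ between builds, so a real comparison would have to decode and normalize the instruction stream first; this only shows how to get at the data:

```csharp
using System;
using System.Reflection;

class IlPeek
{
    static void Main()
    {
        Assembly asm = Assembly.LoadFrom(@"C:\libs\MyLibrary.dll");
        foreach (Type t in asm.GetTypes())
        foreach (MethodInfo m in t.GetMethods(BindingFlags.Public | BindingFlags.NonPublic |
                                              BindingFlags.Instance | BindingFlags.Static |
                                              BindingFlags.DeclaredOnly))
        {
            MethodBody body = m.GetMethodBody(); // null for abstract/extern methods
            if (body != null)
                Console.WriteLine("{0}.{1}: {2} IL bytes",
                                  t.FullName, m.Name, body.GetILAsByteArray().Length);
        }
    }
}
```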

quetzalcoatl
0

We did manage to find a way around this: we added a pre-build event which goes through the relevant files (the ones we change, such as .cs files) and computes the hash value of each. Each hash value eventually contributes to a global hash for the DLL. Since we only have a handful of files, the chance of collisions was quite small.

We then add the checksum to the description of the DLL. This allowed us to compile the DLLs on different machines, but since their source was the same, they yielded the same checksum.
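Roughly, the pre-build step computes something like this (a simplified sketch of the idea, not our exact script; the source directory is illustrative):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

class SourceChecksum
{
    static string GlobalHash(string sourceDir)
    {
        using (var md5 = MD5.Create())
        {
            // Enumerate the source files in a stable order so every
            // machine produces the same result.
            var files = Directory.GetFiles(sourceDir, "*.cs", SearchOption.AllDirectories)
                                 .OrderBy(f => f, StringComparer.Ordinal);

            // Concatenate the per-file hashes, then hash the concatenation.
            byte[] combined = files.SelectMany(f => md5.ComputeHash(File.ReadAllBytes(f)))
                                   .ToArray();
            return BitConverter.ToString(md5.ComputeHash(combined)).Replace("-", "");
        }
    }

    static void Main()
    {
        Console.WriteLine(GlobalHash(@"C:\src\MyLibrary"));
    }
}
```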

Thanks for all the answers provided, they were quite helpful.

npinti