
I would like to know whether there is a proper term for "diffing" / obtaining the delta between multiple files or data structures, such that the resulting "diff" first describes the parts common to all files/structures and then describes how this common "base" file/structure must be modified to obtain each individual one, ideally in a hierarchical fashion if some files/structures are more similar to each other than others.
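To make this concrete, here is a rough sketch of the kind of result I'm after, limited to flat JSON objects (Python, with made-up helper names; as far as I know this doesn't correspond to any existing library):

```python
import json

def common_base(docs):
    # Keep only the key/value pairs shared by every (flat) JSON object.
    # A real solution would have to recurse into nested objects and arrays.
    shared = dict(docs[0])
    for doc in docs[1:]:
        shared = {k: v for k, v in shared.items() if doc.get(k) == v}
    return shared

def delta(base, doc):
    # Describe how to turn `base` into the individual document `doc`.
    return {
        "set": {k: v for k, v in doc.items() if base.get(k) != v},
        "unset": [k for k in base if k not in doc],
    }

docs = [json.loads(s) for s in (
    '{"name": "a", "color": "red", "size": 1}',
    '{"name": "b", "color": "red", "size": 1}',
    '{"name": "c", "color": "red", "size": 2}',
)]
base = common_base(docs)
print(base)                           # {'color': 'red'}
print([delta(base, d) for d in docs]) # one "set"/"unset" patch per document
```

The hierarchical part would then group documents whose deltas are most similar (here the first two), but that grouping step is exactly what I don't know the name for.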

There are some questions and answers about how to do this with certain tools (e.g. "DIFF utility works for 2 files. How to compare more than 2 files at a time?"), but as I want to do this for a specific type of data structure (namely JSON), I'm at a loss as to what I should even search for.

This type of problem seems to me like it should be common enough to have a name such as "hierarchical diff" (which, however, seems to be reserved for 2-way diffs on hierarchical data structures), "commonality finding", or something like that.

I guess a related concept about hierarchical ordering of commonalities and differences is formal concept analysis, but this operates on sets of properties rather than hierarchical data structures and won't help me much.

  • What I found out so far: according to [this](http://eelco.lempsink.nl/thesis.pdf), the sub-problem at the center of this is called *Maximum Common Embedded Subtree* and defined generally for any number of trees, but most publications then limit the number of trees to 2. According to [this](https://search.ieice.org/bin/summary.php?id=e75-d_1_95), "the problem is NP-hard if the number of input trees is more than two", which perhaps explains the total lack of libraries I've found so far. [This](https://journals.openedition.org/msh/pdf/11751) describes an approximate polynomial-time algorithm. – smheidrich Aug 06 '19 at 20:49

1 Answer


There are several valid names for this:

  • Data comparison (or Sequence comparison)
  • Delta encoding (see the sketch after this list)
  • Delta compression (or Differential compression)
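To give a feel for what delta encoding means, here is a toy sketch in Python (not tied to any particular library): store one value in full, then only the differences needed to reconstruct the rest.

```python
def delta_encode(values):
    # Keep the first value, then only the difference to the previous value.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # Rebuild the original sequence by re-applying the differences.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

print(delta_encode([10, 12, 12, 15]))                # [10, 2, 0, 3]
print(delta_decode(delta_encode([10, 12, 12, 15])))  # [10, 12, 12, 15]
```

The same idea carries over to structured data: store one version in full and describe the others as patches against it.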

Algorithms:

Good Wikipedia links:

Some implementations:

  • Thanks, but none of these work with more than two inputs *and* compare JSON instead of files. There are some that work with more than two inputs (e.g. Diff3, although as the name suggests, it's limited to 3 inputs at most) and some that compare JSON (e.g. jsondifflib and jsondiffpatch), but none do both. The longest common subsequence problem is, I think, a special case of the maximum common embedded subtree problem I already mentioned. I guess what I'm looking for just doesn't have a handy name and isn't as common a problem as I thought it would be. – smheidrich Aug 20 '19 at 06:56
  • These algorithms work for 2 files but also for N files. I just looked in the [Diffuse](http://diffuse.sourceforge.net/) source code, and Derrick Moser (the programmer behind the tool) uses a variant of the algorithm used by [difflib](https://docs.python.org/2/library/difflib.html) (longest common subsequence). And by the way, Diffuse works with JSON files. – Ben Souchet Aug 20 '19 at 07:51
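As a rough illustration of the pairwise approach described in the comment above, here is a small Python sketch using difflib (its SequenceMatcher is not a strict LCS implementation, and reducing N inputs pairwise only approximates a true N-way common part, but it shows the idea):

```python
import difflib

def common_lines(a, b):
    # Elements shared by a and b, in order, via difflib's SequenceMatcher.
    matcher = difflib.SequenceMatcher(None, a, b)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared

def common_base(sequences):
    # Pairwise reduction: repeatedly intersect the running base with the next input.
    base = sequences[0]
    for seq in sequences[1:]:
        base = common_lines(base, seq)
    return base

lines_per_file = [
    ["a", "b", "c", "d"],
    ["a", "c", "d", "e"],
    ["a", "x", "c", "d"],
]
print(common_base(lines_per_file))  # ['a', 'c', 'd']
```

A JSON-aware version would apply the same reduction to the trees themselves rather than to lines, which is where the maximum common embedded subtree problem from the question's comments comes in.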