Regarding how they work, I was wondering about some low-level details:
- What will trigger a merge conflict?
- Is the context also used by the tools in order to apply the patch?
- How do they deal with changes that do not actually modify source code behavior? For example, swapping the positions of two function definitions.
Regarding safety, truth be told, the huge Linux kernel repository is a testament to their safety. But I am wondering about the following points:
- Are there any caveats/limitations regarding the tools that the user should be aware of?
- Have the algorithms been proven not to generate wrong results?
- If not, are there implementations/papers proposing integration testing that at least shows them to be error-free empirically? Something like the content of these papers by Brian Korver and James Coplien.
- Again, the Linux repository should suffice regarding the previous point, but I was wondering about something more generic. Source code, even when changed, will not change much (especially because of the algorithm implemented and the syntax restrictions of the language), but can the safety be generalized to arbitrary text files?
Edit
OK people, I'm editing since the question was vague and the answers are not addressing the details.
Git/diff/patch details
The unified diff format, which Git seems to use by default, basically outputs three things: the change, the context surrounding the change, and line numbers pertinent to the context. Each one of these things may or may not have been changed concurrently, so Git basically has to deal with 8 possible cases.
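To make the format concrete, here is a small sketch using Python's `difflib.unified_diff` (the file names and contents are made up for illustration); it shows the three pieces mentioned above: the hunk header with line numbers, the unchanged context lines (prefixed with a space), and the change itself (`-`/`+` lines):

```python
import difflib

old = ["def greet():\n", "    print('hello')\n", "    return\n"]
new = ["def greet():\n", "    print('hello, world')\n", "    return\n"]

# unified_diff emits "---"/"+++" file headers, a hunk header like
# "@@ -1,3 +1,3 @@" (line numbers + lengths), context lines (' '),
# removals ('-'), and additions ('+')
for line in difflib.unified_diff(old, new,
                                 fromfile="a/greet.py",
                                 tofile="b/greet.py"):
    print(line, end="")
```

The `-1,3 +1,3` part is exactly the "line numbers pertinent to the context": the hunk starts at line 1 and spans 3 lines in both the old and the new file.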
For example, if lines have been added or removed before the context, line numbers will be different; but if the context and the changes are still the same, then diff could use the context itself to align the texts and apply the patch (I do not know if this indeed happens). Now, what would happen on the other cases? I would like to know details of how Git decides to apply changes automatically and when it decides to issue an error and let the user resolve the conflict.
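To illustrate the "use the context itself to align the texts" idea, here is a hypothetical sketch (the function and parameter names are mine, not Git's): it searches outward from the hinted line number for an exact match of the context plus the to-be-removed lines, roughly the way `patch(1)` slides a hunk when line numbers have drifted, and raises an error when the context cannot be found, which is the point where a real tool would report a conflict:

```python
def apply_hunk(lines, context_before, removed, added, hint):
    # The hunk matches if we can find context_before followed by the
    # removed lines somewhere in the file.
    target = context_before + removed
    # Try offsets 0, -1, +1, -2, +2, ... from the hinted position.
    for delta in sorted(range(-len(lines), len(lines) + 1), key=abs):
        pos = hint + delta
        if pos < 0 or pos + len(target) > len(lines):
            continue
        if lines[pos:pos + len(target)] == target:
            # Keep the context, drop the removed lines, splice in the added ones.
            return (lines[:pos] + context_before + added
                    + lines[pos + len(target):])
    raise ValueError("hunk failed to apply: context not found")
```

In this sketch, a hunk whose line numbers are stale but whose context is intact still applies cleanly (just at a shifted offset), while a hunk whose context has itself been edited fails, mirroring the automatic-apply vs. conflict distinction asked about above.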
Reliability
I'm pretty sure Git is fully reliable, since it does keep the full history of commits and can traverse it. What I would like are some pointers to academic research and references on this, if they exist.
Still kinda related to this subject: we know that Git/diff treat files as generic text files and work on lines. Furthermore, the LCS algorithm employed by diff generates a patch that tries to minimize the number of changes.
So here are some things I would like to know also:
- Why is LCS used instead of other string metric algorithms?
- If LCS is used, why not use modified versions of the metric that do take into account the grammatical aspects of the underlying language?
- If such a metric that takes grammatical aspects into account were used, could it provide benefits? Benefits in this case could be anything, for example a cleaner "blame log".
Again, these could be huge topics and academic articles are welcome.