Automatic merging in git without conflicts (using word-by-word diff instead of line-by-line)

Question

I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).

git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:

$ git diff --word-diff-regex=. HEAD HEAD~

If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.

In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.

What approach would you choose?

score 4 · Accepted Answer · answered Nov 21 '18 at 23:40

What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.

In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.

If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.

The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.

Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).

(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)

The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)

Automatic merging in git without conflicts (using word-by-word diff instead of line-by-line)

1 Answers1