What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.
In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.
If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.
The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.
Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record
, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).
(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch
—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record
should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)
The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)