How do binary files work on git

Question

There is this LaTeX project I'm managing with git, in which I have several branches, and I use master as the branch where I collect all the changes (at the end of the project it will be the final release). Sometimes, when I compile my project on a branch to get the pdf and then merge that branch into master, I get a merge conflict (between master's version of the pdf and the branch's version of the pdf). Other times, both versions merge seamlessly. What am I doing that causes one outcome or the other? How do I ensure that both versions merge without conflicts?

Binary files won't merge, and often (when it is the result of building) should not be stored in source control. — crashmstr, May 17 '17 at 16:51
But why are you versioning the PDFs? On a LaTeX project, I would guess only the source files (tex files plus any other files required like images and stuff) would be versioned. Then PDFs would be produced from the source files by processing them with latex, no need to version the PDF files. And just in case, I don't see how binary files would be merged correctly, unless they are not changed on one of the diverged branches. — eftshift0, May 17 '17 at 16:51
If you compile your latex project to produce the pdf you should not commit that pdf to the repository. Git does not make any attempt at merging a file it believes is binary. If it believes your pdf is text and "successfully" merges it my guess is that you will have a corrupt pdf afterwards. — Lasse V. Karlsen, May 17 '17 at 16:57
@crashmstr Edmundo I often have to show how my work has progressed and it seemed a convenient way of accessing my compiled pdf without relying on the other person having their latex configured (packages, etc) _or having latex at all_. — mrbolichi, May 17 '17 at 16:59
@LasseV.Karlsen nope. Not even once have I corrupted a pdf that way — mrbolichi, May 17 '17 at 17:00
Well, git does not in any way merge binary files. If you have a binary file that is *changed in both branches* you will get a merge conflict. However, if the file has changed in only one branch, that change is the winner of the merge for that file. Could that be it? You only change the file in one branch some of the time? — Lasse V. Karlsen, May 17 '17 at 17:01
@LasseV.Karlsen not sure. This is what I see in the fast-forward merge output when there is no merge conflict: `foo/bar.pdf | Bin 928860 -> 929962 bytes` — mrbolichi, May 17 '17 at 17:09
I can understand the "convenience" (even if it's going against how git should be used) but for your flow to make sense, if you decide to add your PDFs to the revisions, you _should_ be generating them every time there's a new revision.... especially when merging (because PDFs will explode and only by regenerating them from the merged source files will you get the "correct" pdf content). — eftshift0, May 17 '17 at 17:11

Roland Smith · Answer 1 · 2017-05-17T17:36:34.583

score 5

It is generally considered good practice that anything that can be built from sources is not put under revision control. That is, it should be listed in a .gitignore file.

There are several reasons for this:

It generates a lot of extra data (that can easily be reproduced) to store in the repo.
You might get merge conflicts on binary files, as you have discovered. Binaries usually cannot be merged in a meaningful way. You can, however, choose one of them to replace the other: see the ours and theirs options (git merge -X ours or -X theirs for the whole merge, or git checkout --ours / --theirs on a single conflicted file).
If the sources are also merged, you'd have to create a new binary afterwards anyway. Otherwise the binary is inconsistent with the source.

For LaTeX repositories, my .gitignore contains at least:

*.aux
*.bbl
*.blg
*.fdb_latexmk
*.fls
*.idx
*.ilg
*.ind
*.lof
*.log
*.lot
*.out
*.toc

(I'm using latexmk for building LaTeX documents.)
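One detail worth illustrating: adding a pattern to .gitignore does not stop Git from tracking a file that has already been committed. The sketch below assumes a throwaway repository and placeholder file names (main.tex, main.pdf, and the fake PDF contents are all made up for the example) and shows one way to untrack a built PDF while leaving the copy on disk.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

# A source file and a PDF that was committed earlier (placeholder contents).
printf '\\documentclass{article}\n' > main.tex
printf 'fake-pdf\000v1\n' > main.pdf
git add main.tex main.pdf
git commit -q -m "Add source and built PDF"

# Ignoring the pattern alone does not untrack an already-committed file.
echo '*.pdf' >> .gitignore
# Remove it from the index only; the working-tree copy stays where it is.
git rm -q --cached main.pdf
git add .gitignore
git commit -q -m "Stop tracking the built PDF"

git ls-files    # lists the files Git still tracks
```

After this, rebuilding main.pdf no longer shows up in git status, so it can no longer conflict in a merge.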

edited May 17 '17 at 17:36

answered May 17 '17 at 17:11


For comparison, one of mine has `.depend *.ans *.aux *.bbl *.blg *.dvi *.lof *.log *.lol *.lot *.out *.pdf *.toc` The `.depend` file is something I build, and the `.ans` file is for answers to selected exercises (from my own exercises macros). – torek May 17 '17 at 17:35
Thanks. I'm using a `.gitignore` myself to prevent those aux files to be sourced. In fact, I commented out `*.pdf` from it, due to the convenience of having the pdf available online – mrbolichi May 17 '17 at 17:41

score 5 · Accepted Answer · edited May 23 '17 at 12:10

As crashmstr says in a comment, binary files won't merge at all. However, there's something you should understand about git merge: it doesn't always merge files. In fact, it never really merges files, except as a side effect. What it merges, sometimes but not always, is commits, and only some of those commit merges require it to merge individual files.

As everyone else has also said so far in comments, "compiled" files (outputs of programs that work on the files that you do want to manage with a version-control system; the modern term for these seems to be build artifact, though artifact has a more general definition) generally shouldn't be committed in Git.

What `git merge branch` does

When you run git merge, you:

are sitting on some commit, usually the tip of a branch (via git checkout branch-name): this commit is the one named by HEAD (try git rev-parse HEAD to see the hash ID, and git symbolic-ref HEAD to see how Git finds your current branch name from HEAD);
supply the name of another branch, or any other identifier that resolves to another commit (try git rev-parse branch-name to see how this works).
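To make the two commands mentioned above concrete, here is a small sketch in a scratch repository. The repository, file name, and variable names are assumptions made for the example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

echo 'first draft' > main.tex
git add main.tex
git commit -q -m "B"
git checkout -q -b branch1

head_hash=$(git rev-parse HEAD)        # hash ID of the commit HEAD names
head_ref=$(git symbolic-ref HEAD)      # the branch ref HEAD is attached to
branch_hash=$(git rev-parse branch1)   # a branch name resolves to a commit hash

echo "$head_ref -> $head_hash"
```

Here head_ref is refs/heads/branch1, and head_hash and branch_hash are the same hash, because HEAD reaches the commit through the branch name.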

The merge command then runs a merge strategy (-s recursive by default when this was written; Git 2.34 and later default to -s ort, which finds the merge base the same way). There are some special strategies that do different things, but the default one takes your two commit hashes and grubs through the commit graph, also called the DAG for Directed Acyclic Graph, to find the merge base. You can view this graph with git log --graph or git log --all --decorate --oneline --graph, for which "A DOG" is a useful mnemonic, to remember the All Decorate Oneline Graph options. The merge base is, roughly speaking, "where the two lines in the graph, starting from your HEAD and other commits, first come together again."

We can draw this graph ourselves in a way that looks better on StackOverflow (actually there are lots of ways to draw it):

       C--D--E   <-- branch1
      /
...--B
      \
       F--G--H   <-- branch2

where each uppercase letter represents a commit. Here, the two tips of the two branches are commits E and H, and their merge base is commit B.
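You can ask Git for the merge base directly with git merge-base. The sketch below rebuilds the graph drawn above in a scratch repository; the commit helper function and the one-file-per-commit layout are assumptions made for the example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

# Hypothetical helper: one new file per commit, named after the commit letter.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit B
base=$(git rev-parse HEAD)

git checkout -q -b branch1
commit C; commit D; commit E

git checkout -q -b branch2 "$base"     # start branch2 at commit B
commit F; commit G; commit H

merge_base=$(git merge-base branch1 branch2)
git log --all --decorate --oneline --graph
```

The hash in merge_base is the hash of commit B, the point where the two lines of development come back together.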

To merge (as a verb) commits E and H, Git essentially runs git diff B E (to see what changed in branch1 since the base commit) and then a second git diff B H (to see what changed in branch2). If the two diffs touch different files, the merge result is easy: we take each changed file from whichever line changed it, and all the unchanged files from the base B, and pile them together.
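This easy case is what a "seamless" merge of a PDF looks like. In the sketch below (scratch repository, made-up file names, and fake PDF bytes are all assumptions) only branch2 touches bar.pdf, so a true merge simply takes branch2's copy.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

printf 'fake-pdf\000base\n' > bar.pdf        # the NUL byte makes Git treat it as binary
echo 'text' > main.tex
git add bar.pdf main.tex
git commit -q -m "B"
git checkout -q -b branch1
git branch branch2

# branch1 changes only the source; its bar.pdf still matches the base.
echo 'more text' >> main.tex
git commit -q -am "E: edit source"

# branch2 changes only the PDF.
git checkout -q branch2
printf 'fake-pdf\000rebuilt on branch2\n' > bar.pdf
git commit -q -am "H: rebuild PDF"

git checkout -q branch1
git merge -q -m "I: merge branch2" branch2   # a true merge, no conflict

merged_pdf=$(git rev-parse HEAD:bar.pdf)
branch2_pdf=$(git rev-parse branch2:bar.pdf)
```

The two blob hashes match: Git did not combine anything inside the PDF, it just picked the only version that differed from the base. Note that the resulting PDF was built from branch2's sources, not from the merged ones.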

If E and H both have changes to one particular file, though, then git merge must combine (merge) those changes to that file. If the file is binary, Git will—at least by default—immediately give up and declare a conflict. This would be the case for your PDF file: if it's different in both E and H, vs B, Git will declare a merge conflict and make you resolve the file.
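The conflicting case can be reproduced in a scratch repository. The sketch below uses made-up file contents and variable names; it changes bar.pdf on both branches, lets the merge stop, and then resolves the conflict by taking one side whole.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

printf 'fake-pdf\000base\n' > bar.pdf        # the NUL byte makes Git treat it as binary
git add bar.pdf
git commit -q -m "B"
git checkout -q -b branch1
git branch branch2

printf 'fake-pdf\000built on branch1\n' > bar.pdf
git commit -q -am "E: rebuild PDF on branch1"

git checkout -q branch2
printf 'fake-pdf\000built on branch2\n' > bar.pdf
git commit -q -am "H: rebuild PDF on branch2"

git checkout -q branch1
if git merge -m "I" branch2 >/dev/null 2>&1; then
    merge_result=clean
else
    merge_result=conflict                    # expected: both sides changed bar.pdf
fi
conflicted=$(git diff --name-only --diff-filter=U)
echo "$merge_result: $conflicted"

# Resolve by taking one side whole; here, the version from branch2 ("theirs").
git checkout --theirs -- bar.pdf
git add bar.pdf
git commit -q -m "I: merge branch2, keep its bar.pdf"
```

For a LaTeX project, the more faithful resolution is usually to rebuild the PDF from the merged sources and git add that file instead of picking either side.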

In any case, once all conflicts are resolved, git merge normally makes a new merge commit. This is a merge: merge as a noun. A merge commit is a commit with two parents, which we can draw as:

       C--D--E
      /       \
...--B         I
      \       /
       F--G--H

Note that I have left off the branch names this time. The new commit I is the same (in terms of committed files), regardless of which branch name we move to point to it. The branch name that moves, though, is the one we were on when we ran git merge. Hence if we were on branch1, the result is:

       C--D--E
      /       \
...--B         I   <-- branch1
      \       /
       F--G--H   <-- branch2

but if we were on branch2, the result is:

       C--D--E   <-- branch1
      /       \
...--B         I   <-- branch2
      \       /
       F--G--H

In other words, the new commit gets made in the usual way: whatever branch we're on now, that branch name is changed so that it points to the new commit. The new commit itself—commit I, in our case—points back to the previous commit, and for a merge commit, also points back to the other commit as well.

As a subtle but important point, the first parent of the new commit is the one that was the HEAD commit at the time. So while the contents of merge I don't depend on which branch we were on, the first parent does. If we use git log --first-parent, later, we'll follow only the first parent when looking at the history of commits. Since that's the branch we were on, that means we'll go back to either E or to H as appropriate.
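The parent order can be inspected with the ^1 and ^2 suffixes. This sketch (scratch repository; the commit helper and variable names are assumptions) makes merge I while on branch1 and then looks at its parents.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

# Hypothetical helper: one new file per commit, named after the commit letter.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit B
git checkout -q -b branch1
git branch branch2
commit E
e=$(git rev-parse HEAD)

git checkout -q branch2
commit H
h=$(git rev-parse HEAD)

git checkout -q branch1                      # we are on branch1 when we merge
git merge -q -m "I" branch2

first_parent=$(git rev-parse 'HEAD^1')       # the commit that was HEAD: E
second_parent=$(git rev-parse 'HEAD^2')      # the commit we merged in: H
history=$(git log --first-parent --format=%s | paste -sd ' ' -)
echo "$history"
```

With --first-parent the log walks I, E, B and never visits H, because H is only reachable through the second parent.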

When `git merge` doesn't merge

The drawings above deliberately cover only one of four possible cases.

Suppose that instead of:

       C   <-- branch1
      /
...--B
      \
       D   <-- branch2

or the like, we have:

       C   <-- branch1 (HEAD)
      /
...--B    <-- branch2

Now the merge base commit B is the tip commit of branch2. We're on branch1—that's why it's marked (HEAD)—but there is nothing from branch2 to merge. In this case, git merge says "already up to date" and does nothing.

Or, suppose we have this instead:

       C   <-- branch2
      /
...--B    <-- branch1 (HEAD)

In this case, the merge base of branch1 and branch2 is commit B, again, but branch2 is ahead of branch1. Git can, and by default will, skip the merge and do what it calls a fast-forward instead. It will change the name branch1 so that it points directly to commit C, and check out commit C, giving:

       C   <-- branch2, branch1 (HEAD)
      /
...--B

This "fast forward merge" (which is not a merge at all) happens very often when you are sharing an "upstream" repository (such as one on GitHub) with others who also work and push there. If one of you does some work and pushes, and the other has made no new commits and does a fetch-and-merge, Git sees that the new commits obtained from upstream are "fast-forward-able" and does this instead of doing a true merge.

You can defeat this with git merge --no-ff. Some workflows call for that.
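A sketch comparing the two behaviours, under the assumption of a scratch repository with made-up branch names (ff-demo and noff-demo are scratch branches for the example). It first checks whether a fast-forward is possible, then merges once with --ff-only and once with --no-ff.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

echo B > file.txt
git add file.txt
git commit -q -m "B"
git checkout -q -b branch1
git checkout -q -b branch2
echo C >> file.txt
git commit -q -am "C"

# A fast-forward is possible when our tip is an ancestor of the other tip.
git checkout -q branch1
if git merge-base --is-ancestor HEAD branch2; then can_ff=yes; else can_ff=no; fi

# Fast-forward: the branch name just moves to commit C, no new commit is made.
git checkout -q -b ff-demo branch1
git merge -q --ff-only branch2
ff_tip=$(git rev-parse HEAD)

# --no-ff: Git makes a real merge commit with two parents.
git checkout -q -b noff-demo branch1
git merge -q --no-ff -m "Merge branch2" branch2
git rev-list --parents -n 1 HEAD             # merge hash followed by its two parent hashes
```

Using --ff-only here is a choice for the demo: it makes Git refuse rather than create a merge commit if a fast-forward turns out not to be possible.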

There is one last possible case, but it's pretty rare: there may be no merge base at all. This happens if you combine two separate repositories, or use git checkout --orphan to start a new independent commit sub-graph. Here we might draw the entire graph as:

A--B--...--G--H   <-- branch1 (HEAD)

I--J--...--O--P   <-- branch2

If you ask Git to merge commits H and P, the result depends on your Git version. Older versions of Git try to merge these two graphs using Git's semi-secret empty tree as a base tree, which may or may not work depending on the contents of H and P. Since Git version 2.9.0, however, Git has started rejecting these by default, requiring --allow-unrelated-histories. (If you supply that flag, the merge goes ahead as before, using the empty tree as the base.)
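The no-merge-base case can be tried out as follows. The sketch assumes Git 2.9.0 or later (so that the default refusal applies) and uses a scratch repository with made-up file names.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"       # placeholder identity for the demo
git config user.email "author@example.com"

echo one > a.txt
git add a.txt
git commit -q -m "H"
git checkout -q -b branch1

# Start a second root commit that shares no history with branch1.
git checkout -q --orphan branch2
git rm -rf -q .                              # clear the index and tree carried over from branch1
echo two > b.txt
git add b.txt
git commit -q -m "P"

git checkout -q branch1
if git merge-base branch1 branch2 >/dev/null; then has_base=yes; else has_base=no; fi

# Without the flag, Git 2.9.0+ refuses to merge unrelated histories.
if git merge -m "Merge branch2" branch2 >/dev/null 2>&1; then refused=no; else refused=yes; fi

# With the flag, the merge proceeds using the empty tree as the base.
git merge -q --allow-unrelated-histories -m "Merge unrelated branch2" branch2
ls
```

Because the two histories add different files, the flagged merge succeeds and the working tree ends up with both a.txt and b.txt.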

Wow, this is a very comprehensive answer. So, if I understood you correctly, given that this is what I see in the merge regarding to the `pdf` file: `foo/bar.pdf | Bin 928860 -> 929962 bytes`, then what happened was simply that git replaced the "old" pdf with the "new" pdf, and not actually merged anything (a fast-forward)? — mrbolichi, May 17 '17 at 17:55
Yes: either that was a fast-forward merge (in which case all files just update to the new commit as if by `git checkout`), or it was a true merge but the base version (of `foo/bar.pdf`) matched one tip version, so Git extracted the other tip version to put in the new merge commit. — torek, May 17 '17 at 17:56
