
I have a git repository containing a bunch of PDF files. After running OCR on some of them, I ran `git add .` to stage the changes. After that, `git status` looks like this:

#       modified:   Ackerman, Daniel J., 1971 ST.pdf
#       modified:   Ackerman, Laura C., 2006 SD.pdf
#       modified:   Adolphson, Donald G., 1956 ST.pdf
#       renamed:    Baugh, Gerald R., 1956 ST.pdf -> Alkofer, Anton R., 1958 ST.pdf
#       modified:   Amundsen, Julie, 2012 ST.pdf
#       modified:   Babiracki, Dylan, 2015.pdf
#       renamed:    Wangerud, Kenneth W., 1973 ST.pdf -> Bailey, Palmer K., 1970 ST.pdf
#       modified:   Bakken, Wallace E., 1958 ST.pdf
#       modified:   Baugh, Gerald R., 1956 ST.pdf
#       modified:   Bednar, Jesse E., 1959 ST.pdf
#       modified:   Belanus, Luke 2016.pdf
#       modified:   Berg, Larry D., 1960 ST.pdf
#       modified:   Blanksma, Derrick J., 2011 SD.pdf
#       modified:   Blum, Raymond L., 1957 ST.pdf
#       renamed:    Overmoe, Terry H., 1956 ST.pdf -> Bonneville, John W., 1956 ST.pdf
#       modified:   Bonneville, John W., 1961.pdf
#       modified:   Brouillard, Lee A., 1977 ST.pdf
#       modified:   Brown, Ronald G., 1968 ST.pdf
#       modified:   Burrows, Robert A., 1995 ST.pdf
#       modified:   Bushaw, Dewey J., 1957 ST.pdf
#       modified:   Carns, Matthew, 2010 SD.pdf
#       modified:   Christensen, Robert E., 1958 ST.pdf
#       modified:   Christenson, Chase J., 2008.pdf
#       renamed:    Traynor, Terrance O. 1977 ST.pdf -> Clayton, Lee, 1960.pdf
#       modified:   Cook, Charles W., 1968 ST.pdf
#       modified:   Crowell, Anna M., 2011 ST.pdf
#       modified:   Davidson, Jerry, NA, ST.pdf
#       modified:   DeYaegher, Wilfred M., 1955 ST.pdf
#       modified:   Decker, Amy, 2005 SD.pdf
#       modified:   Degenstein, Joel A., 1975 ST.pdf
#       modified:   Dove. Andrea, 2014 ST.pdf
#       modified:   Elofson, Richard R., NA, ST.pdf
#       renamed:    Hoeft, Erin, 2014 ST.pdf -> Englerth, E. J., 1958 ST.pdf
#       modified:   Erickson, Kirth A., 1967 ST.pdf
#       modified:   Facca, Fosco V., 1970 ST.pdf
#       renamed:    Thomte, Dennis, NA, ST.pdf -> Flewitt, William E., 1957 ST.pdf
#       renamed:    Saunders, Gary, 1960 ST.pdf -> Forsgren, Frank M., 1980 ST.pdf
#       renamed:    Clayton, Lee, 1960.pdf -> Friestad, Harlan K., 1966 ST.pdf
#       modified:   Friestad, Mark B., 1970 ST.pdf
#       modified:   Friesz, Jacob; Bryantt, Tanner; Hanson, Luke; Delaney, Emily , 2014 SD.pdf
#       renamed:    Koons, Robert R., 1957.pdf -> Froelich, Larry L.,1964.pdf
#       renamed:    Halle, Richard, 1972 ST.pdf -> Galambos, William E., 1958 ST.pdf
#       renamed:    Huot, Ray E., NA ST.pdf -> Garske, Jay, 1957 ST.pdf
#       renamed:    Walsh, Michael W., 1956 ST.pdf -> Gillin, Donald S., 1958 ST.pdf
#       modified:   Gorecki, Charles 2007 SD.pdf
#       modified:   Gray, Lockhart R., 1958 ST.pdf
#       renamed:    Berg, Larry D., 1960 ST.pdf -> Groenewold, Joanne R., 1971 ST.pdf
#       modified:   Gunderson, Lori, 1998 SD.pdf
#       modified:   Halle, Richard, 1972 ST.pdf
#       modified:   Hannesson, James H., 1957 ST.pdf
#       modified:   Hartig, Caitlyn M., 2015 ST.pdf
#       modified:   Harvey, Erik W., 1991 ST.pdf
#       modified:   Hegle, Lloyd 2005.pdf
#       modified:   Hendrickson, Richard D., 1956 ST.pdf
#       modified:   Hesse, Damien; Krieger, Amanda; Padgett, Alex; Zander, Derek, 2012 SD.pdf
#       modified:   Hoeft, Erin, 2014 ST.pdf
#       modified:   Holweger, Todd L., 1995 ST.pdf
#       modified:   Hrabik, Jon, 2008 SD.pdf
#       modified:   Huot, Ray E., NA ST.pdf
#       modified:   Ignatius, Ashley, 2008 ST.pdf
#       modified:   Jahraus, Tim, NA, ST.pdf
#       modified:   Jeannotte, Tyson, 2015 ST.pdf
#       renamed:    Redmond, John C., 1955.pdf -> Jergens, Matthew, 2005 SD.pdf
#       modified:   Johnson, Corey 2009 SD.pdf
#       modified:   Johnson, Irwin S., 1957 ST.pdf
#       modified:   Jurgens, Matthew, 2005 SD.pdf
#       modified:   Klapperich, Ryan, 2004 ST.pdf
#       modified:   Klaudt, Elmer J.,1956 ST.pdf
#       modified:   Klosterman, Mary J., 1978.pdf
#       modified:   Knutson, Sean, 2007 SD.pdf
#       modified:   Koons, Robert R., 1957.pdf
#       modified:   Kringstad, Justin J., 2007 SD.pdf
#       modified:   Kume, Jack, 1958 ST.pdf
#       modified:   Lammers, Heather N., 2007 SD.pdf
#       renamed:    Ackerman, Daniel J., 1971 ST.pdf -> Lassila, Pentti, 1968 ST.pdf
#       modified:   Lindberg, Connor; Putkonen, Jaakko, 2015.pdf
#       renamed:    Brouillard, Lee A., 1977 ST.pdf -> Listoe, Bruce K., 1955 ST.pdf
#       renamed:    Blum, Raymond L., 1957 ST.pdf -> Lockrem, Timothy M., 1980 ST.pdf
#       renamed:    Cook, Charles W., 1968 ST.pdf -> Mathison, David J., 1964 ST.pdf
#       modified:   Meldahl, Charles, 1962.pdf
#       modified:   Mikkelson, D.H., 1956 ST.pdf
#       renamed:    Johnson, Irwin S., 1957 ST.pdf -> Moe, Richard B., 1958 ST.pdf
#       renamed:    Olien, Benjamin, 1957 ST.pdf -> Monsebroten, Dale R. 1966.pdf
#       modified:   Murphy, Edward C., 1979 ST.pdf
#       modified:   Myerchin, Paul H., 1994.pdf
#       modified:   Nelson, Kelly, NA, SD.pdf
#       modified:   Nestaval, Jerry E., 1958 ST.pdf
#       renamed:    Englerth, E. J., 1958 ST.pdf -> Norby, Rodney D., 1967 ST.pdf
#       modified:   Olien, Benjamin, 1957 ST.pdf
#       renamed:    Smith, Louis D., 1968.pdf -> Olson, Bruce A., 1974 ST.pdf
#       modified:   Opitz, Emil, 2007 ST.pdf
#       modified:   Overmoe, Terry H., 1956 ST.pdf
#       modified:   Peterson, Robert T., 1958 ST.pdf
#       renamed:    Solheim, Dale, 1957 ST.pdf -> Pilatzke, Richard H., 1976 ST.pdf
#       modified:   Quigley, Micheal L., 1958 ST.pdf
#       modified:   Ramsey, Bruce, 1972 ST.pdf
#       renamed:    DeYaegher, Wilfred M., 1955 ST.pdf -> Randich, Philip G., 1958 ST.pdf
#       renamed:    Lockrem, Timothy M., 1980 ST.pdf -> Rasanen, Ryan; Smrekar, Allison; Jahraus, Paul 2014 SD.pdf
#       modified:   Redmond, John C., 1955.pdf
#       modified:   Reishus, Mark, 1958 ST.pdf
#       modified:   Remple, Gary A., 1987 ST.pdf
#       modified:   Ries, Adam J., 2010 SD.pdf
#       modified:   Roehrich, Robert D., 1957.pdf
#       renamed:    Peterson, Robert T., 1958 ST.pdf -> Ross, James D., NA.pdf
#       modified:   Russell, Ashley, NA, ST.pdf
#       renamed:    Garske, Jay, 1957 ST.pdf -> Salomon, Nena 1974 ST.pdf
#       modified:   Samson, Sherry D., 1995.pdf
#       modified:   Sandven, John E., 2016 ST.pdf
#       modified:   Saunders, Gary, 1960 ST.pdf
#       modified:   Schmit, Craig R., 1970 ST.pdf
#       renamed:    Quigley, Micheal L., 1958 ST.pdf -> Schofeild, R.G., 1957.pdf
#       modified:   Smith, Daniel, 2009 SD.pdf
#       modified:   Smith, Louis D., 1968.pdf
#       modified:   Smith, Louis D., 1970 ST.pdf
#       modified:   Snyder, Jeffrey K., 1992 ST.pdf
#       renamed:    Davidson, Jerry, NA, ST.pdf -> Solheim, Dale, 1957 ST.pdf
#       modified:   Solie, Kevin L., 2008 SD.pdf
#       modified:   Stancel, Steve G., NA.pdf
#       modified:   Thompson, Gary G., 1962 ST.pdf
#       modified:   Thomte, Dennis, NA, ST.pdf
#       modified:   Traynor, Terrance O. 1977 ST.pdf
#       modified:   Trobec, Seth W., 2009 SD.pdf
#       modified:   Walker, Daniel M., 1979 ST.pdf
#       modified:   Walsh, Michael W., 1956 ST.pdf
#       modified:   Wangerud, Kenneth W., 1973 ST.pdf
#       renamed:    Degenstein, Joel A., 1975 ST.pdf -> Waxvik, John N., 1964 ST.pdf
#       modified:   Worden, Anna K., 2007 ST.pdf
#       modified:   Zejdlik, Roger C., 1956 ST.pdf

Why has it decided to "rename" some of these files to totally different file names? Both of the files in any given rename line exist -- for example, there is a file called Degenstein, Joel A., 1975 ST.pdf and another totally different file called Waxvik, John N., 1964 ST.pdf. But for some reason it's decided to rename one as the other.

It doesn't make a difference whether I add the files one at a time or do it all together. What's going on?

I have used git reset to unstage the changes at this point.

Will Martin

3 Answers


Don't read too much into git saying a file was renamed. There is no such thing as a "rename" operation in git; it just tries to determine, after the fact, if the transition from the previously committed tree to the current indexed tree (in the case of git status) likely involved a file being moved/renamed; and if it thinks so, it says "renamed".

Keep in mind that the content as you see it when displaying a PDF is very different from the content as git sees it when processing a PDF. The data in a PDF is usually compressed, so it's not so obvious what text is there. A lot of the "content" from a binary perspective establishes the structure of the document, and that's probably the same for every one of those files.

So git's heuristics are confused. But here's the thing: It doesn't matter. If you look at the actual files, they should each have the correct data in them. Still, I can't blame you if you find the spurious output distracting. For many commands you can exert some control over rename detection behavior; unfortunately, I don't know of a "built in" way to suppress it for status. But there is one idea...

For files of the type git expects to encounter, the rename detection works pretty well. When you store binary files (like PDFs) you defeat a number of git's features, because that's not what it's optimized for. You can make it behave much better by using `git lfs`.

https://git-lfs.github.com/

The main purpose of this is to limit the size of the core repo by moving large binaries (which git can't compress/diff well) into a separate "large file store", from which you only download a version when you need that specific version. (By contrast, a clone of a "regular" repo must copy every version of every file regardless of what you're checking out.)

But here's the cool thing: When you use lfs, core git "thinks" it's just storing these little "lfs pointers" - placeholders that lfs uses to find the real content when necessary. And in my tests, those pointers are always different enough that they are not detected as "renamed" unless a file is literally copied byte for byte.
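To make the pointer idea concrete, here is a minimal sketch of what such a placeholder contains, assuming the three-line pointer layout from the Git LFS specification (a `version` line, an `oid sha256:` line, and a `size` line). The function name `lfs_pointer` and the sample contents are hypothetical and only for illustration; real pointers are written by the `git lfs` tooling, not by code like this.

```python
import hashlib

def lfs_pointer(content: bytes) -> str:
    """Build pointer text in the Git LFS v1 layout for the given content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )

# Two hypothetical stand-ins for different PDF files of the same length.
pointer_a = lfs_pointer(b"%PDF-1.4 hypothetical thesis A")
pointer_b = lfs_pointer(b"%PDF-1.4 hypothetical thesis B")

print(pointer_a)
print(pointer_b)
```

The `oid` line is the longest line in each pointer and differs whenever the underlying files differ at all, which fits the observation above that pointers for distinct files are not paired up as renames.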

Mark Adelsberger

Git is a content tracker, not a file tracker. Git did not rename the files; it is reporting that you renamed them, because their content looks so similar.

Git was designed to track (plaintext) source code files for version control purposes, not changes in binary data like the data that PDFs are encoded in. When tracking changes to binary data, all bets are off, so you can't really blame it.

Matt Messersmith
  • But the content of the files in question is totally, 100% different. For example, the "Baugh, Gerald" file is listed as renamed to the "Alkofer, Anton" file, but both exist separately, have different file names, and their contents are completely different. Here are some md5 check sums. (4d64c0f9402e36fd88d5ada5106a201b Baugh, Gerald R., 1956 ST.pdf; de1884ef318688b916a57c0b1d758449 Alkofer, Anton R., 1958 ST.pdf). How is it detecting these two files as identical when they are not alike in any way? – Will Martin May 16 '17 at 16:34
  • They might not look alike when you view the files in something like Adobe Reader, but their *binary* representation may be similar. Try doing a `diff` (or `git diff`) on the files. The MD5 checksums mean almost nothing in this context. You could have 99% overlap between the files and their MD5 checksums would be completely different. Note that git may not be saying the files are *exactly* the same, but it thinks they are close enough that you have renamed them. – Matt Messersmith May 16 '17 at 16:38
  • It's not 100% different - in fact, you can have git tell you how different it thinks they are (run `git diff`, it will show `similarity index`). This is likely due to aggressive similarity in the binary data in the PDF (embedded fonts, for example), not the content that you "see". – Edward Thomson May 16 '17 at 16:38
  • Hrm. When I run git diff, it does not show any kind of similarity index -- it just shows that all the files "differ". Do I need to be feeding it some kind of extra parameter? My googling has turned up many vague bits of documentation with no examples. Note: I'm on git 1.8.3.1 because that's what RHEL has. – Will Martin May 16 '17 at 17:01
  • It's probably detecting that the files are binary (just like the Linux `diff` tool does), so it doesn't want to show you exactly how they differ. The diff would look like nonsense and be uninterpretable to a human. Maybe use `cmp` or `vbindiff` if you really want to see the differences. If you're an `emacs` guy, you can use that as well. I'd suggest just adding the files and forgetting about it. Git will not overwrite your files even if it thinks there is a rename, so you have nothing to worry about. If you're really really paranoid, copy your data somewhere and forget git altogether. – Matt Messersmith May 16 '17 at 17:44
  • I would modify @mwm314's advice in two ways: (1) if you're really paranoid, it doesn't matter. The previous version is still there in the previous commit if you find that something broke. That's why you're using version control. (2) I would take this as a warning that core git is struggling to deal with the data you've given it, and consider a tool like `lfs` to mitigate the issue. – Mark Adelsberger May 16 '17 at 17:47
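
The point in the comments about MD5 checksums can be checked directly. The sketch below uses two made-up byte strings (hypothetical stand-ins for PDF files, not the real ones) that are identical in roughly 99% of their bytes, and shows that their MD5 digests still have nothing in common. A checksum only answers "are these byte-for-byte identical?", not "how similar are they?".

```python
import hashlib

# Two hypothetical byte strings standing in for PDF files: they share
# a long common prefix and differ only in a short tail.
shared = b"%PDF-1.4 shared structure " * 400
file_a = shared + b"A" * (len(shared) // 99)
file_b = shared + b"B" * (len(shared) // 99)

digest_a = hashlib.md5(file_a).hexdigest()
digest_b = hashlib.md5(file_b).hexdigest()

# Fraction of positions that hold the same byte in both files.
same = sum(x == y for x, y in zip(file_a, file_b))
overlap = same / len(file_a)

print(digest_a)
print(digest_b)
print(f"byte overlap: {overlap:.0%}")
```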

As both other answers noted, the problem is that Git is doing a similarity analysis and guessing that the files might have been modified-and-renamed. This false matching-up is harmless, though somewhat alarming at first.

The full details are quite complex (see my answer to how does git log --follow <filename> work?), but the short version is that git status runs a git diff from commit (HEAD) to index, with rename detection turned on (giving the internal default of "50% similar"). Because PDF files tend to have big repeated binary clumps whose 64-byte chunks will hash to the same slot, the chances that any two PDF files are considered "at least 50% similar" are ... well, "high" is too strong: "not low" would be more accurate. You are hitting 30 out of 128 in the example above, or just over 1 out of 5 files are getting false 50%+ matches.
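
As a rough illustration of that scoring, here is a simplified sketch, not git's actual code. It assumes chunks end at a newline or after 64 bytes, counts the bytes of chunks that appear in both files, and divides by the size of the larger file. Real git hashes the chunks, so hash collisions can also count as matches; this sketch compares chunks exactly. The names `chunks`, `chunk_weights`, and `similarity`, and the sample data, are all made up for the example.

```python
from collections import Counter

def chunks(data: bytes, max_len: int = 64):
    """Split data into pieces ending at a newline or after max_len bytes."""
    start = 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == max_len:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def chunk_weights(data: bytes) -> Counter:
    """Map each distinct chunk to the total number of bytes it accounts for."""
    weights = Counter()
    for piece in chunks(data):
        weights[piece] += len(piece)
    return weights

def similarity(old: bytes, new: bytes) -> float:
    """Bytes shared chunk-for-chunk, divided by the size of the larger file."""
    if not old and not new:
        return 1.0
    a, b = chunk_weights(old), chunk_weights(new)
    copied = sum(min(count, b[piece]) for piece, count in a.items())
    return copied / max(len(old), len(new))

# Hypothetical files: a large shared block (think embedded fonts or
# document structure) followed by a small amount of distinct content.
boilerplate = b"".join(b"font-table-entry-%04d\n" % i for i in range(100))
doc_a = boilerplate + b"thesis text A\n" * 20
doc_b = boilerplate + b"thesis text B\n" * 20

print(similarity(doc_a, doc_b) >= 0.5)
```

With a shared block that dominates both files, the score clears the 50% default even though the "visible" content is entirely different, which is the kind of false pairing described above.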

The similarity score might be useful if these were not PDF files. But you can't turn off rename detection in git status: it's always on, with a limit of 200 unpaired files.

(After git status runs the HEAD-vs-index diff, it runs a second, index-vs-files, diff. That one does not have rename detection enabled since it makes no sense here. I mention it only because it's not obvious, at first, that what git status does is to run two git diffs.)

torek