1

I am abusing Git by using it, locally, as an incremental backup solution. This is partly to teach myself Git, and partly to combat JPG and MP3 file corruption, which happens once in a blue moon.

The repo gets huge, obviously. I need to purge non-existent files from history. (I have a lot of security videos that go into the system automatically, but also get deleted later, and I don't need a fully checked in video feed of my front yard in my .git folder.)

This is a matter of abusing the tool in the "right" way -- I don't mind wasting a lot of space for the files I have; I don't mind a file having 100 versions if it's a file that exists. But if it doesn't exist, I want it out of the repo, with no way to ever bring it back; erased from history completely.

ClioCJS
  • To remove a file completely, you'll have to rewrite the repository's history as though you hadn't added it. [Completely remove file from all Git repository commit history](http://stackoverflow.com/questions/307828/completely-remove-file-from-all-git-repository-commit-history) (especially the 2nd answer). – Jonathan Lonowski May 08 '16 at 23:38
  • I've looked at that before, but those are solutions for the situation of "I have one file to remove from the repo history, and I know its filename". In my case, it is more than one file, and I don't know its filename. Because the list of names would actually be "The set of files deleted since the last commit". It's very hard to get a list of files that don't currently exist! ;) – ClioCJS May 09 '16 at 03:01

3 Answers

1

There are two good tools for this problem. BFG Repo Cleaner can delete large files from history. Git Large File Storage, aka git-lfs, lets you put large files in Git without bloating your repository size.

Put them together and you can use BFG, with its new --convert-to-git-lfs option, to convert old commits of large files to git-lfs. Then use git-lfs for future commits of large files.

Schwern
  • Forgive me if I am getting the wrong impression here, but BFG Repo Cleaner seems to be set up for removing files from a repository if they meet a certain definition of "Big". But that is not what I want to do. Even if a file is 400 megs, I want to keep it in the repo if it exists on my harddrive. But if I have deleted it, I want to remove all traces of its history from the repo. – ClioCJS May 09 '16 at 02:48
  • @ClintL I was suggesting you not delete them at all. Instead use BFG + git-lfs to convert large files to git-lfs. Then you can retain their history AND have a slim repo. Otherwise BFG can do what you want if you get a list of files which are no longer in HEAD. – Schwern May 09 '16 at 05:24
  • Gotcha. I'm new to this. But I actually don't want the history at all for files that no longer exist. (In my case, they've moved to a new and completely separate repo and will be tracked there; that's just how it is.) And for those that do, I don't mind a huge repo. I wouldn't mind a terabyte repo on a 200G music collection, for example, provided that the repo isn't tracking any files that are no longer on the harddrive. I'm working with 50TB of space. – ClioCJS May 09 '16 at 10:27
  • @ClintL You can make this work, and BFG is part of that, but it's going to be weird. Rewriting history in Git is a sometimes food, you want to make it normal procedure. I also don't understand your motivations. You say disk space is not an issue, so why go through the trouble? If it's to learn Git then lesson learned: BFG is the tool to remove files from history, but doing it all the time is a poor use of Git. Git is not an incremental backup. You could *make* an incremental backup using Git as the core... with a bunch of work. You should use an incremental backup system instead. – Schwern May 09 '16 at 19:27
  • The motivation is simple: I want to keep revisions on all my files - not code, but media (mp3+photos+personal videos) too. Space is not an issue *for that*, but I do not want to waste space keeping revisions on media I delete. I produce gigs of dashcam video every day. It is saved and repo'ed for some period of time, but eventually, it needs to be purged. I don't need a repo history on "video of Clint driving to work" for every day in my life. But I do keep such things for awhile for legal purposes, and I do revision-control them because, well, trust me that it's necessary. – ClioCJS May 10 '16 at 12:24
1

Use git ls-tree -r HEAD to list the files in your current commit (the -r makes the listing recurse into subdirectories),

and then remove the files which are no longer there with
https://github.com/rtyley/bfg-repo-cleaner

It is the perfect tool for this kind of task.
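The "files which are no longer there" can be computed by comparing every path that history has ever touched against the paths in HEAD. The following is only a sketch under assumptions: the function name list_deleted_paths is made up for illustration, it is meant to be run from the top of the working tree, and it prints full paths, so the output may need adapting before it is handed to a history-rewriting tool.

```shell
# Hypothetical helper: print every path that appears somewhere in history
# (on any ref) but is absent from the current HEAD tree.
list_deleted_paths() {
    in_history=$(mktemp) && in_head=$(mktemp) || return 1

    # Every path that any commit on any ref has added, changed, or deleted.
    git -c core.quotePath=false log --all --no-renames \
        --pretty=format: --name-only |
        grep -v '^$' | LC_ALL=C sort -u > "$in_history"

    # Every path present in the current HEAD tree.
    git -c core.quotePath=false ls-tree -r --name-only HEAD |
        LC_ALL=C sort -u > "$in_head"

    # Keep only lines unique to the first list: in history, gone from HEAD.
    LC_ALL=C comm -23 "$in_history" "$in_head"

    rm -f "$in_history" "$in_head"
}
```

Running list_deleted_paths inside the repository prints one path per line. A path that was deleted and later re-created counts as existing, because it is present in HEAD.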

BFG Repo-Cleaner

an alternative to git-filter-branch.

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Examples (from the official site)

In all these examples bfg is an alias for java -jar bfg.jar.

# Delete all files named 'id_rsa' or 'id_dsa' :
bfg --delete-files id_{dsa,rsa}  my-repo.git



After you have cleaned your repository use this tool to store your large files.


CodeWizard
1

This is indeed fairly severe abuse of the tool. It would probably be better to figure out what is corrupting the original files. All Git will be really giving you here is content checksumming, which you can do outside Git ... or inside Git, with less-severe abuse, by using a data structure other than the usual chain of commits.

In other words, if you want to do this to learn how to use Git the wrong way :-) I think there is a "better wrong way". Here is my suggestion:

  • Make each commit on a new, orphan branch. You can do this with git checkout --orphan <new-branch-name> or by using the "plumbing" tools git write-tree and git commit-tree.

  • Each branch is to contain one and only one commit. (If you are using the plumbing tools, you can use tags instead of branches.)

  • Then, to delete a backup (the whole thing), simply delete the branch (or tag) name.

Diagrammatically, instead of:

o--o--o--...--o--o   <-- master

              ^  ^
              |   \
              |  the most recent
              |
         an hour ago, or yesterday, or whatever

your commits will be:

o   <-- backup-20160508T101112.13

o   <-- backup-20160508T131415.16

...

These names are more or less ISO-date-format, YYYYMMDDTHHMMSS.ss (date, then hours, minutes, seconds, and hundredths of a second); but you may use any names that make the most sense to you.
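The plumbing route can be sketched as below. This is an illustration, not a finished backup script: make_backup is a hypothetical name, the tag name uses whole seconds only (fractional seconds are not available from every date implementation), and a second backup made within the same second would fail because the tag already exists.

```shell
# Hypothetical helper: snapshot the working tree as one parentless commit
# and point a tag at it. Prints the tag name on success.
make_backup() {
    backup_name="backup-$(date +%Y%m%dT%H%M%S)"

    # Stage everything, including deletions, so the index mirrors the disk.
    git add -A || return 1

    # Write the index out as a tree object.
    backup_tree=$(git write-tree) || return 1

    # Create a commit for that tree; no -p option means it has no parent.
    backup_commit=$(echo "$backup_name" | git commit-tree "$backup_tree") || return 1

    # Name the commit with a tag so it stays reachable.
    git tag "$backup_name" "$backup_commit" && echo "$backup_name"
}
```

Because the commit is created directly with git commit-tree, the current branch is never moved; the only reference to each backup is its tag.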

Note that if two backups commit the same files, they re-use all the underlying Git "blob" objects, so two backups take basically the same space as one backup. Removing one of these two backups (by deleting the branch or tag name) has no effect since all those files are referred-to by the other backup.

If one file (xyz.txt) is slightly different, Git will delta-compress it against another file (in any other commit) in Git's usual way: the commits need not be joined by parent/child relationships. Note that image and movie files rarely compress well in Git anyway (because they are already compressed: information theory says that if the first compression was any good, the second attempt will not help).

Now let's say you decide you no longer need to back up file foo.jpg. Just remove it from the working tree: its contents will be garbage-collected once every backup that still contains it has been deleted, that is, once the oldest remaining backup was made after the removal. It's true that removed files will remain in older backups, but only for as long as you keep those backups.
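Deleting a backup and reclaiming its space might look like the sketch below. The helper name drop_backup is hypothetical, and it assumes backups are stored as tags. Git normally keeps unreachable objects for a grace period, so the sketch expires reflogs and prunes immediately, which makes the deletion irreversible.

```shell
# Hypothetical helper: delete one backup tag, then discard every object
# that no remaining ref (or the index) still refers to.
drop_backup() {
    git tag -d "$1" || return 1

    # Drop reflog entries that could still keep old commits reachable.
    git reflog expire --expire=now --all

    # Repack and remove unreachable objects right away instead of
    # waiting for the default grace period.
    git gc --quiet --prune=now
}
```

Files shared with other backups survive, since their blobs are still referenced; only content unique to the dropped backup is removed.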

torek
  • Well, it's more than checksumming - It lets me restore previous versions. I've already purposefully corrupted files and restored them to their previous version to prove that it meets my standards. I just don't want the repo to be that big. It's not like I don't have 50TB of space, but I'd rather not waste it. – ClioCJS May 09 '16 at 02:46
  • I really like this idea, but I think it falls down if the files are intentionally modified. (ie, there are a bunch of orphan branches containing copies of that file, but no history for that file, because the parent lineage is lost using this method). The alternate structure is a branch per file, which is appended to each time the file changes, and completely discarded when the file is deleted (by deleting the branch of the same name). – Dave May 09 '16 at 03:06
  • @Dave: that's a nice alternative. It's more of a pain to work with directly (you'd want some sort of wrapper script for "check in" and "check out" steps, I think) and may deserve more complex branch management, e.g., a pseudo-branch (single commit) that maintains a list of which files are "the files" (all you really need is a branch name, but perhaps a sneaky method that makes a tree and points a commit to it, would also work). – torek May 09 '16 at 04:34
  • In my case, I don't have "one specific file I want to remove from history", but rather, "remove all files from the history if they don't exist on the disk". [meant to click 'submit' before going to sleep; I see some new comments popped out while I was sleeping]. – ClioCJS May 09 '16 at 10:24
  • Anyway, the implementation of this solution seems way too hard, when I feel like it should just be a single git command that, for some reason, the git authors have not thought to add ;) – ClioCJS May 09 '16 at 10:30