
The git version control system is a kind of distributed log (with some conceptual similarities to the Raft consensus protocol).

Raft and some other systems have a concept of log compaction, so that redundant changesets don't bloat the overall log of changes.

What I want is to 'bulk clean' deleted files - not isolate a single one for exclusion.

My question is: Can I flatten out deleted files from a git repository?

EDIT:

  • suppose in my history I have five separate scenarios of someone checking in five different 100 MB binary files at different points in time, and I'd rather not have to download all of that each time someone does a clone. I'm looking for a 'bulk clean' of deleted files from my repo whilst still keeping my repo.
hawkeye
  • What do you mean by "clean" or "flatten out" deleted files? They can't be removed from the repository if they're referenced by any commits; otherwise you'd be losing part of your revision history. But Git stores files by content hash, so there's only one copy of each version of a file, no matter how many revisions it's part of. – Wyzard May 08 '16 at 09:53
  • All files except the current tree are "deleted". If you want to get rid of them just create a fresh repository and import current tree into it. What exactly do you want to remove and what to keep? – Piotr Praszmo May 08 '16 at 10:08
  • Thanks @Wyzard - I've clarified the scenario. – hawkeye May 08 '16 at 11:28
  • Thanks @Banthar - that's enormously helpful. I just wanted something that was half-way to that - so the 100 people using my repo don't have to change repos. – hawkeye May 08 '16 at 11:28
  • If someone committed a big file by mistake, you can [rewrite history](https://www.kernel.org/pub/software/scm/git/docs/user-manual.html#problems-With-rewriting-history) to squash the add/remove commits as if it never happened. See http://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-git-repository and related questions. – Piotr Praszmo May 08 '16 at 11:38
  • Sure I can do that one by one - but is there a 'bulk' way to do this? – hawkeye May 08 '16 at 11:42
  • See the first answer in linked question. BFG lets you remove all deleted blobs by name and size. – Piotr Praszmo May 08 '16 at 11:46
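
For reference, a minimal sketch of the BFG approach suggested in the comments above (the repository URL and jar file name are placeholders; this rewrites history, so everyone will need to re-clone afterwards):

git clone --mirror git://example.com/my-repo.git
java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git
cd my-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive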

2 Answers


"suppose in my history - I have five separate scenarios of someone checking in a 100M file - and I'd rather not have to download that each time someone does a clone."

Git already does this. As long as the file contents are identical, the hash will be identical: Git identifies file contents by hash, so the same file checked in at different points in history resolves to the same blob and does not increase space usage.
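
A quick way to see this (the file names are just illustrative; git hash-object is the plumbing command that computes the content hash):

printf 'same content\n' > a.bin
printf 'same content\n' > b.bin
git hash-object a.bin b.bin    # prints the same object ID twice, so only one blob is stored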

If, on the other hand, the file contents are slightly different, then space may or may not be saved, depending on where the files sit in the git tree and on the options used when a git gc is performed. (That assumes the files delta-compress well; binary files may or may not. Look up git delta compression.)
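
To see how much space the repository uses, or to make git search harder for deltas when repacking, something like this is a reasonable starting point (the depth and window values are illustrative, not recommendations):

git count-objects -v                          # object and pack-file statistics
git repack -a -d --depth=250 --window=250     # repack with a larger delta search window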

Having said all that, git in many ways does not work well with large binary files (I'm assuming the 100 MB files are binary, though they may not be), and you may want to look at something like Git LFS (Large File Storage) or another large-file extension for git, or an SCM other than git.
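
If you go the large-file route, a minimal Git LFS setup looks roughly like this (assuming the git-lfs extension is installed, with *.bin standing in for whatever pattern your large files match):

git lfs install                 # set up the LFS hooks for this repository
git lfs track "*.bin"           # store matching files as LFS pointers
git add .gitattributes          # the tracking rule lives in .gitattributes
git commit -m "Track large binaries with Git LFS"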

Mort

Ok - here is the list of things to check:

You can run:

git gc

You can get information using:

git count-objects -v

There is a script here for git-fatfiles.

There is also a script for recreating all the branches in a new repo.
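
That script isn't reproduced here, but the general idea can be sketched with plain git (the remote name and URL are placeholders):

git remote add fresh <url-of-new-empty-repo>
git push fresh --all     # every branch
git push fresh --tags    # every tag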

Using the following you can list the packed objects sorted by size, with the biggest last:

git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3n

Using that output, you can find which commit introduced the blob that takes up the space.
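
For example, once verify-pack has given you the SHA-1 of a large blob, something along these lines maps it back to a path and the commits that touched it (the SHA-1 and path are placeholders):

git rev-list --objects --all | grep <blob-sha1>
git log --all --oneline -- <path-from-previous-step>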

hawkeye