
I asked this question on #git earlier, but as it's reasonably substantial I'll post it here. I want to run a filter-branch on a repo to modify thousands of files over hundreds of commits using a Python script. I'm calling the clean.py script with the following command in the repo directory:

git filter-branch -f --tree-filter '(cd ../cleaner/ && python clean.py --path=files/*/*/**)'

clean.py looks like this and modifies every file matching the path (i.e. files/*/*/**):

from os import environ as environment
import argparse, yaml
import logging
from cleaner import Cleaner

parser = argparse.ArgumentParser()
parser.add_argument("--path", help="path to run cleaner on", type=str)
args = parser.parse_args()

# logging.basicConfig(level=logging.DEBUG)

with open("config.yml") as sets:
    config = yaml.safe_load(sets)  # safe_load: plain config data only, no arbitrary object construction

path = args.path
if not path:
    path = config["cleaner"]["general_pattern"]

cleaner = Cleaner(config["cleaner"])

print("Cleaning path: " + str(path))
cleaner.clean(path, True)

Running the command prints the following to the terminal:

$ python deploy.py --verbose
INFO:root:Checked out master branch
INFO:root:Running command:
'git filter-branch -f --tree-filter '(cd C:/Users/Graeme/Documents/programming/clean-cdn/clean-jsdelivr/ && python clean.py --path=files/*/*/**)' -d "../tmp"' in ../jsdelivr
Rewrite 298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e (1/1535)
Cleaning path: files/*/*/**

C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 343: ../commit: No such file or directory
C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 346: ../map/298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e
: No such file or directory
could not write rewritten commit
rm: cannot remove `/c/Users/Graeme/Documents/programming/clean-cdn/tmp/revs': Permission denied
rm: cannot remove directory `/c/Users/Graeme/Documents/programming/clean-cdn/tmp': Directory not empty

The Python script executes successfully and modifies the files correctly, but filter-branch fails to write the rewritten commit. There appears to be a permissions issue, but running with elevated privileges did not get around it. I've tried running the filter-branch on Windows 7, Windows 8, and Ubuntu with git v1.8 and v1.9.
Edit: The script works as is on CentOS with git 1.7.1.

The goal is to reduce the size of a CDN's repo (nearing 1GB) after the contents of files/*/*/** finish syncing with a database.
The source code of the project
Target repo for the rewrite

megawac

3 Answers


The permissions issue you're encountering is interesting. Are you doing this on a local copy of the repo (i.e. one where you have full access to the filesystem), or on a remote server?

Reading over your Python code, it looks like you're trying to remove every file over a certain size that is not a .INI file. Did I get that right?

If that's the case, can I ask if you've considered The BFG Repo-Cleaner? Obviously, you learn a lot about Git by writing your own code (I know I have), but I think The BFG is probably tailor-made for your needs - and will be faster than any git-filter-branch based approach.

In your case, you might want to run it with a command like:

$ java -jar bfg.jar --strip-blobs-bigger-than 100K  my-repo.git

This removes all blobs bigger than 100K that aren't in your latest commit.

I did a quick run with this on the jsdelivr repo, and reduced pack size from 284M to 138M in the cleaned repo. The BFG cleaning step took under 5 seconds, the subsequent git gc --prune=now --aggressive just under 2 minutes.
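To preview which blobs a size threshold like this would catch before rewriting anything, a rough sketch along the following lines may help. It is not part of The BFG: `blobs_over` and `list_large_blobs` are hypothetical helper names, and reading 100K as 100 * 1024 bytes is an assumption. It also ignores The BFG's protection of the latest commit and counts unreachable objects, so treat the result as an upper bound.

```python
import subprocess


def blobs_over(batch_check_lines, threshold_bytes):
    """Return (size, sha) pairs for blobs larger than threshold_bytes, biggest first."""
    hits = []
    for line in batch_check_lines:
        parts = line.split()
        # Expected shape is "<sha> <type> <size>"; skip anything else,
        # such as "<sha> missing", and skip non-blob objects.
        if len(parts) != 3 or parts[1] != "blob":
            continue
        size = int(parts[2])
        if size > threshold_bytes:
            hits.append((size, parts[0]))
    return sorted(hits, reverse=True)


def list_large_blobs(repo_path, threshold_bytes):
    """Ask git for every object's type and size, then filter with blobs_over."""
    out = subprocess.check_output(
        ["git", "cat-file", "--batch-check", "--batch-all-objects"],
        cwd=repo_path,
    )
    return blobs_over(out.decode().splitlines(), threshold_bytes)
```

Calling `list_large_blobs("my-repo.git", 100 * 1024)` should then give a list of candidate blobs to inspect; `--batch-all-objects` needs a reasonably recent git.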

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Roberto Tyley
  • Also our current files aren't sacred. Is there any way to have your tool hit all commits up to `HEAD`? – megawac Mar 30 '14 at 13:18
  • Re the sacred: --no-blob-protection is your (scary) friend! – Roberto Tyley Mar 30 '14 at 13:23
  • Alright, neat - looks promising. Is there any way to specify the `***REMOVED***` text, and does your project support globbed paths? – megawac Mar 30 '14 at 13:24
  • If you want to do line-by-line text-substitution (rather than just deleting files?) then you can configure the replacement text thusly: http://stackoverflow.com/a/15730571/438886 – Roberto Tyley Mar 30 '14 at 13:46
  • I'm curious about the need to 'nullify' (ie set file length to zero) the files, rather than delete them - how does that work better for your use case? – Roberto Tyley Mar 30 '14 at 16:59
  • It's [a request by the project owner](https://github.com/jsdelivr/jsdelivr/issues/347#issuecomment-36888773) to keep the file structure appearing the same but the file contents empty. I guess it's to make it easy to maintain folder structure in future version commits – megawac Mar 30 '14 at 18:49
  • Ah thanks, context is good! It wouldn't be hard to change the BFG to zero the files (https://github.com/rtyley/bfg-repo-cleaner/blob/ed21bed/bfg-library/src/main/scala/com/madgag/git/bfg/cleaner/treeblobs.scala#L41 ), but from reading issue 347, I don't think it's essential to the spirit of what you're trying to do - replacement files called 'filename.REMOVED.git-id' would be fine I think. Overall, I'm not sure that /frequent/ history rewrites would be good for the jsdelivr project tho' - would make it rather confusing for people submitting pull-requests? – Roberto Tyley Mar 30 '14 at 22:57
  • Hmm quite likely, users would have to rebase all the time eh? – megawac Mar 30 '14 at 23:02
  • Yup - so you're better off just doing one Big clean, and getting the benefit from that. You could do them annually. Unfortunately, Git, by itself, is not a great place for storing transient large files. Storing all history means repo size balloons... personally, I think jsDelivr would be better served asking contributors to make their artifacts available using the GitHub releases system... https://developer.github.com/v3/repos/releases/#upload-a-release-asset - and then just pull-request the metadata pointing to that artifact. – Roberto Tyley Mar 30 '14 at 23:49
  • Agreed, and I think we're going to eventually mix the two solutions, but we have an idea in mind that doesn't force constant rebasing. Your tool is sweet (kind of want to figure it out now) - if you wouldn't mind suggesting how to empty the files and either stating *ignore paths* or *include paths* I'll go ahead and accept this – megawac Apr 05 '14 at 06:00
  • Haven't used Java in years, sorry for the dumb questions while figuring out what's possible with your API... I noticed reading through your code that only [`List('K', 'M', 'G', 'T', 'P')`](https://github.com/rtyley/bfg-repo-cleaner/blob/master/bfg-library/src/main/scala/com/madgag/text/ByteSize.scala#L27) file sizes are supported. Is there any way to set the size in bytes? :) – megawac Apr 05 '14 at 16:36
  • @megawac: `touch filename` doesn't nullify files (if by nullifying you mean truncation to zero size), at least not on Linux. – Pavel Šimerda Apr 05 '14 at 19:46
  • Regarding byte-size - I've just cut release v1.11.3 of The BFG, with support for filtering files by single-byte filesizes! Will be visible at http://repo1.maven.org/maven2/com/madgag/bfg/ within a few hours. – Roberto Tyley Apr 05 '14 at 20:55
  • So would it be possible to write `bfg repo.git --delete-files files/*/*/**` (apparently doesn't take paths?) or preferably replace each file under `files/*/*/**` with an empty file with some filter? – megawac Apr 07 '14 at 05:00
  • Unfortunately The BFG doesn't currently support filtering by paths, and it's actually a quite fundamental architectural change to make. For your convenience, I would suggest you just re-enable blob protection - so that your crucial top-level files are not removed - and execute '$ bfg --strip-blobs-bigger-than 1K', or possibly '$ bfg --strip-blobs-bigger-than 1K --delete-files "*.js"' – Roberto Tyley Apr 07 '14 at 13:19
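For the "empty the files but keep the structure" request discussed in these comments, truncation rather than `touch` is what zeroes a file. A minimal sketch of that step in Python 3.5+, assuming a hypothetical helper named `truncate_matching` (this is not the project's actual `Cleaner` and not a BFG feature):

```python
import glob
import os


def truncate_matching(pattern, root="."):
    """Truncate every regular file under root matching pattern to zero bytes.

    Directories are left alone, so the folder structure is preserved.
    Returns the sorted list of truncated paths.
    """
    truncated = []
    # recursive=True makes "**" match files and directories at any depth.
    for path in glob.glob(os.path.join(root, pattern), recursive=True):
        if os.path.isfile(path) and not os.path.islink(path):
            # Opening for writing truncates the file; nothing is written.
            with open(path, "w"):
                pass
            truncated.append(path)
    return sorted(truncated)
```

Run from the top of a checked-out tree, `truncate_matching("files/*/*/**")` would zero everything two levels or more below `files/` while leaving files directly under `files/<name>/` untouched. Note that `glob` skips dot-files by default.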

You should not cd to another directory as the git-filter-branch script will use relative paths to access the files.

michas
  • The script loads some `.yml` files from its own directory, and filter-branch executes the command in the context of the repo's path. AFAIK there's no way to set a `cwd` path – megawac Mar 30 '14 at 13:08
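One possible way to avoid the `cd` without losing access to those `.yml` files is to have the script resolve them relative to its own location instead of the working directory. A minimal sketch, where `resolve_beside_script` is a hypothetical helper and not something clean.py currently has:

```python
import os


def resolve_beside_script(script_file, name):
    """Return an absolute path to `name` in the directory that holds script_file."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, name)


# Inside clean.py this could replace open("config.yml"):
# with open(resolve_beside_script(__file__, "config.yml")) as sets:
#     ...
```

The tree filter could then call the script by its path (for example `python ../cleaner/clean.py --path=...`) and stay in the directory filter-branch expects. Whether that alone resolves the error above is untested here.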

Consider using BFG. It is much faster and simpler to use.

ash