Resume rebuild git repo

Question

I would like to replay a git repo with some code reformatting and other code filters ... and yes I am aware of all the risks of doing so.

Unfortunately, this takes a very long time, and it is impossible to freeze work on the repo for that long. I know how to replay a branch as it stands at a given point in time.

What I am looking for is ideas on how to replay a branch from another repo with the ability to resume.

Essentially, an algorithm like this in pseudocode:

starting_sha = very_last
if resume {
    starting_sha = last_applied_sha
}
for_each sha = commit --reversed from starting_sha to the HEAD {
    git checkout sha
    apply some changes to the code
    git commit to target repo with metadata from sha
    update last_applied_sha = sha
}

Obviously, I can easily implement such a script, but the "git commit to target repo with metadata from sha" step is something I would rather not have to deal with on my own.

I am hoping that there is some git filter-branch type of functionality that will allow me to do so, without the need of dealing with tags and any other internals on my own.

Aren't you just describing an interactive rebase where you edit every commit? — jonrsharpe, Dec 23 '17 at 19:27
@jonrsharpe, no. rebase works on changes that are already in the branch. Those changes are in a different repo. This is exactly the challenge, how to translate changes from a different repo, on an already replayed branch. Because they have nothing in common anymore for git to make sense of the metadata as it could do if the replay was not there. — gsf, Dec 23 '17 at 19:31
@gsf In an *interactive* rebase, each commit can be *edited*, which allows incorporating changes that haven't ever been recorded in a commit previously. — mkrieger1, Dec 23 '17 at 19:35
@mkrieger1 I am not sure that I follow, can you offer an answer with a bit more details, how this is going to work? — gsf, Dec 23 '17 at 19:40
See e.g. https://stackoverflow.com/questions/179123/how-to-modify-existing-unpushed-commits or https://stackoverflow.com/questions/1186535/how-to-modify-a-specified-commit-in-git — mkrieger1, Dec 23 '17 at 19:41
@mkrieger1 these have nothing to do with my problem. The changes that keep flowing into the original repo are not yet in the new target repo, so what would amending do for me in that case? — gsf, Dec 23 '17 at 19:43
Change `pick` to `edit` for a commit you wish to edit. Once this commit has been rebased, the interactive rebase stops and you are basically at the `apply some changes to the code` step. When you have applied the changes (by editing the files yourself), use `git commit --amend` and then `git rebase --continue`. I suggest you try this out on a toy repository first. — mkrieger1, Dec 23 '17 at 19:46
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/161870/discussion-between-gsf-and-mkrieger1). — gsf, Dec 23 '17 at 19:46

mmlr · Answer 1 · 2017-12-24T09:53:26.197

1. Set up the target repository by cloning the source.

$ git clone <sourceRepo>

2. Check out the relevant branch. Replace branchname by the actual branch name (also in all the following steps).

$ git checkout branchname

3. Do an initial rewrite using filter-branch and a --tree-filter, updating tags in the process with --tag-name-filter. This is just an example filter that replaces the first occurrence of "text" on each line with "modified" in all files in the top-level directory matching the "*.txt" glob.

$ git filter-branch --tree-filter 'sed -i "s/text/modified/" *.txt' --tag-name-filter cat -- branchname

4. Create two tags to keep a record of the last source rev and the last target rev.

$ git tag lastsourcerev origin/branchname
$ git tag lasttargetrev branchname

Now whenever the time comes to update to new revisions from the source repo the following steps can be used. They only apply the tree-filter to the new commits and graft the new (rewritten) commits to the existing (previously rewritten) ones.

1. Fetch new commits/tags from the source repo:

$ git fetch origin

2. Reset to the new tip of the source branch.

$ git reset --hard origin/branchname

3. Apply filter-branch with an extra --parent-filter to graft the new commits to the existing ones. Note that we need the -f (force) option as the previous filter-branch command left refs/original. The --parent-filter makes use of the tags that stored the last source and target revs. The whole filter-branch is limited to the commits between the last processed source rev and the newest source commit (that we reset branchname to).

$ git filter-branch -f --tree-filter 'sed -i "s/text/modified/" *.txt' --tag-name-filter cat --parent-filter "sed s/$(git rev-parse lastsourcerev)/$(git rev-parse lasttargetrev)/g" -- lastsourcerev..branchname

4. Update the tracking tags to the new situation:

$ git tag -f lastsourcerev origin/branchname
$ git tag -f lasttargetrev branchname

Repeat these steps as needed. Once no more updates are to be done, the lastsourcerev and lasttargetrev helper tags can be deleted.
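
To see what the --parent-filter from step 3 does in isolation: filter-branch hands the filter the parent string of each commit on stdin (in the form "-p <sha>", or several "-p <sha>" entries for a merge) and uses whatever the filter prints as the new parent list. The sketch below runs the same sed substitution on hand-written input. The helper name graft_parent and the three commit ids are made up for illustration; they are not part of the steps above.

```shell
#!/bin/sh
# graft_parent OLD NEW: hypothetical helper that mimics the --parent-filter.
# It reads a parent string on stdin and writes it back with every
# occurrence of OLD replaced by NEW.
graft_parent() {
    sed "s/$1/$2/g"
}

# Made-up placeholder ids, for illustration only.
old_source=1111111111111111111111111111111111111111
new_target=2222222222222222222222222222222222222222
other=3333333333333333333333333333333333333333

# Ordinary commit whose only parent is the last processed source commit.
printf '%s\n' "-p $old_source" | graft_parent "$old_source" "$new_target"

# Merge commit: only the matching parent is swapped, the other one is kept.
printf '%s\n' "-p $other -p $old_source" | graft_parent "$old_source" "$new_target"
```

Commits whose parents do not include the last source rev pass through unchanged, which is why the same filter can be applied to every commit in the range.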

Note that the update process could be arbitrarily split into smaller increments by resetting the branch to some in-between commit from source and recording that commit as lastsourcerev. Likewise the initial rewrite could be split up by creating a branch pointing at an in-between commit from source and recording that as lastsourcerev and then applying the update steps to go further.
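
A minimal sketch of such a split, assuming the lastsourcerev and lasttargetrev tags already exist as described above and that GNU sed is available for the example filter. The function name rewrite_increment and its arguments are hypothetical. FILTER_BRANCH_SQUELCH_WARNING is set on the assumption that a newer Git would otherwise print a warning and pause before filter-branch starts.

```shell
#!/bin/sh
# Example tree filter taken from the steps above (relies on GNU sed -i).
TREE_FILTER='sed -i "s/text/modified/" *.txt'

# rewrite_increment BRANCH COMMIT (hypothetical helper):
# rewrite the source commits between lastsourcerev and COMMIT on BRANCH,
# graft them onto lasttargetrev, then move both tracking tags forward.
rewrite_increment() {
    ri_branch=$1
    # Resolve COMMIT first so that relative names survive the reset below.
    ri_upto=$(git rev-parse "$2^{commit}") || return 1
    ri_old_src=$(git rev-parse lastsourcerev) || return 1
    ri_old_tgt=$(git rev-parse lasttargetrev) || return 1

    # filter-branch fails when the range is empty, so stop early.
    if [ "$ri_upto" = "$ri_old_src" ]; then
        echo "nothing to rewrite"
        return 0
    fi

    git checkout -q "$ri_branch" || return 1
    git reset -q --hard "$ri_upto" || return 1

    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
        --tree-filter "$TREE_FILTER" \
        --tag-name-filter cat \
        --parent-filter "sed s/$ri_old_src/$ri_old_tgt/g" \
        -- lastsourcerev.."$ri_branch" || return 1

    # Record where this increment stopped, on both sides.
    git tag -f lastsourcerev "$ri_upto" || return 1
    git tag -f lasttargetrev "$ri_branch"
}
```

Called repeatedly with commits further along the source history, for example `rewrite_increment branchname origin/branchname~100` followed by `rewrite_increment branchname origin/branchname`, each call only processes the commits added since the previous one.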

Note also that this process relies solely on filter-branch to avoid any problems regarding tag rewrites or merge commits that rebasing newly incoming commits would otherwise inevitably cause.

Packaged as a shell script the incremental update part could look like this:

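#!/bin/sh
# Incrementally rewrite new source commits and graft them onto the
# previously rewritten history.

REMOTE=origin
LOCAL_BRANCH=master
REMOTE_BRANCH="$REMOTE/master"
SOURCE_REV_TAG=lastsourcerev
TARGET_REV_TAG=lasttargetrev
TREE_FILTER='sed -i "s/text/modified/" *.txt'

set -e

git fetch "$REMOTE"

# Nothing to rewrite if the source tag already points at the remote tip.
if [ "$(git rev-parse "$SOURCE_REV_TAG")" = "$(git rev-parse "$REMOTE_BRANCH")" ]
then
    echo "no new commits, nothing to do"
    exit 0
fi

# Move the local branch to the new, not yet rewritten source tip.
git checkout "$LOCAL_BRANCH"
git reset --hard "$REMOTE_BRANCH"

# Rewrite only the new commits; the parent filter swaps the last source
# rev for its rewritten counterpart, grafting onto the existing history.
git filter-branch -f --tree-filter "$TREE_FILTER" \
    --tag-name-filter cat \
    --parent-filter "sed s/$(git rev-parse "$SOURCE_REV_TAG")/$(git rev-parse "$TARGET_REV_TAG")/g" \
    -- "$SOURCE_REV_TAG"..

# Record the new source tip and the new rewritten tip (HEAD) for the next run.
git tag -f "$SOURCE_REV_TAG" "$REMOTE_BRANCH"
git tag -f "$TARGET_REV_TAG"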

The only edge case that comes up is when no new commits are available. In such a case the git reset --hard would update the local branch to the remote branch, but then no filter step would be applied because no revs are to be rewritten. The script above handles that by checking if the source rev tracking tag points at the same commit as the remote branch.

I am trying this approach, but `git reset --hard origin/branchname` essentially removes the work done so far. Am I missing something? — gsf, Dec 23 '17 at 23:34
The reset indeed does initially discard the respective previous rewrite. It is however still referenced in the `lasttargetrev` tag and subsequently grafted onto the rewritten new commits with the `--parent-filter` in the next step. A less scary version could be made using separate helper branches, but this way the clutter is reduced to a minimum. — mmlr, Dec 23 '17 at 23:42
This is really great and mostly works. When I tested it, though, I found that there are some changes (I am not sure yet how they differ) where, if the split happens at them, the result gets corrupted. It is 100% reproducible for such changes, and the result is either that the connection with the already rebuilt history gets lost, or that you somehow get the rebuilt and the original history together. Any idea what the problem might be? — gsf, Dec 24 '17 at 17:37
Without knowing more about these specific commits it's not really possible to guess. The only thing that comes to mind are merge commits, as they have multiple parents. But I tested with merges as well and they didn't pose any problem. Is there any error output when running the script? Maybe you can share the output and/or something about the state, for example where the source/target rev tag point at before and after and on what respective branch those commits are (use `git branch -a --contains ` to find out). — mmlr, Dec 24 '17 at 21:59
Also what exactly do you mean by "the rebuilt and the original together"? As in some kind of merge with both histories as parents? Or one history appended to the other? — mmlr, Dec 24 '17 at 22:04
Unfortunately, I was experimenting in a private repo, so I cannot share many details. I will try to reproduce the problem in some public repo. There are no errors; it finishes, but the result looks bad. As for the "together" thing, you can literally see the original and the replayed change next to each other in `git log`. Also, merge commits were the first thing that I checked; it is not that. — gsf, Dec 24 '17 at 22:24
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/161915/discussion-between-gsf-and-mmlr). — gsf, Dec 24 '17 at 22:25

VonC · Answer 2 · 2017-12-23T22:41:37.280

Rather than an interactive rebase, you could apply a git filter-branch which would visit every commit of your repo and apply any utility (or code reformatting) you want.

Since the filter-branch is a local operation, there is no need for "another" repo: you apply it to a local clone of your repo.
Note that it does not support a pause/resume workflow, so you will need to let it process to completion.

See "Reformatting Your Codebase with git filter-branch" (by Elliot Chance) as an example:

git filter-branch --tree-filter 'phpcbf $(\
  git show $GIT_COMMIT --name-status | egrep ^[AM] |\
    grep .php | cut -f2)' -- --all

For each commit, that would look for added/modified files only, isolate the php ones and apply a formatting tool.
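
As a rough alternative for the "added/modified files only" step, here is a sketch that asks `git diff-tree` for the file list instead of parsing `git show` output. This is a swapped-in technique, not what the linked article uses. The function name changed_php_files is hypothetical, and the "*.php" pathspec is assumed to be what "the php ones" means.

```shell
#!/bin/sh
# changed_php_files COMMIT (hypothetical helper):
# print the PHP files that COMMIT added or modified, one per line.
#   --root            treat the very first commit as adding all its files
#   --no-commit-id    do not print the commit id before the file list
#   --name-only       print paths only
#   --diff-filter=AM  keep Added and Modified entries, skip deletions etc.
#   -r                recurse into subdirectories
changed_php_files() {
    git diff-tree --root --no-commit-id --name-only --diff-filter=AM -r "$1" -- '*.php'
}
```

Inside a --tree-filter the commit being rewritten is available as $GIT_COMMIT, so the call would be `changed_php_files "$GIT_COMMIT"`. Note that plain `git diff-tree` prints nothing for merge commits, so files changed only in a merge would not be picked up by this sketch.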

That does not prevent anyone from committing during this time.
Your collaborators will need to clone the new (formatted) repo, add their own as a remote, fetch, and rebase their own commits (only their new ones) on top of the (newly formatted) branch history of the new repo.
In other words, a reconciliation step is to be done by each collaborator, in order to integrate back the work done during the reformat stage.

If not, the process needs to be reversed, and your new repo must add the old one (where everybody has pushed to, assuming the recent commits are properly formatted) as a remote (named 'oldRepo'):

cd /path/to/new/repo
git remote add oldRepo /path/to/old/central/repo
git fetch oldRepo

You can list commits after a certain date (i.e. the date your formatting process began).
For each commit (from oldest to newest), find its branch (git branch --contains).
For each branch found, do a git rebase --onto abranch acommit~ oldRepo/abranch

That will replay all commits after the parent of the old commit detected on a branch 'oldRepo/abranch' to the new repo abranch (which is missing commits, since they were done and pushed while that new repo was being rewritten)
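
The "list commits after a certain date" step could be sketched as follows. The helper names commits_since and first_commit_since are hypothetical, and the sketch assumes the cut-off is compared against the committer date, which is what `git log --since` uses.

```shell
#!/bin/sh
# commits_since REF DATE (hypothetical helper):
# list the commits reachable from REF that were committed after DATE,
# oldest first, one id per line.
commits_since() {
    git log --reverse --format=%H --since="$2" "$1"
}

# first_commit_since REF DATE (hypothetical helper):
# print only the oldest such commit, or nothing if there is none.
first_commit_since() {
    commits_since "$1" "$2" | head -n 1
}
```

With the placeholder names from above, `acommit=$(first_commit_since oldRepo/abranch "2017-12-20")` would give the commit to use in `git rebase --onto abranch "$acommit~" oldRepo/abranch`. The date here is only an example value.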

I need the resume and two repos, because I cannot afford to say that no one may commit until the rebuild finishes in 3 days. — gsf, Dec 23 '17 at 22:20
@gsf I hadn't seen your comment (cross posting). I don't think your colleague should refrain from doing commits, but they will need to re-integrate them on top of the branch of the new repo. I have edited my answer. — VonC, Dec 23 '17 at 22:24
That seems easier to say than to convince someone to do. This is why I am looking for a solution that will allow me to automatically catch up with the work in the current repo until the moment is right to switch to the newly rebuilt one. — gsf, Dec 23 '17 at 22:28
@gsf OK, you would need to reverse the process: import the old repo where everybody has pushed to, fetch it into your new repo finally formatted, and, for each commit done after the starting date of your formatting process, re-apply the branch (from that commit) onto the new branch (assuming the recent commits are properly formatted already) — VonC, Dec 23 '17 at 22:31
@gsf OK. I have edited the answer to lay out the necessary steps of that reconciliation process. — VonC, Dec 23 '17 at 22:42
