20

Let's say you have the repository:

myCode/megaProject/moduleA
myCode/megaProject/moduleB

Over time (months), you re-organise the project, refactoring the code to make the modules independent. Files in the megaProject directory get moved into their own directories. Emphasis on move: the history of these files is preserved.

myCode/megaProject
myCode/moduleA
myCode/moduleB

Now you wish to move these modules to their own Git repos, leaving the original with just megaProject on its own.

myCode/megaProject
newRepoA/moduleA
newRepoB/moduleB

The filter-branch command is documented to do this, but it doesn't follow history when files were moved outside of the target directory. So the history begins when the files were moved into their new directory, and the history the files had when they lived in the old megaProject directory is lost.

How do I split a Git history based on a target directory and follow history outside of this path, leaving only the commit history related to these files and nothing else?

The numerous other answers on SO focus on generally splitting apart the repo - but make no mention of splitting apart and following the move history.

simbolo

6 Answers

9

This is a version based on @rksawyer's scripts, but it uses git-filter-repo instead. I found it much easier to use and much faster than git filter-branch (and it is now recommended by Git as a replacement).

# This script should run in the same folder as the project folder is.
# This script uses git-filter-repo (https://github.com/newren/git-filter-repo).
# The list of files and folders that you want to keep should be named <your_repo_folder_name>_KEEP.txt. It should end with a newline after the last entry, otherwise the last file/folder will be skipped.
# The result will be the folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
# Tags are not preserved, see line below to preserve tags.
# Running subsequent times will backup the last run in <your_repo_folder_name>_REWRITE_CLONE_BKP.

# Define here the name of the folder containing the repo: 
GIT_REPO="git-test-orig"

clone="$GIT_REPO"_REWRITE_CLONE
temp=/tmp/git_rewrite_temp
rm -Rf "$clone"_BKP
mv "$clone" "$clone"_BKP
rm -Rf "$temp"
mkdir "$temp"
git clone "$GIT_REPO" "$clone"
cd "$clone"
git remote remove origin
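# "open" is macOS-only; it shows the clone and the temp folder in Finder. Remove these two lines on other systems.
open .
open "$temp"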

# Comment line below to preserve tags
git tag | xargs git tag -d

echo 'Start logging file history...'
printf '# git log results:\n\n' > "$temp"/log.txt

while read p
do
    shopt -s dotglob
    find "$p" -type f > "$temp"/temp
    while read f
    do
        echo "## " "$f" >> "$temp"/log.txt
        # print every file and follow to get any previous renames
        # Then remove blank lines.  Then remove every other line to end up with the list of filenames       
        git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt
        
        printf '\n\n' >> "$temp"/log.txt
    done < "$temp"/temp
done < ../"$GIT_REPO"_KEEP.txt > "$temp"/PRESERVE

mv "$temp"/PRESERVE "$temp"/PRESERVE_full
awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE

sort -o "$temp"/PRESERVE "$temp"/PRESERVE

echo 'Starting filter-repo --------------------------'
git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
echo 'Finished filter-repo --------------------------'

The script also writes the git log output to /tmp/git_rewrite_temp/log.txt; if you don't need log.txt and want the script to run faster, you can remove the lines that write to it.
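
For anyone unsure what the chain of awk filters in the script does, here is a minimal sketch that runs the same filters on canned text instead of a real repository. The sample log (made-up short hashes and hypothetical paths) is an assumption standing in for what git log --pretty=format:'%H' --name-only --follow prints; the filters are the ones used in the script, chained together with the de-duplication and sort steps.

```shell
#!/bin/bash
# Canned stand-in for `git log --pretty=format:'%H' --name-only --follow -- <file>`.
# Hashes and paths are hypothetical.
sample_log='c3c3c3
moduleA/util.c

b2b2b2
megaProject/moduleA/util.c

a1a1a1
megaProject/moduleA/util.c'

# 1. drop blank lines, 2. keep every second line (the file names),
# 3. drop duplicates, 4. sort the result.
preserve=$(printf '%s\n' "$sample_log" \
  | awk 'NF > 0' \
  | awk 'NR%2==0' \
  | awk '!a[$0]++' \
  | sort)

echo "$preserve"
```

Each line of the result is a path the file has had at some point in its history, one path per line, which is the shape of list the script then hands to --paths-from-file.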

noelicus
Roberto
  • Awesome example of the use of an awesome tool! After a day of troubles with filter-branch, running for 40 minutes only not to work, this solved it correctly in about 5 seconds. – Tobb Feb 11 '20 at 12:32
  • I had some messy old, empty commits, so I ended up adding `--prune-empty always` to the filter-repo command. – Tobb Feb 11 '20 at 12:35
  • The auto setting will prune all commits that end up as empty when rewriting the repo. In my case, I guess I have actual empty commits. They seem to originate from the repo before it was git (svn), and probably wound up empty for some reason, either through svn being svn, or in the migration to git. Anyways, no reason to keep the commits, and they should probably just be removed from the original repo itself. – Tobb Feb 12 '20 at 09:30
  • I'm kind of new to git-filter-repo, but reading through the documentation, shouldn't `git filter-repo --analyze` be able to give you information on renames? – Leo Dec 18 '20 at 13:19
  • I found your shell script version a little too different from what I'd have implemented to feel comfortable with it, so I [wrote one in Python](https://gist.github.com/ssokolow/b2e3247db0cac3d14cf2bac07ccbf963) which behaves more similarly to bare `git-filter-repo`, has `--help`, and has a bunch of safety guards. I'm not sure what would be the most appropriate way to make it its own answer in this particular case. (It's a Gist, but it's also too long to code-block here IMO.) – ssokolow Nov 25 '21 at 10:15
  • I'd add it as an answer. If it's an improvement it's better for the community, so deserves more visibility. Although I know my script works well, my shell skills are meagre so the code is ugly. – Roberto Nov 26 '21 at 20:36
4

Running git filter-branch --subdirectory-filter in your cloned repository will remove all commits that don't affect content in that subdirectory, which includes those affecting the files before they were moved.

Instead, you need to use the --tree-filter flag (or the faster --index-filter) with a script that deletes all files you're not interested in, together with the --prune-empty flag to drop any commits that end up empty because they only affected other content.

There's a blog post from Kevin Deldycke with a good example of this:

git filter-branch --prune-empty --tree-filter 'find ./ -maxdepth 1 -not -path "./e107*" -and -not -path "./wordpress-e107*" -and -not -path "./.git" -and -not -path "./" -print -exec rm -rf "{}" \;' -- --all

This command effectively checks out each commit in turn, deletes all uninteresting files from the working directory and, if anything has changed from the last commit then it checks it in (rewriting the history as it goes). You would need to tweak that command to delete all files except those in, say, /moduleA, /megaProject/moduleA and the specific files you want to keep from /megaProject.
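
As a rough sketch of that tweak, the deletion step could look something like the function below. The directory names come from the question; the throwaway directory tree and its file names are hypothetical and only there so the find expression can be tried without touching a real repository. This is an illustration of the idea, not the command from the blog post.

```shell
#!/bin/bash
# prune_tree: delete every file under the current directory except those
# beneath the paths we want to keep. A --tree-filter would run something
# like this once per commit.
prune_tree() {
  find . -type f \
    -not -path './moduleA/*' \
    -not -path './megaProject/moduleA/*' \
    -not -path './.git/*' \
    -delete
}

# Try it on a throwaway directory tree (hypothetical layout and file names).
demo=$(mktemp -d)
mkdir -p "$demo/moduleA" "$demo/megaProject/moduleA" "$demo/megaProject/moduleB"
touch "$demo/moduleA/a.c" "$demo/megaProject/moduleA/old_a.c" \
      "$demo/megaProject/moduleB/b.c" "$demo/README"

(cd "$demo" && prune_tree)
remaining=$(cd "$demo" && find . -type f | sort)
echo "$remaining"
rm -rf "$demo"
```

Only the files under moduleA and megaProject/moduleA survive. To use it for real, the find command would go inside the quotes of --tree-filter in place of the one above, with extra -not -path clauses for any individual files from megaProject you also want to keep.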

Matthew Strawbridge
  • It didn't work for me, for some reason it deletes `.git/refs/heads`, destroying my repo. Interestingly enough not all files inside `.git` are deleted. Do you know why this may be happening? Also, I fail to see how this solution would preserve moves/renames. – Roberto Dec 17 '19 at 01:50
2

I'm aware of no simple way to do this, but it can be done.

The problem with filter-branch is that it works by

applying custom filters on each revision

If you can create a filter which won't delete your files they will be tracked between directories. Of course this is likely to be non-trivial for any repository which isn't trivial.

To start: Let's assume it is a trivial repository. You have never renamed a file, and you have never had files in two modules with the same name. All you need to do is get a list of the files in your module:

find megaProject/moduleA -type f -printf "%f\n" > preserve

and then run your filter using those filenames, and your directory:

preserve.sh

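#!/bin/bash
# Build one find expression that excludes every filename listed in the preserve file.
args=(. -type f ! -name d1)
while IFS= read -r f; do
  args+=(! -name "$f")
done < /path/to/myCode/preserve
# Delete every file that was not excluded; quoting keeps names with spaces intact.
find "${args[@]}" -delete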

git filter-branch --prune-empty --tree-filter '/path/to/myCode/preserve.sh' HEAD

Of course it's renames that make this difficult. One of the nice things about git filter-branch is that it gives you the $GIT_COMMIT environment variable. You can then get fancy and use things like:

# --follow works on one file at a time, so walk every file in the module
find megaProject/moduleA -type f | while IFS= read -r f
do
 git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF { printf "%s:", $0; next } { print "" }'
done > preserve

to build a filename history, with commits, that could be used in place of the simple preserve file in the trivial example, but the onus is going to be on you to keep track of what files should be present at each commit. This actually shouldn't be too hard to code out, but I haven't seen anybody who's done it yet.
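
To make the shape of such a history concrete, here is a small sketch that pairs each commit with the path a file had in that commit. It runs on canned text rather than a real repository: the sample log, with made-up short hashes and hypothetical paths, is an assumption standing in for the git log output above. It is only a starting point and does not solve the bookkeeping problem of deciding which files belong in every commit.

```shell
#!/bin/bash
# Canned stand-in for `git log --pretty=format:'%H' --name-only --follow -- <file>`.
# Hashes and paths are hypothetical.
sample_log='c3c3c3
moduleA/util.c

b2b2b2
megaProject/moduleA/util.c

a1a1a1
megaProject/util.c'

# Non-blank lines alternate between a commit hash and a path:
# remember the hash, then print "hash:path" when the path arrives.
pairs=$(printf '%s\n' "$sample_log" | awk '
  NF {
    if (hash == "") { hash = $0 }
    else            { print hash ":" $0; hash = "" }
  }')

echo "$pairs"
```

A filter script could then look up $GIT_COMMIT in a list like this to learn what the file was called at that point, although commits that did not touch the file will not appear in it.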

Guildencrantz
1

Following on from the answer above: first, iterate through all of the files in the directory being kept, using git log --follow to get the old paths/names from prior moves/renames. Then use filter-branch to iterate through every revision, removing any files that are not on the list created in step 1.

#!/bin/bash
DIRNAME=dirD

# Catch all files including hidden files
shopt -s dotglob
for f in "$DIRNAME"/*
do
# print every file and follow to get any previous renames
# Then remove blank lines.  Then remove every other line to end up with the list of filenames
 git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0'
done > /tmp/PRESERVE

sort -o /tmp/PRESERVE /tmp/PRESERVE
cat /tmp/PRESERVE

Then create a script (preserve.sh) that filter-branch will call for each revision.

#!/bin/bash
DIRNAME=dirD

# List every file in this revision that is outside $DIRNAME,
# one per line, without the leading ./, sorted for comm.
# ($DIRNAME must be in double quotes so that it is expanded.)
find . -type f -not -path './.git/*' -not -path "./$DIRNAME/*" | cut -c3- | sort > /tmp/ALL2

# Delete everything that's not in the PRESERVE list
comm -23 /tmp/ALL2 /tmp/PRESERVE > /tmp/DELETE_THESE
echo 'delete these:'
cat /tmp/DELETE_THESE

# Quote the name so paths containing spaces are removed correctly
while IFS= read -r f; do
  rm -- "$f"
done < /tmp/DELETE_THESE

Now use filter-branch. If all files are removed in a revision, --prune-empty drops that commit and its message.

 git filter-branch --prune-empty --tree-filter '/FULL_PATH/preserve.sh' master
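
The heart of preserve.sh is the comm -23 comparison, so here is a minimal sketch of just that step on two small lists. The file names and the temporary directory are hypothetical; in the real script the first list comes from find and the second from the git log --follow loop.

```shell
#!/bin/bash
work=$(mktemp -d)

# Files found in one revision (hypothetical names), sorted for comm.
printf '%s\n' 'docs/notes.txt' 'megaProject/moduleB/b.c' 'megaProject/util.c' \
  | sort > "$work/ALL"

# Every path the kept files have ever had, sorted the same way.
printf '%s\n' 'megaProject/util.c' 'moduleA/util.c' \
  | sort > "$work/PRESERVE"

# -2 hides lines found only in PRESERVE, -3 hides lines found in both,
# so what is left is the set of files to delete from this revision.
to_delete=$(comm -23 "$work/ALL" "$work/PRESERVE")

echo "$to_delete"
rm -rf "$work"
```

Both inputs have to be sorted in the same order, which is why the scripts above run sort on each list before calling comm.
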
Roberto
rksawyer
  • This works well! I had only to change a few things to make it work with paths that contain spaces. – Roberto Dec 22 '19 at 22:28
  • @Roberto Hi, by any chance, do you still have the version that fixes the spaces? – Stals Jan 20 '20 at 16:28
  • @Stals Hi. You have to add quotes when using the variables, like "$DIRNAME". I posted mine as a new answer. – Roberto Jan 21 '20 at 00:45
0

Here's my version of the script @Roberto posted, written for Linux/WSL. If you don't supply a "myrepo_KEEP.txt" file, it will create one based on the current file structure. Pass in the repo to work on:

prune.sh MyRepo

# This script should run one level up from the git repo folder (i.e. the  containing folder)
# This script uses git-filter-repo (github.com/newren/git-filter-repo).
# The result will be the folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
# Tags are not preserved, see line below to preserve tags.
# Running subsequent times will backup the last run in <your_repo_folder_name>_REWRITE_CLONE_BKP.
# Optionally, list the files and folders that you want to keep in the KEEP_FILE (<your_repo_folder_name>_KEEP.txt)
## It should end with a newline after the last entry, otherwise the last file/folder will be skipped.
## If this file is missing it will be created by this script with all current folders listed.

echo "Prune git repo"

# User needs to pass in the repo name
GIT_REPO=$1

if [ -z "$GIT_REPO" ]; then
    echo "Pass in the directory to prune"
else
    KEEP_FILE="${GIT_REPO}"_KEEP.txt

    # Build up a list of current directories in the repo, if one hasn't been supplied
    if [ ! -f "${KEEP_FILE}" ]; then
        echo "Keeping all current files in repo (generating keep file)"
        cd "$GIT_REPO"
        find . -type d -not -path '*/\.*' > "../${KEEP_FILE}"
        cd ..
    fi

    echo "Pruning $GIT_REPO"

    clone="${GIT_REPO}_REWRITE_CLONE"
    
    # Shift backup
    bkp="${clone}_BKP"
    temp=/tmp/git_rewrite_temp
    echo $clone
    rm -Rf "$bkp"
    mv "$clone" "$bkp"
    
    # Setup temp
    rm -Rf "$temp"
    mkdir "$temp"   
    
    # Clone
    echo "Cloning repo...from $GIT_REPO to $clone"
    if git clone "$GIT_REPO" "$clone"; then
        cd "$clone"
        git remote remove origin

        # Comment line below to preserve tags
        git tag | xargs git tag -d

        echo 'Start logging file history...'
        printf '# git log results:\n\n' > "$temp"/log.txt

        # Follow the renames
        while read p
        do
            shopt -s dotglob
            find "$p" -type f > "$temp"/temp
            while read f
            do
                echo "## " "$f" >> "$temp"/log.txt
                # print every file and follow to get any previous renames
                # Then remove blank lines.  Then remove every other line to end up with the list of filenames       
                git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt

                printf '\n\n' >> "$temp"/log.txt
            done < "$temp"/temp
        done < ../"${KEEP_FILE}" > "$temp"/PRESERVE

        mv "$temp"/PRESERVE "$temp"/PRESERVE_full
        awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE

        sort -o "$temp"/PRESERVE "$temp"/PRESERVE

        echo 'Starting filter-repo --------------------------'
        git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
        echo 'Finished filter-repo --------------------------'
        cd ..
    fi
fi

Credit to @rksawyer & @Roberto.

noelicus
-2

We painted ourselves into a much worse corner, with dozens of projects across dozens of branches, with each project dependent on 1-4 others, and 56k commits total. filter-branch was taking up to 24 hours just to split a single directory off.

I ended up writing a tool in .NET using libgit2sharp and raw file system access to split an arbitrary number of directories per project, and only preserve relevant commits/branches/tags for each project in the new repos. Instead of modifying the source repo, it writes out N other repos with only the configured paths/refs.

You're welcome to see if this suits your needs, modify it, etc. https://github.com/CurseStaff/GitSplit

Chip Paul
  • The linked repo doesn't exist or isn't public. – ChrisW Aug 31 '22 at 07:34
  • Sounds great, would be nice to be able to see it? If you want this answer to be upvoted you'll want to post some useful details not just posting a hyperlink too, btw. – noelicus Sep 14 '22 at 15:52