I have a local git repository and a large number of files (~4 million, total size ~700GB) that I want to check into git. Using git filters, I want to track not the files' real contents but only a reference to each file (similar to what git lfs does). Adding and committing the files (in chunks) still takes a very long time, and I'm hoping to reduce that time by using git fast-import.
I can't figure out, though, how to replicate git add <file> && git commit -m <message> exactly using git fast-import. Consider the following situation:
mkdir /tmp/git_fast_test && cd /tmp/git_fast_test
git init
echo "1234" > testfile
Now I run the following Python script, which commits a file testfile with mode 644 and content 1234 to the git repo. This should now correspond exactly to /tmp/git_fast_test/testfile.
import subprocess
import time
proc = subprocess.Popen(["git", "fast-import"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, cwd="/tmp/git_fast_test")
proc.stdin.write(b"commit refs/heads/master\n")
proc.stdin.write(b"committer Me <me@me.org> %d +0100\n" % int(time.time()))
# commit message
proc.stdin.write(b"data 5\n")
proc.stdin.write(b"abcde\n")
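# add file testfile with content `1234` (4 bytes, i.e. without the trailing newline that echo wrote to disk)
proc.stdin.write(b"M 644 inline testfile\n")
proc.stdin.write(b"data 4\n")
proc.stdin.write(b"1234\n")
proc.stdin.flush()
proc.stdin.close()
# wait for fast-import to finish writing the commit
proc.wait()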
However, in the repo I'm seeing this:
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
	deleted:    testfile
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	testfile
although git seems to know about testfile:
$ git show testfile
commit 85f343fa205665e7304dfbad1725b640a0d03b01 (HEAD -> master)
Author: Me <me@me.org>
Date: Thu Jan 7 08:47:39 2021 +0100
abcde
diff --git a/testfile b/testfile
new file mode 100644
index 0000000..274c005
--- /dev/null
+++ b/testfile
@@ -0,0 +1 @@
+1234
\ No newline at end of file
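To see what the index actually contains at this point, the index and the HEAD tree can be listed side by side. This is only a minimal sketch: the helper names git_lines and index_vs_head are made up for illustration, and it relies on nothing beyond git ls-files --cached and git ls-tree -r --name-only HEAD.

```python
import subprocess


def git_lines(repo, *args):
    """Run a git command inside `repo` and return its non-empty output lines."""
    out = subprocess.run(
        ["git", *args],
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def index_vs_head(repo):
    """Return (paths recorded in the index, paths recorded in the HEAD tree)."""
    in_index = git_lines(repo, "ls-files", "--cached")
    in_head = git_lines(repo, "ls-tree", "-r", "--name-only", "HEAD")
    return in_index, in_head
```

Calling index_vs_head("/tmp/git_fast_test") on the repository above should give an empty list for the index and a list containing testfile for HEAD, which would match the staged "deleted: testfile" entry in the git status output.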
So, how can I tweak my git fast-import script to make git believe that the file /tmp/git_fast_test/testfile is exactly what is stored in its index?
I found an example shell script in the original git source that should do almost exactly what I want to do, and I have the same issue with that script. So I believe this is the intended behavior of git fast-import...
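Going by the git fast-import documentation, the command writes objects and refs but does not use or alter the working directory, and the index is left as it was. Assuming that is what happens here, one possible follow-up step is to load the imported commit's tree into the index with git read-tree once the import has finished. The sketch below is not a confirmed fix; the function names sync_index and staged_paths are hypothetical.

```python
import subprocess


def sync_index(repo, rev="HEAD"):
    """Replace the index contents with the tree of `rev`.

    Runs `git read-tree <rev>`, which rewrites the index only;
    files in the working tree are not touched.
    """
    subprocess.run(["git", "read-tree", rev], cwd=repo, check=True)


def staged_paths(repo):
    """Return the paths currently recorded in the index."""
    out = subprocess.run(
        ["git", "ls-files", "--cached"],
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()
```

After sync_index("/tmp/git_fast_test"), staged_paths should list testfile. In this particular example git status would probably still report testfile as modified, because the imported blob is the 4 bytes 1234 while echo wrote 1234 followed by a newline to disk.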