git fast-import for existing files in the repository

Question

I have a local git repository and a large number of files (~4 million, total size ~700GB) that I want to check into git. Using git filters, I want to track not the files' real contents but only a reference to each file (similar to what git lfs does). Adding and committing the files (in chunks) still takes a very long time, and I'm hoping to reduce that time by using git fast-import.

I can't figure out, though, how to exactly replicate git add <file> && git commit -m <message> using git fast-import. Let's consider the following situation:

mkdir /tmp/git_fast_test && cd /tmp/git_fast_test
git init
echo "1234" > testfile

Now I run the following Python script, which commits a file testfile with mode 644 and content 1234 to the git repo. This should now correspond exactly to /tmp/git_fast_test/testfile.

import subprocess
import time

proc = subprocess.Popen(["git", "fast-import"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, cwd="/tmp/git_fast_test")
proc.stdin.write(b"commit refs/heads/master\n")
proc.stdin.write(b"committer Me <me@me.org> %d +0100\n" % int(time.time()))

# commit message
proc.stdin.write(b"data 5\n")
proc.stdin.write(b"abcde\n")

# add file testfile with content `1234` (no trailing newline)
proc.stdin.write(b"M 644 inline testfile\n")
proc.stdin.write(b"data 4\n")
proc.stdin.write(b"1234\n")

proc.stdin.flush()
proc.stdin.close()
proc.wait()  # let fast-import finish before inspecting the repo

However, in the repo I'm seeing this:

$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    deleted:    testfile

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    testfile

although git seems to know about testfile:

$ git show testfile
commit 85f343fa205665e7304dfbad1725b640a0d03b01 (HEAD -> master)
Author: Me <me@me.org>
Date:   Thu Jan 7 08:47:39 2021 +0100

    abcde

diff --git a/testfile b/testfile
new file mode 100644
index 0000000..274c005
--- /dev/null
+++ b/testfile
@@ -0,0 +1 @@
+1234
\ No newline at end of file

So, how can I tweak my git fast-import script to make git believe that the file /tmp/git_fast_test/testfile is exactly what is stored in its index?

I found an example shell script in the original git source that should do almost exactly what I want, and I have the same issue with that script. So I believe this is the intended behavior of git fast-import...

(note : I don't know the internals of fast-import) Perhaps the issue is that the index is not up to date ? have you tried running `git update-index --refresh` or `git update-index --really-refresh` ? — LeGEC, Jan 07 '21 at 09:56
I haven't, however `git reset HEAD` after the fast-import seems to clear everything up nicely and yields the desired result. I'm not sure what causes the delete stages, but at least that's a workaround. — janoliver, Jan 07 '21 at 10:18

score 1 · Accepted Answer · answered Jan 07 '21 at 14:04

LeGEC's comment is in fact the right answer: fast-import bypasses the normal index-and-work-tree system, and when git fast-import exits, you have a bunch of commits but Git's index does not match any of them. If you just created the repository, Git's index is completely empty, so the proposed next commit is that of an empty tree. The comparison with the current commit will therefore say "to modify the current commit to make it match the proposed next commit, delete every file".
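One way to see this mismatch directly is to list what the index tracks next to what the current commit contains. The following is only a minimal sketch: the helper name tracked_paths is made up for illustration, and it assumes git is on PATH and that HEAD already points at the imported branch.

```python
import subprocess


def tracked_paths(repo):
    """Return (paths in the index, paths in HEAD's tree) for `repo`.

    Hypothetical helper for illustration; assumes HEAD resolves to a commit.
    """
    def lines(*args):
        out = subprocess.run(
            ["git", *args], cwd=repo, check=True, stdout=subprocess.PIPE
        ).stdout.decode()
        return set(out.splitlines())

    in_index = lines("ls-files")                            # what the index tracks
    in_head = lines("ls-tree", "-r", "--name-only", "HEAD")  # what the commit holds
    return in_index, in_head
```

Right after a fast-import into a freshly created repository, the first set should be empty while the second contains testfile, which is why git status reports the file as a staged deletion.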

The fix is to run git reset (or git restore or git read-tree) to load Git's index. You can optionally reset your work-tree at this time as well.
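As a rough sketch of the whole sequence (import, then reload the index), something like the following could work. The helper names git and import_then_reset are hypothetical, and the sketch assumes that refs/heads/master is the branch HEAD points at. Note that it writes the content as "1234\n", because echo "1234" appends a newline; the script in the question imports "1234" without one, so with that script the file would still show up as modified after the reset.

```python
import os
import subprocess
import time


def git(repo, *args, stdin=None):
    """Run a git command in `repo` and return its stdout as text."""
    result = subprocess.run(
        ["git", *args], cwd=repo, input=stdin,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True,
    )
    return result.stdout.decode()


def import_then_reset(repo, path="testfile", content=b"1234\n"):
    """Commit `content` at `path` via fast-import, then reload the index.

    Returns `git status --porcelain` output before and after the reset.
    """
    # The on-disk file that the imported commit is supposed to mirror.
    with open(os.path.join(repo, path), "wb") as fh:
        fh.write(content)

    message = b"abcde"
    stream = b"".join([
        b"commit refs/heads/master\n",
        b"committer Me <me@me.org> %d +0100\n" % int(time.time()),
        b"data %d\n" % len(message), message, b"\n",
        b"M 644 inline %s\n" % path.encode(),
        b"data %d\n" % len(content), content, b"\n",
    ])
    git(repo, "fast-import", "--quiet", stdin=stream)

    before = git(repo, "status", "--porcelain")
    # Load the index from the imported commit; the work tree is left alone.
    git(repo, "reset", "--mixed", "refs/heads/master")
    after = git(repo, "status", "--porcelain")
    return before, after
```

Before the reset, the status output should show the staged deletion plus the untracked file, as in the question; afterwards it should be empty, provided the imported blob matches the file on disk byte for byte.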

Thank you for the explanation. That's what I get for using something advanced like `fast-import` without really knowing the internals of git. ;) `git reset --mixed HEAD` is what I'm using now. — janoliver, Jan 08 '21 at 10:22
