Git 2.24 (Q4 2019) makes git clean
more robust when it comes to nested Git repositories (not just folders)
See commit 69f272b (01 Oct 2019), and commit 902b90c, commit ca8b539, commit 09487f2, commit e86bbcf, commit 3aca580, commit 29b577b, commit 89a1f4a, commit a3d89d8, commit 404ebce, commit a5e916c, commit bbbb6b0, commit 7541cc5 (17 Sep 2019) by Elijah Newren (newren
).
(Merged by Junio C Hamano -- gitster
-- in commit aafb754, 11 Oct 2019)
clean
: avoid removing untracked files in a nested Git repository
Users expect files in a nested git repository to be left alone unless sufficiently forced (with two -f
's).
Unfortunately, in certain circumstances, git would delete both tracked (and possibly dirty) files and untracked files within a nested repository.
To explain how this happens, let's contrast a couple cases.
First, take the following example setup (which assumes we are already within a git repo):
git init nested
cd nested
>tracked
git add tracked
git commit -m init
>untracked
cd ..
In this setup, everything works as expected; running 'git clean -fd
' will result in fill_directory()
returning the following paths:
nested/
nested/tracked
nested/untracked
and then correct_untracked_entries()
would notice this can be compressed to:
nested/
and then since "nested/
" is a directory, we would call remove_dirs("nested/", ...)
, which would check is_nonbare_repository_dir()
and then decide to skip it.
However, if someone also creates an ignored file:
>nested/ignored
then running 'git clean -fd
' would result in fill_directory()
returning
the same paths:
nested/
nested/tracked
nested/untracked
but correct_untracked_entries()
will notice that we had ignored entries
under nested/ and thus simplify this list to
nested/tracked
nested/untracked
Since these are not directories, we do not call remove_dirs()
which was the only place that had the is_nonbare_repository_dir()
safety check -- resulting in us deleting both the untracked file and the tracked (and possibly dirty) file.
One possible fix for this issue would be walking the parent directories of each path and checking if they represent nonbare repositories, but that would be wasteful.
Even if we added caching of some sort, it's still a waste because we should have been able to check that "nested/" represented a nonbare repository before even descending into it in the first place.
Add a DIR_SKIP_NESTED_GIT
flag to dir_struct.flags
and use it to prevent fill_directory()
and friends from descending into nested git repos.
With this change, we also modify two regression tests added in commit 91479b9 ("t7300
: add tests to document behavior of clean and nested git", 2015-06-15, Git v2.6.0-rc0).
That commit, nor its series, nor the six previous iterations of that series on the mailing list discussed why those tests coded the expectation they did.
In fact, it appears their purpose was simply to test existing behavior to make sure that the performance changes didn't change the behavior.
However, these two tests directly contradicted the manpage's claims that two -f
's were required to delete files/directories under a nested git repository.
While one could argue that the user gave an explicit path which matched files/directories that were within a nested repository, there's a slippery slope that becomes very difficult for users to understand once you go down that route (e.g. what if they specified "git clean -f -d '*.c'
"?)
It would also be hard to explain what the exact behavior was; avoid such problems by making it really simple.
Finally, there are still a couple bugs with -ffd
not cleaning out enough (e.g. missing the nested .git
) and with -ffdX
possibly cleaning out the wrong files (paying attention to outer .gitignore
instead of inner).
This patch does not address these cases at all (and does not change the behavior relative to those flags), it only fixes the handling when given a single -f
.
See this thread for more discussion of the -ffd[X?]
bugs.
With Git 2.25.1 (Feb. 2020), a corner case bugs in "git clean
" that stems from a (necessarily for performance reasons) awkward calling convention in the directory enumeration API has been corrected.
See commit 0cbb605, commit ad6f215 (16 Jan 2020) by Jeff King (peff
).
See commit 2270533 (16 Jan 2020) by Elijah Newren (newren
).
See commit f365bf4 (16 Jan 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 7ab963e, 05 Feb 2020)
dir
: treat_leading_path()
and read_directory_recursive()
, round 2
Signed-off-by: Elijah Newren
I was going to title this "dir
: more synchronizing of treat_leading_path()
and read_directory_recursive()
", a nod to commit 777b42034764 ("dir
: synchronize treat_leading_path()
and read_directory_recursive()
", 2019-12-19, Git v2.25.0-rc0 -- merge), but the title was too long.
Anyway, first the backstory...
fill_directory()
has always had a slightly error-prone interface: it returns a subset of paths which might match the specified pathspec; it was intended to prune away some paths which didn't match the specified pathspec and keep at least all the ones that did match it.
Given this interface, callers were responsible to post-process the results and check whether each actually matched the pathspec.
builtin/clean.c
did this.
It would first prune out duplicates (e.g. if "dir
" was returned as well as all files under "dir/
", then it would simplify this to just "dir
"), and after pruning duplicates it would compare the remaining paths to the specified pathspec(s).
This post-processing itself could run into problems, though, as noted in commit 404ebceda01c ("dir
: also check directories for matching pathspecs", 2019-09-17, Git v2.24.0-rc0 -- merge listed in batch #8):
For the case of git clean
and a set of pathspecs of "dir/file
" and "more
", this caused a problem because we'd end up with dir entries for both: "dir
" and "dir/file
"
Then correct_untracked_entries()
would try to helpfully prune duplicates for us by removing "dir/file
" since it's under "dir
", leaving us with "dir
".
Since the original pathspec only had "dir/file
", the only entry left doesn't match and leaves nothing to be removed.
(Note that if only one pathspec was specified, e.g. only "dir/file
", then the common_prefix_len optimizations
in fill_directory
would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.)
That commit fixed the issue -- when multiple pathspecs were specified -- by making sure fill_directory()
wouldn't return both "dir
" and "dir/file
" outside the common_prefix_len
optimization path.
This is where it starts to get fun.
In commit b9670c1f5e6b ("dir
: fix checks on common prefix directory", 2019-12-19, Git v2.25.0-rc0 -- merge), we noticed that the common_prefix_len
wasn't doing appropriate checks and letting all kinds of stuff through, resulting in recursing into .git/ directories and other craziness.
So it started locking down and doing checks on pathnames within that code path.
That continued with commit 777b42034764 ("dir
: synchronize treat_leading_path()
and read_directory_recursive()
", 2019-12-19, Git v2.25.0-rc0 -- merge), which noted the following:
Our optimization to avoid calling into read_directory_recursive()
when all pathspecs have a common leading directory mean that we need to match the logic that read_directory_recursive()
would use if we had just called it from the root.
Since it does more than call treat_path()
, we need to copy that same logic.
...and then it more forcefully addressed the issue with this wonderfully ironic statement:
Needing to duplicate logic like this means it is guaranteed someonewill eventually need to make further changes and forget to update both locations.
It is tempting to just nuke the leading_directory
special casing to avoid such bugs and simplify the code, but unpack_trees
' verify_clean_subdirectory()
also calls read_directory()
and does so with a non-empty leading path, so I'm hesitant to try to restructure further.
Add obnoxious warnings to treat_leading_path()
and read_directory_recursive()
to try to warn people of such problems.
You would think that with such a strongly worded description, that its author would have actually ensured that the logic in treat_leading_path()
and read_directory_recursive()
did actually match and that everything that was needed had at least been copied over at the time that this paragraph was written.
But you'd be wrong, I messed it up by missing part of the logic.