
First, I know my question has already been asked and dismissed as Git abuse (see for example this question). Let me explain my use case.

I periodically need to run (as a cron job) some lengthy special processing on a source version in the repository. In order not to disturb the repo, I extract the desired version. Unfortunately, there is nothing like export in Git. The documentation recommends using archive for that. Since I want an exact copy of the source tree (as it appears in the working directory), I must untar the archive into some target location.

In fact, this really looks like a checkout, but checkout also changes the index, which then causes mayhem for a user surprised by the unexpected change.

Many answers recommend cloning the repo in this situation and then experimenting harmlessly on the clone. I don't want to proceed that way because I must extract many versions simultaneously and I don't want to waste repo storage on each copy (think for example of monster repos like the Linux kernel).

Nor do I want to use git worktree, because my copies will be badly tweaked and I don't want to run the risk of any commit from these fancy copies making it back into the repo. The copies must be forgotten by Git as soon as they are made.

I finally implemented an export equivalent as a short script:

# remember where HEAD currently points (e.g. refs/heads/master)
ref=$(git rev-parse --symbolic-full-name HEAD)
# populate <somewhere> from the desired version; this also rewrites the index
git --work-tree=<somewhere> checkout -f -q <branch_or_tag> -- '*'
# put the index back in sync with the saved position
git reset "$ref"

The first command saves the present position (the full name of the ref HEAD points to). The second checks out the desired version without changing HEAD, but it simultaneously sets the index to the checked-out commit. The third restores the initial position.

This works fine against a bare repository, since you are not supposed to commit there, only to push or pull. Apart from the fact that it creates an index file, the repo is apparently not disturbed.

However, if the script is run against a local repository (one with an associated working directory), there is a small risk during its lifetime: checkout, though fast, is not instantaneous, and the index stays modified until reset completes. If any commit is attempted during this time frame, the repo will be polluted with faulty commits, because the index is not what the user expects.

Consequently, I ask @schuess's question again (see the link above):

Is there a way to lock a Git repository to prevent any access?

The lock will be short-lived. It should protect against any state change in the repository.

Presently I live without a lock, but sooner or later I will be caught out; therefore I prefer to guard against this race condition.

Reminder: I am perfectly aware that I'm trying to play tricks behind Git's back and that I should not do that. A better solution would certainly be to implement a true export script that does not rely on checkout. See also above for why I don't use clone.

ajlittoz
  • Thought about revoking permissions temporarily? – Marged Jul 02 '17 at 15:08
  • @Marged: Might be tricky, because the cron job may be running under the same user/group – ajlittoz Jul 02 '17 at 15:23
  • You can checkout to a different directory (work tree). But that's equivalent to an untarred archive. – phd Jul 02 '17 at 16:07
  • @phd: this is what I do with my "export" script, but, as a side-effect, it changes the index and I need to revert it to its previous state. If the index can be left out of the way, my question is solved. – ajlittoz Jul 02 '17 at 16:34

1 Answer


This answer comes in two parts.

How to get what you want

[using] git --work-tree=<somewhere> checkout -f -q <branch_or_tag> ...

works fine against a bare repository ...

Yes. There is a caveat or two:

Apart from the fact that you create an index file, the repo is apparently not disturbed.

In fact, this does not necessarily create one: there may already be an index. If there is, Git uses the existing index to optimize the checkout. This can be good or bad.

Specifically, this is the technique some people use in deployment scripts: a push to the bare repository triggers a Git hook that uses git checkout to update the deployed branch. The work-tree is supplied as a constant string via --work-tree. The index then tracks the content of that work-tree, and no other work-tree.
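
For illustration, a minimal sketch of such a hook; the deployment directory /srv/deploy and the branch name master are hypothetical placeholders, not something from the question:

#!/bin/sh
# hooks/post-receive in the bare repository: refresh the deployed
# tree after every push. /srv/deploy must already exist.
GIT_WORK_TREE=/srv/deploy git checkout -f master

Every run of this hook reuses the bare repository's single index file, which is exactly the sharing described above: fine as long as that index only ever tracks /srv/deploy.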

There is a good way to handle this issue, and it's the same one used by git worktree: assign one index per work-tree. That particular index then tracks only that one particular work-tree. As with any index, you can wipe it out entirely and let Git rebuild it later (if ever) as long as you are OK with losing any changes you have stored in it so far.

You can make your own index by creating a unique path (e.g., with mktemp) and setting that path in the environment variable GIT_INDEX_FILE, as described in the ENVIRONMENT VARIABLES section of the git(1) front-end documentation.
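
For instance, a sketch of the question's export script reworked to use a throwaway index; the mktemp -d usage is my own choice, and <somewhere> and <branch_or_tag> are the same placeholders as in the question:

#!/bin/sh
# Use a private, disposable index so the repository's real index
# is never touched; no "git reset" is needed afterwards.
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT
GIT_INDEX_FILE=$tmpdir/index \
    git --work-tree=<somewhere> checkout -f -q <branch_or_tag> -- '*'

Since the real index is never modified, the race window between checkout and reset disappears entirely, along with the need for any lock.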

[Edit: I've removed the second caveat, as I see you are using the non-HEAD-updating form of git checkout: git checkout <tree-ish> -- <paths>.]

Why (probably) none of this matters

You mention:

Since I want an exact copy of the source tree (as it appears in the working directory), I must untar the archive into some target location.

The cost of this, vs doing git checkout directly, is pretty low, at least on any modern system: git archive ... | tar -C path -xf - likely uses a bit more CPU than git checkout, but spends all its time waiting for disk I/O anyway. (The pipes use in-memory "I/O" and hence run at memory speed rather than I/O-device speed.) The only thing git archive does besides add a bit of overhead is obey any special archiving rules. These special rules are the advantage to using git archive, and of course these special rules are the disadvantage to using git archive.
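
Concretely, the whole export is one pipeline (a sketch; <branch_or_tag> and <somewhere> are the question's placeholders, and <somewhere> must already exist, since tar -C does not create it):

# Export without touching any index; export-ignore / export-subst
# attributes, if any, are honored here.
git archive --format=tar <branch_or_tag> | tar -C <somewhere> -xf -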

Many answers recommend cloning the repo in this situation and then experimenting harmlessly on the clone. I don't want to proceed that way because I must extract many versions simultaneously and I don't want to waste repo storage on each copy (think for example of monster repos like the Linux kernel).

A local clone (using path names, or using --local) uses hard links and therefore adds no extra space. This does assume that hard links are possible (i.e., that the source and the clone live on the same file system).

You can also or instead use --shared to avoid copying the object database. You can even use --reference to obtain and share one copy across a network: i.e., the "main" repository can live on machine M (for master), and you keep a "reference copy" duplicate on your machine L (for local). You then use git clone --reference to make your temporary clone use the reference clone's object database. Both of these techniques assume that you will not remove objects from the --shared or --reference repository for the duration of the clone that is borrowing the object database (this is where hard links are superior, as no such assumption is required for them).
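
As sketches (all repository paths, and the host M, are hypothetical):

# Hard-linked clone: near-zero extra space, same file system required.
git clone --local /repos/linux.git /tmp/export-1

# Shared clone: borrows the source's object database through an
# "alternates" entry instead of copying or linking it.
git clone --shared /repos/linux.git /tmp/export-2

# Reference clone: fetch from machine M, but borrow objects from a
# local reference copy so little or nothing crosses the network.
git clone --reference /repos/linux-ref.git ssh://M/repos/linux.git /tmp/export-3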

torek
  • Thanks for the leads. I'll explore `--shared` or `--local`. As you have guessed, I only need read-only, transparent (i.e. not disturbing) access to a repository; no `commit`, `rm` or other "writing" command. This can be the right direction. – ajlittoz Jul 02 '17 at 17:37
  • Hi @torek, I've read the manual about `--local`, `--shared` and `--reference`. They are all options to `clone`, meaning I duplicate the initial repository (with a varying amount of copied data). This is the very thing I want to avoid, so as not to be confused by a plethora of repos. – ajlittoz Jul 03 '17 at 08:52
  • I'll give the piped `archive` suggestion a try and measure how it behaves, time-wise. – ajlittoz Jul 03 '17 at 08:53
  • The --local option is the default when cloning on a local file system. --local copies the object database using hard links. Even if a hard link in the clone source is deleted (very rare anyway), the file data lives on as long as at least one hard link points to it, so the destination clone's hard link is guaranteed safe. Also, a fundamental property of Git is that it does not change the content of existing objects in the database; instead it creates new objects and adds them. – Craig Hicks May 11 '18 at 04:35
  • The result is that no matter what git actions are performed on the source clone after the "git clone" operation is completed, the destination clone remains uncorrupted. – Craig Hicks May 11 '18 at 04:36
  • @CraigHicks: all of the above is true. However, if the goal is to save *disk space*, note that even if you have hard links to packs and loose objects, if the new clone is *repacked*, you will increase disk space usage once the new clone replaces the existing packs and loose objects with its newly built pack. The destination remains intact but the `--local` has become ineffectual. (The same holds for `--shared` and `--reference` as well.) All of this is predicated on file system support for hard links. – torek May 11 '18 at 05:24
  • I guess it's true that the sharing becomes ineffectual when either the new or the old clone is compacted. But the new one is read-only, so generally no compaction there. Compaction is rare, even on the old one. Running compaction before cloning might reduce the chance further. It seems in this case the new clone is temporary, so compaction is not usually an issue. – Craig Hicks May 11 '18 at 07:02