How do I store sensitive files in my repo without tracking them?

Question

I have an untracked JSON settings file containing sensitive authentication data required on my local development machine and on my production server. After every build deployment on the server, I have to painstakingly edit it with the proper credential information.

I prefer to store the json file with empty fields in my repo so the credentials can be later filled out either by my build script, or filled out by another developer after the repo is initialized:

{
     "username": "",  // to be filled by build or user
     "password": ""   // ditto gitto
}

Of course, the file should not be tracked for security reasons. I've encountered a few ways to achieve this with Git in the past, but they all require the user to actively perform additional steps, such as adding the file to .gitignore, which poses a security risk if not performed. Have there been any new features in Git to address this issue?

You might want to check https://stackoverflow.com/questions/25401432/github-store-files-encrypted — Vítor Lourenço, Feb 24 '21 at 17:56
@evolutionxbox Filling it out is the easy part, but creating the file without having it tracked automatically is the part I am having difficulty with. — ATL_DEV, Feb 24 '21 at 18:07
That’s why using environment variables is a good idea. They’re not added to the Git repo — evolutionxbox, Feb 24 '21 at 18:17
I think this is the closest thing: git update-index --no-skip-worktree but I don't understand how it works. — ATL_DEV, Feb 24 '21 at 18:19
@Zoe, thanks but I don't want to save my credentials, but don't want my work dir version wiped out either, just want a template in the correct folder. — ATL_DEV, Feb 26 '21 at 21:26

score 1 · Accepted Answer · answered Feb 25 '21 at 00:39

The answer to the question in the subject line:

How do I store sensitive files in my repo without tracking them?

is: you don't.

The reason is simple: Git builds new commits from whatever is in Git's index. The index, a.k.a. the staging area, holds the copies of files that will go into your next commit. It's initially filled in by copying out the files from the current commit. These same files are copied to your working tree so that you can see and work on them.¹ Then, as you modify your working tree copies, you run git add to copy the working tree versions back into Git's index, so that the proposed next commit is also updated.

A tracked file is one that is in Git's index. It is therefore proposed to be in your next commit. If you untrack the file (by removing it from Git's index), it is proposed that the next commit should omit that file, i.e., the file is deleted between the two commits.

The answer—well, an answer—to the question inside the text:

I prefer to store the json file with empty fields in my repo so the credentials can be later filled out either by my build script, or filled out by another developer after the repo is initialized:
{
     "username": "",  // to be filled by build or user
     "password": ""   // ditto gitto
}

is to use Git's smudge and clean filter mechanism so that the stored file, in Git, omits the sensitive data, while the working tree copy of that same file—the data that you can see in a file-viewer and edit in an editor—shows it.

The smudge and clean filter mechanism is a little tricky, and carelessness can result in the sensitive data winding up in the repository.

I've encountered a few ways to achieve this with Git in the past, but they all require the user to actively perform additional steps ...

Setting up the smudge and clean filters has this same problem. Once set up, though, the clean filter can take the working tree copy, which has the sensitive data, and strip that sensitive data out of the file-contents as the file is copied from your working tree into Git's index. So the proposed next commit does not have the sensitive data. The smudge filter can put the sensitive data back into the file as it's copied from a commit, or from Git's index, to your working tree copy. (Of course your smudge filter needs to get the sensitive data from somewhere. So: where are you keeping the actual data? Why not keep it there and only there?²)
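As a rough illustration of the clean and smudge steps described above, here is a minimal Python sketch. It assumes the settings file is strict JSON (no // comments), that the sensitive keys are the question's username and password, and that the real values live in environment variables named APP_USERNAME and APP_PASSWORD. Those variable names, and the script itself, are hypothetical; they are not part of Git.

```python
import json
import os
import sys

# Assumption: the sensitive fields are the two from the question's example.
SENSITIVE_KEYS = ("username", "password")


def clean(text: str) -> str:
    """Working tree -> index: blank out the sensitive fields."""
    data = json.loads(text)
    for key in SENSITIVE_KEYS:
        if key in data:
            data[key] = ""
    return json.dumps(data, indent=4) + "\n"


def smudge(text: str, env=os.environ) -> str:
    """Index/commit -> working tree: fill the fields back in."""
    data = json.loads(text)
    for key in SENSITIVE_KEYS:
        if key in data:
            # APP_USERNAME / APP_PASSWORD are assumed names, not a Git convention.
            data[key] = env.get("APP_" + key.upper(), "")
    return json.dumps(data, indent=4) + "\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Git pipes the file content in on stdin and reads the result from stdout.
    # The mode ("clean" or "smudge") is passed as the first argument.
    selected = smudge if sys.argv[1] == "smudge" else clean
    sys.stdout.write(selected(sys.stdin.read()))
```

One way to wire this up, assuming the script is saved as secrets_filter.py and the filter is named secrets (both names are made up): run git config filter.secrets.clean "python3 secrets_filter.py clean" and git config filter.secrets.smudge "python3 secrets_filter.py smudge", then add the line settings.json filter=secrets to .gitattributes. The git config step has to be repeated in every clone, which is the same "additional steps" problem mentioned above.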

In general, then, the right answer is: don't put this stuff in the repo at all. Instead of a json file that needs to be filled-in, supply an example (or "template") json file, or keep that data in some other file.
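A minimal sketch of the template approach, assuming the repository tracks a file named settings.example.json and lists settings.json in a committed .gitignore (both file names are hypothetical). A build script or a new developer could run something like this once; it copies the template only if no real settings file exists yet, so filled-in credentials are never overwritten.

```python
import shutil
from pathlib import Path


def ensure_settings(template: Path, target: Path) -> bool:
    """Copy the tracked template to the untracked settings file if missing.

    Returns True if a new file was created, False if one already existed.
    """
    if target.exists():
        return False  # never overwrite credentials that were already filled in
    shutil.copyfile(template, target)
    return True


if __name__ == "__main__":
    # Hypothetical file names; adjust to the project's layout.
    template_path = Path("settings.example.json")
    if template_path.exists():
        created = ensure_settings(template_path, Path("settings.json"))
        print("created settings.json" if created else "settings.json already exists")
```

Because .gitignore is itself a tracked file, the ignore rule travels with the repository, so other developers do not have to add it by hand.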

¹The difference between Git's index copy of a file, and your working tree copy of the same file, is ... well, see the smudge and clean filter stuff as well, but the important difference to Git itself is that the copy in Git's index is already in the special format that Git uses to store files. This format is compressed and de-duplicated and does not use the storage system that your OS uses. It can therefore hold files whose names your OS cannot pronounce, as it were, depending on your OS. It's also very fast to commit: it doesn't require scanning through the data to compress and de-duplicate it, for instance.

²Convenience, stubbornness, spite, obstinacy ... there are lots of good reasons!

(By the way, I mean this seriously. Git can be actively user hostile, especially to Git Newbies; I've devoted significant hours of my life to demystifying it because of this. Between Git and Hg, Hg clearly had a better UX, but it lost the Popularity Battle for whatever reason, and we're all in this thing now. As a "tech guy" I know I can't design user interfaces, but I can tell when we're in a tech nightmare.) — torek, Feb 25 '21 at 20:58
I've used Git many times and it has always confused the crap out of me to the point of extreme frustration. I always find myself relearning it every time I use it. Its proponents claim you only need to know a few commands, but you can tell they are not power users. Git is a poster child for leaky abstractions. Its interface is really a low-level API exposed through a command-line interface. The GUIs don't offer much benefit over the CLI either. They merely convert the CLI commands into UI elements, but do provide a better visualization of your repo and its status. — ATL_DEV, Feb 26 '21 at 16:22
Mercurial is a ton easier to understand. It abstracts away unnecessary details. You don't need to understand its internal data structures and algorithms. A merge in Mercurial is simple, but Git overwhelms you with different types and options: rebase, fast forward, ours, theirs, etc. While it offers more control, it comes at the cost of a steep learning curve. There's no reason, for instance, to have reset and a revert when a revert will suffice. Don't want to see the reverse commits? They shouldn't be shown anyway as they're unimportant. — ATL_DEV, Feb 26 '21 at 17:01
