22

I'm working on a project that involves the use of very sensitive data, and I've been instructed to only transmit this data online via a custom file transfer system. The project itself is under git source control and includes a sqlite file containing the sensitive data.

Up to this point, I've simply been ignoring the sqlite file via the gitignore file, which prevents it from ever being pushed to the remote repository. However, I've now reached a point in the project where we have a live version as well as a development version, and because the data is not tracked locally, working with branches has become very difficult.

So my question is: is there a way for me to keep track of the sqlite file locally, so I can have different data versions on different branches, but never have it pushed to the remote repository?

After reading this question, I considered having local-only development branches that use different gitignore files, but the fact that a git merge into the remotely shared branches would also merge changes to the gitignore file would quickly become cumbersome.

Ell Neal
  • The answers to [How do you handle sensitive data in a public git repo?](http://stackoverflow.com/questions/9556126/how-do-you-handle-sensitive-data-in-a-public-git-repo) apply here – Lazy Badger Mar 04 '12 at 17:15

3 Answers

7

Ok, so I actually came up with a better solution to this problem. My previous solution, which involved a second git repository, quickly became problematic due to the size of the sqlite files I was working with; git copes poorly with large binary files. I investigated various ways to improve git's ability to handle the files (e.g. git-bigfiles, git-annex) but nothing seemed to handle my situation elegantly.

The answer: symlinks.

N.B. This solution is pretty Unix specific, but you will probably be able to rework it for non-Unix systems.

Problem #1: Ensure that the data is never sent to the remote repository.

This one was easy. Similar to my previous solution, I store the data outside of the repository.

Root-Directory/
    My-Project/
        .git/
        Source-Code-and-Stuff/
    My-Project-Data/
        A-Big-Sqlite-File.sqlite

Because the data files aren't in the repository, there's no need to worry about them being indexed by git.

Problem #2: Different branches should reference different versions of the data.

This is where symlinks come into play. A symlink is effectively a shortcut to a file, so the idea is to put a symlink to the data file inside the repository. Git tracks a symlink as a very small entry that records only the path it points to, not the contents of the target, so different branches can have different symlinks.

To explain this, let's take an example project, which has a currently live version (1.1) on the master branch and a new version (1.2) on the version-1.2 branch. For simplicity's sake, this project only has one data file: Data.sqlite.

The data file is stored inside the My-Project-Data directory mentioned above, and versioned on the filesystem like so:

My-Project-Data/
    v1.1/
        Data.sqlite
    v1.2/
        Data.sqlite

The data file is added to the repository by using a symlink:

My-Project/
    .git/
    Source-Code-and-Stuff/
        Data-Symlink.sqlite

On the master branch, Data-Symlink.sqlite is

../../My-Project-Data/v1.1/Data.sqlite

and on the version-1.2 branch it is

../../My-Project-Data/v1.2/Data.sqlite
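To see why the data itself never enters the repository, here is a minimal sketch that rebuilds the layout above in a throwaway directory and inspects what git actually stages. The temporary directory and the placeholder file contents are assumptions made for illustration; the directory and file names follow the example.

```shell
set -eu

# Throwaway copy of the layout: Root-Directory/{My-Project,My-Project-Data}
root="$(mktemp -d)"
mkdir -p "$root/My-Project/Source-Code-and-Stuff" "$root/My-Project-Data/v1.1"
printf 'pretend this is sensitive\n' > "$root/My-Project-Data/v1.1/Data.sqlite"

cd "$root/My-Project"
git init --quiet
ln -s ../../My-Project-Data/v1.1/Data.sqlite Source-Code-and-Stuff/Data-Symlink.sqlite
git add Source-Code-and-Stuff/Data-Symlink.sqlite

# Mode 120000 is how git marks a symlink entry in the index
link_mode="$(git ls-files -s Source-Code-and-Stuff/Data-Symlink.sqlite | cut -d' ' -f1)"
# The staged blob contains only the target path, not the file contents
link_blob="$(git cat-file -p :Source-Code-and-Stuff/Data-Symlink.sqlite)"

echo "mode: $link_mode"
echo "blob: $link_blob"
```

The staged blob is just the relative path text, so a push would transfer that path and nothing from My-Project-Data.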

So when development on version 1.3 begins, the following bash script will set everything up:

# Get to the root directory
cd path/to/Root-Directory
# Enter the data directory
cd My-Project-Data
# Make a directory for the new version and enter it
mkdir v1.3
cd v1.3
# Copy the new sqlite file into it
cp ~/path/to/data/file.sqlite Data.sqlite
# Move to the project directory
cd ../../My-Project
# Create a new branch
git checkout -b version-1.3
# Move to the source code directory and delete the current symlink
cd Source-Code-and-Stuff
rm Data-Symlink.sqlite
# Create a symlink to the new data file
ln -s ../../My-Project-Data/v1.3/Data.sqlite Data-Symlink.sqlite
# Commit the change
cd ../
git add Source-Code-and-Stuff/Data-Symlink.sqlite
git commit -m "Update the symlink"
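Note that ln -s does not check that its target exists, so a mistyped directory name produces a dangling symlink without any error. A quick check before committing can catch this. The sketch below is only an illustration: it runs in a throwaway directory, and Wrong-Data-Dir and Broken-Symlink.sqlite are hypothetical names standing in for a typo.

```shell
set -eu

# Throwaway layout with a v1.3 data file (empty placeholder)
root="$(mktemp -d)"
mkdir -p "$root/My-Project/Source-Code-and-Stuff" "$root/My-Project-Data/v1.3"
: > "$root/My-Project-Data/v1.3/Data.sqlite"
cd "$root/My-Project/Source-Code-and-Stuff"

# Correct link: [ -e ] follows the symlink and finds the data file
ln -s ../../My-Project-Data/v1.3/Data.sqlite Data-Symlink.sqlite
good_target="$(readlink Data-Symlink.sqlite)"
if [ -e Data-Symlink.sqlite ]; then good_status=ok; else good_status=dangling; fi

# Hypothetical typo in the directory name: ln still succeeds, but the link dangles
ln -s ../../Wrong-Data-Dir/v1.3/Data.sqlite Broken-Symlink.sqlite
if [ -e Broken-Symlink.sqlite ]; then bad_status=ok; else bad_status=dangling; fi

echo "Data-Symlink.sqlite -> $good_target ($good_status)"
echo "Broken-Symlink.sqlite ($bad_status)"
```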

Conclusion

Obviously this isn't a perfect solution. If you're working with a team, everyone on the team will need the same relative directory layout: the symlinks use relative paths, so the absolute path to Root-Directory can differ between machines, but My-Project and My-Project-Data must sit side by side within it. But my personal opinion is that the benefits outweigh this minor caveat. In the actual project I'm using this technique with, I have an 800MB sqlite file for the data, and being able to switch between live and development branches and have my project automatically pick up the matching data file is priceless.
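The branch-switching behaviour can be tried out safely with a rough sketch like the one below, which uses a throwaway directory and tiny text files in place of the real sqlite data. The commit identity, commit messages and the small commit helper are assumptions made only so the demo runs without touching any global git configuration.

```shell
set -eu

root="$(mktemp -d)"
mkdir -p "$root/My-Project/Source-Code-and-Stuff" \
         "$root/My-Project-Data/v1.1" "$root/My-Project-Data/v1.2"
echo 'live data' > "$root/My-Project-Data/v1.1/Data.sqlite"
echo 'dev data'  > "$root/My-Project-Data/v1.2/Data.sqlite"

cd "$root/My-Project"
git init --quiet
# Name the initial branch explicitly so the demo does not depend on local defaults
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: commit with an inline identity and signing disabled
commit() {
    git -c user.name=Demo -c user.email=demo@example.invalid \
        -c commit.gpgsign=false commit --quiet -m "$1"
}

ln -s ../../My-Project-Data/v1.1/Data.sqlite Source-Code-and-Stuff/Data-Symlink.sqlite
git add Source-Code-and-Stuff/Data-Symlink.sqlite
commit "Point the symlink at the v1.1 data"

git checkout --quiet -b version-1.2
rm Source-Code-and-Stuff/Data-Symlink.sqlite
ln -s ../../My-Project-Data/v1.2/Data.sqlite Source-Code-and-Stuff/Data-Symlink.sqlite
git add Source-Code-and-Stuff/Data-Symlink.sqlite
commit "Point the symlink at the v1.2 data"

# Reading through the symlink returns whichever data the current branch points at
on_dev="$(cat Source-Code-and-Stuff/Data-Symlink.sqlite)"
git checkout --quiet master
on_master="$(cat Source-Code-and-Stuff/Data-Symlink.sqlite)"

echo "version-1.2 reads: $on_dev"
echo "master reads:      $on_master"
```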

Ell Neal
3

Track files locally, but never allow them to be pushed to the remote repository

You can't, really.

Git tracks snapshots of your repository. These snapshots are what git push and git pull transfer: if a file is in a snapshot, it's (generally) going to be included in the git push.

Your best option is to use a git submodule to hold the sensitive data. This question goes into that solution in some detail.

simont
  • As with the other answer, submodules are used when you're including another **remote** repository, but the data I'm using must be kept offline. – Ell Neal Mar 05 '12 at 12:17
  • what? You can use submodules with local repositories just fine. – Asherah May 08 '12 at 00:39
  • @Len another repository is not the answer, see my actual solution for something that solves the problem. – Ell Neal May 08 '12 at 11:09
  • @EllNeal: that's not what I'm debating here, you said "submodules are used when you're including another **remote** repository", and I said "You can use submodules with local repositories just fine". :) The answer is another question. – Asherah May 08 '12 at 13:58
0

I wanted to take a second to explain my solution to this problem:

I've created a root directory for my project: MyRootDirectory. Inside MyRootDirectory I have two directories called MyProject and MyProjectData. Both MyProject and MyProjectData are git repositories, where MyProject has a remote counterpart on GitHub, and MyProjectData is a local-only repository. In my project file (I'm using Xcode) I have references to the data files using a path like this: ../MyProjectData/MyDatabase.sqlite.

This setup allows me to have development and master branches for both the data and the project; the data is included in the built product because the project file references it, but it is never pushed up to the remote repository, as only its path is stored in the MyProject repository. Magic.
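As a rough sketch of that two-repository layout, the commands below create both repositories in a throwaway directory and confirm that only MyProject has a remote. The remote URL is a placeholder assumption (nothing is contacted), and the empty MyDatabase.sqlite stands in for the real data.

```shell
set -eu

root="$(mktemp -d)"   # stands in for MyRootDirectory
mkdir -p "$root/MyProject" "$root/MyProjectData"

git -C "$root/MyProject" init --quiet
git -C "$root/MyProjectData" init --quiet

# Only the project repository gets a remote; the URL is a placeholder
git -C "$root/MyProject" remote add origin https://example.invalid/MyProject.git

# The data file lives in the sibling repository, outside MyProject's work tree
: > "$root/MyProjectData/MyDatabase.sqlite"

project_remotes="$(git -C "$root/MyProject" remote)"
data_remotes="$(git -C "$root/MyProjectData" remote)"
project_status="$(git -C "$root/MyProject" status --porcelain)"

echo "MyProject remotes:     ${project_remotes:-none}"
echo "MyProjectData remotes: ${data_remotes:-none}"
echo "MyProject changes:     ${project_status:-none}"
```

Because MyProjectData has no remote configured and its files sit outside MyProject's work tree, a push from MyProject has nothing of the data to send.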

Ell Neal