0

Say you have a 100 MB text file and you want to commit changes to it periodically with Git. The changes are small and frequent.

Is there any efficient way of handling this with Git?

The normal way of staging and committing the file will cause git to read & write the entire file again, irrespective of how small your change is.

Is there a way of making a commit using only a "diff" of the changes?

Pragy Agarwal
  • What is your use case? Log files? – LeGEC May 20 '20 at 11:33
  • @LeGEC I'm trying to have version control at the database level for a note-taking application. I've built my own VCS for my use case, but its workings are very similar to git. I therefore have this nagging thought that I should just use git, since it has a huge ecosystem and wide tooling. – Pragy Agarwal May 20 '20 at 16:00
  • "Note-taking": I imagine 100 MB is the entire db, not a single note? If you dump each note into a separate file on disk, that would probably be a better match for git versioning. – LeGEC May 20 '20 at 16:59
  • Out of curiosity: how many commits per second are you really looking at? – LeGEC May 20 '20 at 17:01

3 Answers

3

Is there any efficient way of handling this with Git?

No.

The hash ID of any Git object is a cryptographic checksum of its contents. You could speed up the computation a bit by saving the intermediate checksum state for the first N megabytes: if you then change some bytes 50 MB into the 100 MB object, you can compute the new blob checksum starting from the saved 50 MB state, and so hash only about half of the data. But you'll still need to either store the entire loose object or implement your own pack-file algorithm as well.
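As a rough illustration of the saved-checksum idea, here is a minimal Python sketch. It assumes the default SHA-1 object format, where a blob ID is the SHA-1 of a `blob <size>\0` header followed by the file content. The helper names (`git_blob_id`, `checkpoint`, `resume`) are made up for this example and are not part of Git. Because the size is part of the header, a saved state is only reusable when the edit keeps the file length unchanged.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def checkpoint(data: bytes, prefix_len: int):
    """Hash the header plus the first prefix_len bytes and return that state."""
    state = hashlib.sha1(b"blob %d\x00" % len(data))
    state.update(data[:prefix_len])
    return state


def resume(state, tail: bytes) -> str:
    """Finish a blob ID from a saved state, hashing only the remaining bytes."""
    h = state.copy()  # copy so the saved checkpoint stays reusable
    h.update(tail)
    return h.hexdigest()


original = b"a" * 1000
saved = checkpoint(original, 500)

# Same-length edit located after the checkpointed prefix.
edited = original[:700] + b"b" + original[701:]

# Only the last 500 bytes are hashed again, yet the ID matches a full rehash.
assert resume(saved, edited[500:]) == git_blob_id(edited)

print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

This only shortens the hashing step. Writing the object is a separate cost, since the full content still has to be stored.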

Git is much better at handling a large number of small files. For instance, instead of one 100 MB file, you could store 1000 files of 100 kB each. If you need to modify some bytes in the middle, you're then changing only a single file, or at most two files, each of which is smaller and becomes a smaller loose object that can be checksummed relatively quickly.
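To make the "one or at most two files" point concrete, here is a small Python sketch. The fixed-size split and the function names are assumptions for illustration only. A fixed-size split suits same-length, in-place edits, because an insertion would shift every later chunk. For a notes database, a natural boundary such as one file per note would likely behave better.

```python
from typing import List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def changed_chunks(old: bytes, new: bytes, chunk_size: int) -> List[int]:
    """Return the indexes of chunks that differ (assumes old and new have equal length)."""
    old_chunks = split_into_chunks(old, chunk_size)
    new_chunks = split_into_chunks(new, chunk_size)
    return [i for i, (a, b) in enumerate(zip(old_chunks, new_chunks)) if a != b]


old = bytes(1000)                             # stand-in for the large file
new = old[:95] + b"x" * 10 + old[105:]        # 10-byte edit straddling a boundary

print(len(split_into_chunks(old, 100)))       # 10 chunks
print(changed_chunks(old, new, 100))          # [0, 1]: only two chunks need new blobs
```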

torek
2

There are two storage formats for Git objects: loose and packed. When you first add and commit a file, Git writes a new loose object, which is a full copy of the blob. But Git can also turn this into a packed object (e.g. when pushing), which can be stored as a diff against a similar object. See the answers here: What are the "loose objects" that the Git GUI refers to?.
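For a sense of what a loose object holds, the following Python sketch builds one in memory. It assumes the default SHA-1 object format and uses a made-up helper name, `encode_loose_object`. It writes nothing to a repository and only mirrors the on-disk layout: the zlib-compressed header plus full content, stored under a path derived from the object ID.

```python
import hashlib
import zlib


def encode_loose_object(data: bytes):
    """Return (object ID, path relative to .git, compressed bytes) for a blob."""
    store = b"blob %d\x00" % len(data) + data    # header + full file content
    oid = hashlib.sha1(store).hexdigest()
    rel_path = "objects/%s/%s" % (oid[:2], oid[2:])
    return oid, rel_path, zlib.compress(store)


v1 = b"line\n" * 1000
v2 = v1 + b"one more line\n"                     # a small change to a larger file

oid1, path1, packed1 = encode_loose_object(v1)
oid2, path2, packed2 = encode_loose_object(v2)

# Each version becomes a separate object holding the whole content, not a delta.
print(path1)
print(path2)
print(zlib.decompress(packed2).endswith(v2))     # True
```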

After committing the file, you can run git gc so that Git packs the objects and removes the old loose ones. I'm not sure whether it removes the old object right away or only after some time.

Stanislav Bashkyrtsev
  • Is there a way to directly commit the differences, without having to first commit the entire file, have git recompute its hash and store it? My motivation here is efficiency. I need to perform many commits per second, involving a very large file. The changes in each commit are very small – Pragy Agarwal May 20 '20 at 08:31
  • My guess is that it's not possible. But let's wait and see if someone else has ideas. It sounds like Git might be the wrong solution for your problem. Also, you may want to work with Git through a programmatic API (e.g. JGit), which could give you more control. And the last resort is to write these files yourself in the format that Git expects :) – Stanislav Bashkyrtsev May 20 '20 at 08:36
  • I'm curious - what is the task that requires you to perform many commits per second? – 1615903 May 20 '20 at 08:51
  • @1615903 I'm trying to have version control at the database level for a note-taking application. I've built my own VCS for my use case, but its workings are very similar to git. I therefore have this nagging thought that I should just use git, since it has a huge ecosystem and wide tooling. – Pragy Agarwal May 20 '20 at 16:01
1

git will indeed read the entire content of the file, for example to compute its hash or to diff the file against another version.

For storage, however, git already has a "diff" storage format. You can explicitly ask git to pack objects by running git gc.


If you need performance:

  • use a program that computes the diff, and store only the diffs in git, or
  • consider that git may not be the appropriate tool for your use case.
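A minimal sketch of the first option, using Python's standard difflib to produce a unified diff that could be committed as its own small file. The function name `make_patch` and the file name `notes.txt` are placeholders. Note that computing the diff this way still reads both versions in full, so it reduces what git has to store and hash, not the cost of producing the diff.

```python
import difflib


def make_patch(old_text: str, new_text: str, name: str = "notes.txt") -> str:
    """Return a unified diff turning old_text into new_text ('' if identical)."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="a/" + name,
        tofile="b/" + name,
    )
    return "".join(diff)


old_text = "alpha\nbeta\ngamma\n"
new_text = "alpha\nBETA\ngamma\n"

patch = make_patch(old_text, new_text)
print(patch)  # a few lines long, regardless of how large the full file is
```

Each patch would then be saved under a new file name and committed, so that every commit adds one small blob instead of a new 100 MB one. Rebuilding the current text means replaying the patches in order.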
LeGEC