3

I have some files in a format (let's call them *.db files) that cannot be correctly diffed/merged by git. However, it is possible to transform each of these files into a file structure that git can handle correctly — for each of the db files we can generate a directory tree such that each file in that tree is a regular text file that can be diffed/merged by git (let's call this process 'deconstruction'). Furthermore, the contents of that directory tree can be combined together to (re)construct the original db file. This approach allows us to use standard git hosting with all the usual workflows (pull requests, auto-merge etc.) with these special files.

Now to the actual problem: I'd like to make all this completely transparent to the user. That is, any time the user stages a db file, I want the 'deconstruct' script to run automatically, with the transformed representation getting staged instead. Similarly, any time a checkout operation is run, I want the 'construct' script to run so that the user gets the correct db file. This should also work with git add -a etc. I want all this to happen on the client side, as I cannot change the remote configuration (so custom merge tools are also out of the question).

It is ok if on the remote the data appears in the deconstructed form. In fact, ideally I'd like to see something like

data.db/
  1.txt
  2.txt
  ....
  n.txt

on the remote for a file data.db in a local repository, but I don't know if that is possible at all. This would mean that local git would be able to create this deconstructed form in the staging area (and commit that) and reconstruct it into a db file, without the deconstructed form ever touching the actual working directory.

I assume that at least some aspects of the above would work (otherwise how do tools like git-lfs do it?), but I don't know what the limitations are and where to start looking. I am aware of the pre-commit etc. hooks, but I don't think they allow me to manipulate the staging area directly.

I would appreciate it if someone could sketch a plan of attack for accomplishing this workflow.

MrMobster

1 Answer

3

You should look at content filters:

see the "Attributes" chapter in the Pro Git book, and scroll down to the content filter section.

I learned of their existence by reading one of VonC's answers to a (much simpler) content question.

If you only need diffing, the Attributes chapter also has an example of how to preprocess binary files before diffing.
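For the diff-only route, a minimal sketch of a textconv helper could look like the one below. It assumes a diff driver called dbtext and a script called db2text.py (both names are mine), wired up with `*.db diff=dbtext` in .gitattributes and `git config diff.dbtext.textconv "python3 db2text.py"`. Git calls the command with the path of the file to convert and diffs whatever it prints. The strings-like extraction is only a stand-in for a real db-to-text conversion.

```python
#!/usr/bin/env python3
# db2text.py: hypothetical textconv helper (script and driver names are assumptions).
import re
import sys


def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII at least min_len long, similar to strings(1)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def main(argv) -> None:
    # git passes the name of the file to convert as the single argument.
    with open(argv[1], "rb") as fh:
        data = fh.read()
    for line in printable_runs(data):
        print(line)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

This only changes what `git diff` shows; the stored blobs stay binary, so it does not help with merging.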


Reading git help gitattributes and looking at the details of the filter section:
when defining a filter, you can use %f to pass the file path as an argument to your processing script.

You can use that to:

  • on clean:

    • build a target path from the original filename, e.g. expanded data for file foo.db will actually be stored under a directory named .foo.db.d/,
    • have the script build the expanded content and stage it,
    • replace the content of the actual staged file foo.db with the path of the expanded data, .foo.db.d
  • on smudge:

    • read the path to the expanded content from file foo.db
    • combine the expanded content into a single foo.db file
    • somehow not check out the .foo.db.d directory (not sure how to implement that part yet)
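A rough sketch of the filter side of that plan, in Python. Everything named here is an assumption for illustration: the filter name dbexpand, the script name dbfilter.py, and plain concatenation as a stand-in for the real 'construct' step. It also assumes the stored pointer path can be resolved from the directory the filter runs in. The clean side only emits the pointer; deconstructing and staging the expanded data is not shown.

```python
#!/usr/bin/env python3
"""dbfilter.py: hypothetical clean/smudge filter skeleton.

Assumed wiring (all names are illustrative):
  .gitattributes:  *.db filter=dbexpand
  git config filter.dbexpand.clean  "python3 dbfilter.py clean %f"
  git config filter.dbexpand.smudge "python3 dbfilter.py smudge %f"
"""
import os
import posixpath
import sys


def expanded_dir_for(path: str) -> str:
    """Map 'sub/foo.db' to 'sub/.foo.db.d', the naming scheme from the list above."""
    head, name = posixpath.split(path)
    return posixpath.join(head, f".{name}.d")


def clean(path: str, content: bytes) -> bytes:
    """Return what git stores for `path`: a one-line pointer to the expanded data.

    `content` is ignored in this sketch; a real filter would deconstruct it.
    """
    return (expanded_dir_for(path) + "\n").encode("utf-8")


def combine(expanded_dir: str) -> bytes:
    """Stand-in for the real 'construct' step: concatenate the files in name order."""
    parts = []
    for name in sorted(os.listdir(expanded_dir)):
        with open(os.path.join(expanded_dir, name), "rb") as fh:
            parts.append(fh.read())
    return b"".join(parts)


def smudge(pointer: bytes) -> bytes:
    """Read the pointer stored in the blob and rebuild the db content from it."""
    return combine(pointer.decode("utf-8").strip())


def main(argv) -> None:
    mode, path = argv[1], argv[2]
    data = sys.stdin.buffer.read()  # git feeds the file content on stdin
    result = clean(path, data) if mode == "clean" else smudge(data)
    sys.stdout.buffer.write(result)  # and takes the filtered content from stdout


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Note that smudge reads the expanded directory from disk, so as written it needs that directory to be present in the working tree.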

Still in git help gitattributes:

There is also a way to define a filter.<name>.process command (instead of the two .clean and .smudge commands), which allows you to dig deeper into what git "sees" as content.

If you want to go down that rabbit hole: on Ubuntu I had to install the git-doc package to get access to the detailed documentation mentioned in the help page, technical/long-running-process.txt.
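As a starting point for the process variant, here is a small sketch of the pkt-line framing that the long-running filter protocol is built on, as I understand it from the docs: each packet is four hex digits giving the total length (header included) followed by the payload, and 0000 is a flush packet. Git opens with git-filter-client and version=2, and the filter answers with git-filter-server and version=2. The helper names are my own, and the capability exchange and the actual clean/smudge requests are not shown.

```python
import io
from typing import Optional

FLUSH = b"0000"          # flush packet: ends a list of packets
MAX_PAYLOAD = 65516      # largest payload that fits in one pkt-line


def pkt_line(payload: bytes) -> bytes:
    """Encode one pkt-line: 4 hex digits of total length, then the payload."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_line(stream) -> Optional[bytes]:
    """Read one pkt-line from a binary stream; return None for a flush packet."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended inside a pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None
    return stream.read(length - 4)


def server_hello() -> bytes:
    """The filter's reply to git's 'git-filter-client' / 'version=2' greeting."""
    return pkt_line(b"git-filter-server\n") + pkt_line(b"version=2\n") + FLUSH


# Round trip through an in-memory stream.
stream = io.BytesIO(server_hello())
packets = [read_pkt_line(stream) for _ in range(3)]
# packets == [b"git-filter-server\n", b"version=2\n", None]
```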

LeGEC
  • 1
    A 2013 answer... I was young back then, and knew little. Now, well, I am just older. Note: I have 179 of such answers (https://stackoverflow.com/search?q=user%3A6309+%22content+filter%22) – VonC Mar 24 '20 at 20:58
  • I did look at content filters, but I was under the impression that they are "just" filters: a stream goes in, a stream goes out. How can I use them to split a file into multiple ones and vice versa? – MrMobster Mar 24 '20 at 21:01
  • @MrMobster: you can read many interesting things in `git help gitattributes`; I'm discovering them myself, and with the illustrations provided in the doc links, the technicalities make sense. – LeGEC Mar 24 '20 at 21:25
  • Unfortunately, this doesn't really work since one cannot manipulate the staging area from the filters. I have played around with combining filters with commit hooks but can't find a satisfying solution. Back to the drawing board it seems. – MrMobster Mar 25 '20 at 10:38
  • @MrMobster: is there something that prevents running `git add`, `git hash-object -w` or `git write-tree` from the filter scripts? (I'm just curious) – LeGEC Mar 25 '20 at 12:24
  • I tried `git add` and it complains about git already running (index is locked) – MrMobster Mar 25 '20 at 12:58