
I'm attempting to push my output files from Databricks to GitHub. (As I understand it, Git integration with Databricks only covers notebooks, not other files such as CSVs. When you add a Databricks repo, a dialog appears saying only db-notebooks are cloned.)

I can successfully push to GitHub once, but after pushing I can no longer commit. #commitmentissues

The error is that git cannot append to .git/logs/HEAD:

fatal: cannot update the ref 'HEAD': unable to append to '.git/logs/HEAD': Operation not supported

What I've done

  1. Initialize git from Databricks notebook: git init
  2. Tell git who I am: git config user.email "<email>" and git config user.name "<name>"
  3. Add and commit file: git add test.txt && git commit -m "message"

This works!

  1. Add remote: git remote add origin https://github.com/<user>/<repo>.git
  2. Push to remote. I did this from RStudio in Databricks (rather than from the notebook) so that I could enter my GitHub username and personal access token interactively: git push -u origin master

This works!

  1. Add a new file: git add file2.txt
  2. Commit: git commit -m "message"

This fails.

Error:

fatal: cannot update the ref 'HEAD': unable to append to '.git/logs/HEAD': Operation not supported

Why does pushing to GitHub change git's ability to append to .git/logs/HEAD? How can I work around this?

Research

  • This question is also about pushing to GitHub from Databricks, but it fails at a different step in the process and uses Databricks Git Integration, which I am not using.
  • This GitHub issue reports the same error, but I got lost once they started talking about formats.
Unrelated
  • I've realized this is not about git at all, but about appending files in Databricks. [It is not possible to append on mounted storage in Databricks](https://kb.databricks.com/dbfs/errno95-operation-not-supported.html), and all git commits after the first require appending. The question then becomes how to use git when you cannot append. – Unrelated Jul 11 '22 at 22:18
  • The short answer is "you can't": put the repository somewhere else, where Databricks' file system can't get in the way. Consider storing the *repository* in location A (where things work) and the *working tree* in location B (in Databricks) if that works for your case. If not, store both repository and working tree in a fully-capable file system, and occasionally *copy* the working tree *to* Databricks, when that's appropriate. – torek Jul 12 '22 at 00:05
  • @torek I think this is what I've been hoping to do since learning the issue was with appending on Databricks, but unsure how I'd go about separating the repository from the working tree – I've always understood the working tree to be within the repo – Unrelated Jul 12 '22 at 00:20
  • It's actually inverted: the repository is normally in the working tree! But Git does let you set them up separately; see `git init`'s `--separate-git-dir=` option. This requires a not-crazy-ancient Git version (1.7.5 or later, some people are still running on 1.7.x). – torek Jul 12 '22 at 00:26
  • This is fascinating and (untested) perfect! Thank you! @torek – Unrelated Jul 12 '22 at 00:37

1 Answer


The problem, in the end, has nothing to do with git itself; it is caused by mounted storage on Databricks.

During a commit, git appends to its log files (such as .git/logs/HEAD). Databricks, however, does not support appending to files on mounted storage, so the commit fails with "Operation not supported".
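To check whether a given directory allows appends before putting a repository there, a small probe along these lines may help. This is only a sketch: supports_append is a hypothetical helper name, and I am assuming the failure shows up as a non-zero exit status or as missing content after the append, which may not cover every storage backend.

```shell
#!/usr/bin/env bash
# Sketch: probe whether a directory supports appending to an existing file.
# supports_append is a hypothetical helper, not part of git or Databricks.
supports_append() {
  local dir="$1"
  local probe="$dir/.append_probe_$$"
  local ok=1

  # The first write creates the file, the second one appends to it.
  # The braces keep redirection errors quiet as well.
  if { printf 'a' > "$probe" && printf 'b' >> "$probe"; } 2>/dev/null; then
    # Confirm that the appended byte really arrived.
    if [ "$(cat "$probe" 2>/dev/null)" = "ab" ]; then
      ok=0
    fi
  fi

  rm -f "$probe"
  return "$ok"
}

# Example calls (second path is the mounted storage from this question):
#   supports_append /tmp && echo "appends work in /tmp"
#   supports_append /dbfs/mnt || echo "no appends on /dbfs/mnt"
```

A directory that fails this probe is probably not a safe home for a .git directory.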

The solution, then, is to host the repository on unmounted storage (e.g. in /tmp, as suggested by the Databricks knowledge-base article linked in the question's comments).
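One way this could look in practice, if the output files have to stay on mounted storage, is to keep the whole repository in /tmp and copy the outputs into it before each commit. The following is a rough sketch under those assumptions; sync_and_commit is a hypothetical helper name, and it assumes the repository has already been initialized and user.name and user.email are configured.

```shell
#!/usr/bin/env bash
# Sketch: copy output files into a repository on local disk, then commit.
# sync_and_commit is a hypothetical helper; all paths are placeholders.
sync_and_commit() {
  local src_dir="$1"    # where the output files live (e.g. mounted storage)
  local repo_dir="$2"   # repository on local disk (e.g. under /tmp)
  local message="$3"

  # Copy the contents of src_dir, including dotfiles, into the repository.
  cp -R "$src_dir"/. "$repo_dir"/ || return 1

  git -C "$repo_dir" add -A || return 1

  # Commit only if the copy actually staged a change.
  if ! git -C "$repo_dir" diff --cached --quiet; then
    git -C "$repo_dir" commit --quiet -m "$message"
  fi
}

# Example call (placeholder paths):
#   sync_and_commit /dbfs/mnt/outputs /tmp/project-repo "Update outputs"
```

Since /tmp lives on the cluster's local disk, it is probably wise to push after committing so that the history survives a cluster restart.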

@torek, in the question's comments, points out that the working tree could remain on mounted storage, with only the repo being hosted on unmounted storage, using git init's --separate-git-dir= option.

/tmp/
   project-repo
/dbfs/mnt/
   project-working-tree
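
The layout above could be set up roughly as follows. Like the suggestion it is based on, this is untested on Databricks; init_split_repo is a hypothetical helper name, and the only git feature it relies on is the --separate-git-dir option of git init, which leaves a small .git pointer file in the working tree instead of a .git directory.

```shell
#!/usr/bin/env bash
# Sketch: keep git's metadata on local disk and the working tree elsewhere.
# init_split_repo is a hypothetical helper, not a git or Databricks command.
init_split_repo() {
  local git_dir="$1"     # where the repository data goes (absolute path)
  local work_tree="$2"   # where the tracked files live (absolute path)

  mkdir -p "$(dirname "$git_dir")" "$work_tree" || return 1

  # Run git init from the working tree; the repository itself is created in
  # git_dir, and the working tree only receives a ".git" file pointing to it.
  (cd "$work_tree" && git init --quiet --separate-git-dir="$git_dir")
}

# Example call using the paths from the layout above:
#   init_split_repo /tmp/project-repo /dbfs/mnt/project-working-tree
```

After that, git add and git commit are run from the working tree as usual, while the appends to logs/HEAD happen inside the repository directory on local disk.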
Unrelated