
We have a big git repository, which I want to push to a self-hosted gitlab instance.

The problem is that the gitlab remote does not let me push my repo:

git push --mirror https://mygitlab/xy/myrepo.git

This will give me this error:

Enumerating objects: 1383567, done.
Counting objects: 100% (1383567/1383567), done.
Delta compression using up to 8 threads
Compressing objects: 100% (207614/207614), done.
remote: error: object c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867: duplicateEntries: contains duplicate file entries
remote: fatal: fsck error in packed object

So I did a git fsck:

error in tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867: duplicateEntries: contains duplicate file entries
error in tree 0d7286cedf43c65e1ce9f69b74baaf0ca2b73e2b: duplicateEntries: contains duplicate file entries
error in tree 7f14e6474400417d11dfd5eba89b8370c67aad3a: duplicateEntries: contains duplicate file entries

Next thing I did was to check git ls-tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867:

100644 blob c233c88b192acfc20548d9d9f0c81c48c6a05a66    fileA.cs
100644 blob 5d6096cb75d27780cdf6da8a3b4d357515f004e0    fileB.cs
100644 blob 5d6096cb75d27780cdf6da8a3b4d357515f004e0    fileB.cs
100644 blob d2a4248bcda39c0dc3827b495f7751b7cc06c816    fileC.xaml

Notice that fileB.cs appears twice, with the same hash. I assume this is the problem, because why would the same file name with the same blob hash be listed twice in one tree?

Now I googled the problem but could not find a way to fix it. One seemingly good resource I found was this: Tree contains duplicate file entries

However, it basically comes down to using git replace, which does not really fix the problem: git fsck will still print the error and prevent me from pushing to the remote.

Then there is this answer, which seems to remove the file entirely (but I still need the file, just once instead of twice in the tree): https://stackoverflow.com/a/44672692/826244

Is there any other way to fix this? It really should be possible to fix it so that git fsck does not report any errors, right? I am aware that I will need to rewrite the entire history after the corrupted commits. I could not even find a way to get the commits that point to the specific trees, otherwise I might be able to use rebase and patch the corrupted commits or something. Any help would be greatly appreciated!
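Something along these lines would locate the commits that contain a given tree by brute force (walking every commit and grepping its recursive tree listing). The tree hash is the one from the fsck output, everything else is stock git; a scan like this is slow on a repository of this size:

BAD_TREE=c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867
for c in $(git rev-list --all); do
    # match either the commit's root tree or any subtree inside it
    if [ "$(git rev-parse "$c^{tree}")" = "$BAD_TREE" ] \
       || git ls-tree -r -t "$c" | grep -q "$BAD_TREE"; then
        echo "$c"
    fi
done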

UPDATE: Pretty sure I know what to do, but not yet how to do it:

  1. Create a new, corrected tree object from the old tree with git mktree <- done
  2. Create a new commit that is identical to the old one referencing the bad tree, except that it points to the newly fixed tree <- difficult: I cannot easily find the commit belonging to the tree (my current approach runs an hour or more), and I do not know how to create the modified commit once I have found it
  3. Run git filter-branch -- --all <- should persist the replaced commits

Sadly I cannot just use git replace --edit on the bad tree and then run git filter-branch -- --all, because filter-branch seems to work only on commits and ignores tree replacements...
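Something like the following ought to wire those three steps together with plain git plumbing. It assumes the duplicate lines in the bad tree are byte-identical (as in the ls-tree output above), that the bad tree is the commit's root tree (if it is a subtree, every parent tree up to the root has to be rebuilt with git mktree as well), and that <bad-commit>, <fixed-tree> and <fixed-commit> are placeholders; replacing the commit rather than the tree is what lets git filter-branch pick the fix up:

# 1. Build a corrected tree: dump the entries, drop the exact duplicate, rebuild.
#    (sort -u only works here because the two fileB.cs lines are identical.)
git ls-tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867 | sort -u | git mktree
# -> prints <fixed-tree>

# 2. Re-create the affected commit with the fixed tree and the same parent
#    (assuming a single parent; pass one -p per parent for merges).
#    Author/committer data would have to be copied from the original,
#    e.g. via `git cat-file commit <bad-commit>` and GIT_AUTHOR_* variables.
git log -1 --format=%B <bad-commit> > /tmp/msg
git commit-tree <fixed-tree> -p <bad-commit>^ -F /tmp/msg
# -> prints <fixed-commit>

# 3. Replace the commit (not the tree) and bake the replacement into history.
git replace <bad-commit> <fixed-commit>
git filter-branch --tag-name-filter cat -- --all
# afterwards the replacement ref can be removed: git replace -d <bad-commit>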

Tim
  • What OS and Git version do you have on your side (client) and GitLab side (server, unless self-hosted also means self-hosted on the same computer)? – VonC May 31 '19 at 11:59
  • git version 2.21.0 on Windows, GitLab 11.9 on Linux, not sure which one. But the problem is reproducible on Windows and Linux, checking out with `git clone --mirror`, then running `git fsck` – Tim May 31 '19 at 15:01
  • So the repository itself is corrupted, apparently. – VonC May 31 '19 at 15:37
  • Yes, and I want to fix that, if possible – Tim May 31 '19 at 17:19
  • I don't see in your answer a `git show ` mentioned in https://stackoverflow.com/a/24868719/6309: that would at least show the duplicate entry. – VonC Jun 02 '19 at 04:56
  • The `git show c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867` will just print out the same as `git ls-tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867`, but just the file-names, not the blob id and type – Tim Jun 02 '19 at 08:05
  • OK. `fileB.cs` is duplicated then. Not sure then, considering I wrote https://stackoverflow.com/a/44672692/826244: maybe apply it anyway, saving each commit created where the file is deleted, and then filter branch it again, to add it back on each of those commits... – VonC Jun 02 '19 at 15:12
  • Ok so if I understand this correctly I would have to remove the files entirely from the repository and then basically add it again in the relevant branches by adding a commit, right? The problem is that I would have to save the current state of the file in all our release tags and most likely the project will not be able to build for all commits where the file is missing. Is this correct so far? If possible I would like to avoid leaving the repo not buildable for most of the commits in the last half year... – Tim Jun 02 '19 at 18:23
  • The idea would be to remove first, then add back the file in a second step, all done in a local repository for testing: nobody else would be exposed to an "incomplete" (not compiling) repository. The challenge is to save the state of one of the two `fileB.cs` as well as the new commits created by their removal. That way, the second step would, for each of those new commits, modify them by adding back (once) the matching `fileB.cs`, resulting in a tree with only one `fileB.cs` per affected commit. – VonC Jun 02 '19 at 18:37
  • Alright then, so I remove the files (so far should be no problem with BFG or git filter-branch), but how would I add the files in the second step? As all following commits will have to change, I will have to work with filter-branch, right? Like this? https://stackoverflow.com/a/54200033/826244 Will the cp then be executed only for a single commit or for all commits starting at the start point? Another problem is that the files in question have changed very often in the last 6 months, so another solution would be preferred – Tim Jun 02 '19 at 18:48
  • Agreed, what I have in mind is not trivial to implement: 2 step process of git filter-branch, one to remove, one to add (and yes, add in every commit, adding the right content for that file, as saved during the first step) – VonC Jun 02 '19 at 18:49
  • What was the final used solution for this issue? – filipe Jun 11 '19 at 10:04
  • I will post an update later, but basically I unpacked the pack files and then wrote a tool to fix the defective trees, their commits and all later commits. Will upload the tool to github in a few days so that anyone could easily fix it – Tim Jun 11 '19 at 13:02

4 Answers


You can try running git fast-export to export your repository into a data file, and then run git fast-import to re-import the data file into a new repository. Git will remove any duplicate entries during the fast-import process, which will solve your problem.

Be aware that you may have to make a decision about how to handle signed tags and such when you export by passing appropriate arguments to git fast-export; since you're rewriting history, you probably want to pass --signed-tags=strip.
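A minimal sketch of that round trip could look like this, assuming the corrupt repository lives in ./old-repo and the new one is created next to it (both paths are placeholders):

# export everything, stripping tag signatures since history is being rewritten
git -C old-repo fast-export --all --signed-tags=strip > repo.export

# re-import into a fresh repository and verify
git init --bare new-repo.git
git -C new-repo.git fast-import < repo.export
git -C new-repo.git fsck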

bk2204

The final solution was to write a tool that tackles this problem.

The first step was to git unpack-objects all packfiles. Then I had to identify the commits that pointed to the trees with duplicate entries by reading all refs and walking back through history, checking all the trees. Once I had the tools for that, it was not hard to rewrite the trees of those commits and then rewrite all later commits. After that I had to update the changed refs. This is the moment where I thoroughly tested the result, as nothing was lost yet. Finally, git reflog expire --expire=now --all && git gc --prune=now --aggressive rewrote the pack and removed all loose objects that were no longer reachable.
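The surrounding plumbing, in plain git commands, would look roughly like this (the actual tree/commit rewriting happens in the tool; the URL and paths are placeholders):

# work on a mirror clone so nothing touches the original
git clone --mirror https://oldserver/xy/myrepo.git myrepo.git
cd myrepo.git

# unpack the packfiles into loose objects (packs must be moved out first,
# otherwise git unpack-objects skips objects it already knows about)
mkdir ../packs && mv objects/pack/* ../packs/
for p in ../packs/*.pack; do git unpack-objects < "$p"; done

# ... rewrite the bad trees, their commits and all descendants, update refs ...

# repack and drop the now-unreachable original objects
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git fsck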

When I have the time I will upload the source code to GitHub, as it performs really well and could be a template for similar problems. It ran for only a few minutes on a 3.7 GB repository (about 20 GB unpacked). By now I have also implemented reading directly from the packfiles, so there is no need to unpack anything anymore (which took a lot of time and space).

Update: I worked a little more on the source and it now performs really well, even better than BFG for deleting a single file (no option switches yet). The source code is available here: https://github.com/TimHeinrich/GitRewrite. Be aware, this was only tested against a single repository, and only under Windows on a Core i7. It is highly unlikely that it will work on Linux or with any other processor architecture.

Tim
  • Added the link to the sources – Tim Jun 22 '19 at 11:53
  • Perfect, thank you! It should be possible to cross-compile your tool to other platforms. – VonC Jun 22 '19 at 12:04
  • Yes, I am just not sure if I got the byte-order checks right for other platforms, or whether I used platform-independent path separators everywhere – Tim Jun 22 '19 at 13:14

You can delete the related refs and expire their objects.

In order to find the related refs run:

$ git log --all --format=raw --raw -t --no-abbrev

and search for the offending SHA, then find the containing ref with $ git show-ref

Next, for each ref holding the bad objects do:

$ git update-ref -d refs/changes/xx/xxxxxx/x

Finally, expire the objects and run fsck; the errors should be gone.

$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive
$ git fsck
filipe
  • I still need all the refs and its objects. I just have to change the objects so that they are not invalid anymore. – Tim Jun 07 '19 at 13:47
  • @Tim, did you try to use the rebase command instead? git rebase -fr (https://git-scm.com/docs/git-rebase) – filipe Jun 07 '19 at 14:00
  • The idea is to create a branch from a previous commit and then rebase it from the source branch with -fr options. – filipe Jun 07 '19 at 14:20

I found an issue related to GitLab not supporting fsck.skipList, and I think the workaround used there may apply:

To get the repository into a new GitLab project, the reporter used the import feature when creating that GitLab project and had it import straight from his other repository.

Note: it did not fix the repository locally, but it allowed the import to succeed; importing that way may have generated a clean history on the remote.
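For context, on a plain Git server the missing knob would correspond to a receive-side fsck override on the repository that receives the push; whether a self-hosted GitLab instance exposes this is a separate question, so the following is only an illustration of what such a setting would look like:

# run inside the server-side (bare) repository that receives the push
git config receive.fsck.duplicateEntries ignore   # downgrade just this check
# or, more bluntly (not recommended):
git config receive.fsckObjects false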

filipe