I have a bunch of object ids referring to blobs in a given git repository. I would like to obtain the number of bytes that their uncompressed content occupies, preferably using JGit. That is, the number of bytes that the corresponding file will contain, once checked out in the workspace.

Is this information stored in the git blob itself? It is briefly discussed here but I do not understand if the blob size in the blob header corresponds to the size once inflated, or something else (such as the size required to store the delta).

I can access the blob size through JGit: given a FileRepository repository and having initialized once and for all an ObjectReader reader = repository.newObjectReader(), it seems that the size I seek can be obtained using reader.open(objectId).getSize(). But this is slow: it often takes several tens of milliseconds to get a single blob size. If I understand correctly, JGit reads the whole blob, at least in some cases. (I asked a similar question here but got no reply.)

My question is: can I get a blob size faster using JGit? Alternatively, can I achieve what I want at least in principle by reading some part of the blob data, that is, is this information stored somewhere in direct form, or deducible, or do I absolutely need to read and inflate the whole blob before knowing its size?

Olivier Cailloux
  • The size *is* stored in the blob, in its header. It may, however, take many milliseconds to locate and read the blob header, especially if the object is packed. I have no idea whether any particular JGit code is good at finding the object's size without completely uncompressing the object first, or not, though (which is why this is a comment, not an answer). – torek Feb 01 '21 at 08:46
  • I tried `time git cat-file -s ` and `time (echo | git cat-file --batch-check)` on a 30 MB blob. The latter is much faster. Besides, when the command is repeated, the first run always takes much longer than the following ones. – ElpieKay Feb 01 '21 at 09:53
  • @torek Thanks. An answer not specific to JGit would be fine if nothing more direct appears. Any official reference about the size stored in the header? Do you know if uncompressing the whole object is required in order to access the relevant part of the header? – Olivier Cailloux Feb 01 '21 at 14:01
  • Using `git cat-file` with the `-s` or `--batch-check` option should be very fast. As @ElpieKay found, there's actually a bug in current (or maybe just older) versions of Git where `git cat-file -s` expands the entire blob internally even though it only needs the size, while the batch-check code doesn't; I think it's in the process of being fixed (I saw some messages on the Git mailing list but didn't keep track of them). – torek Feb 01 '21 at 14:55
  • The main problem with using `git cat-file` is that spawning a separate process is itself slow. Whether that will be a problem overall for your situation, I don't know. – torek Feb 01 '21 at 14:56
  • Oh, and, to answer the question you asked: it's necessary to read the blob header (the first however many bytes) to get the size, and this may require running a decompression algorithm, but there's no need to decompress the *entire* object: once we have run enough decompression to get the `blob <size>\0` bytes, we're done. That's what the batch-check code does and that's why it's fast, and the fact that `git cat-file -s` doesn't *use* the fast path is the bug. – torek Feb 01 '21 at 14:58
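
The point made in the last comment can be illustrated for a *loose* object with plain `java.util.zip`, without JGit. This is only a sketch under stated assumptions: the class and method names (`LooseHeaderSize`, `looseBlob`, `sizeFromHeader`) are made up for illustration, and `looseBlob` builds the compressed bytes in memory instead of reading a file under `.git/objects`. Packed objects encode their size differently, and for deltified entries the inflated size sits at the start of the delta data, so this does not cover them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class LooseHeaderSize {

    /** Hypothetical helper: zlib-compresses "blob <size>\0" followed by the content. */
    static byte[] looseBlob(byte[] content) {
        byte[] header = ("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII);
        byte[] raw = new byte[header.length + content.length];
        System.arraycopy(header, 0, raw, 0, header.length);
        System.arraycopy(content, 0, raw, header.length, content.length);

        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates at most 32 bytes, stops at the NUL that ends the header, and parses the size. */
    static long sizeFromHeader(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] head = new byte[32]; // enough for "blob ", the decimal size and the NUL
            int filled = 0;
            int nul = -1;
            while (nul < 0 && filled < head.length) {
                int n = inflater.inflate(head, filled, head.length - filled);
                if (n == 0) {
                    break; // no more output available
                }
                for (int i = filled; i < filled + n; i++) {
                    if (head[i] == 0) {
                        nul = i;
                        break;
                    }
                }
                filled += n;
            }
            if (nul < 0) {
                throw new DataFormatException("no header terminator found");
            }
            String header = new String(head, 0, nul, StandardCharsets.US_ASCII);
            int space = header.indexOf(' ');
            if (space < 0) {
                throw new DataFormatException("malformed header: " + header);
            }
            return Long.parseLong(header.substring(space + 1));
        } finally {
            inflater.end();
        }
    }
}
```

However large the blob is, `sizeFromHeader` never produces more than 32 bytes of inflated output, which is why a header-only lookup can be cheap once the object has been located.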

1 Answer

Use `ObjectReader#getObjectSize`, which reads only the size of the object and not the entire object.

Opening the object will load it all into memory, which is unnecessary.
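
A minimal sketch of how this could look for a batch of ids, assuming JGit is on the classpath. The class and method names (`BlobSizes`, `blobSizes`) are hypothetical; `getObjectSize(AnyObjectId, int)` and `Constants.OBJ_BLOB` are JGit's own. A single reader is shared across all lookups and closed at the end.

```java
import java.io.IOException;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

import org.eclipse.jgit.lib.Constants;
import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.lib.Repository;

public class BlobSizes {

    /** Returns the inflated size in bytes of each blob, in iteration order of the ids. */
    static Map<ObjectId, Long> blobSizes(Repository repository, Collection<ObjectId> ids)
            throws IOException {
        Map<ObjectId, Long> sizes = new LinkedHashMap<>();
        // One reader for the whole batch; ObjectReader is AutoCloseable.
        try (ObjectReader reader = repository.newObjectReader()) {
            for (ObjectId id : ids) {
                // The type hint makes the call fail if the id is not a blob.
                sizes.put(id, reader.getObjectSize(id, Constants.OBJ_BLOB));
            }
        }
        return sizes;
    }
}
```

If an id may refer to something other than a blob, `ObjectReader.OBJ_ANY` can be passed as the type hint instead of `Constants.OBJ_BLOB`.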

Edward Thomson