I am attempting to decode git objects:

import zlib
import os

...
# current directory is .git/objects
    for current, subs, files in os.walk('.'):
        for filename in files:
            # in format ##/#{38}
            
            path = os.path.join(current, filename)[2:]

            # 'info/' and 'pack/' exist
            # don't worry about packed files

            with open(path, 'r') as file:
                
                # returns bytes object
                # assuming UTF-8 encoding (default) vs. legacy
                # https://git-scm.com/docs/git-commit#_discussion
                # .decode() also defaults to utf-8
                
                print(zlib.decompress(file.read()).decode())

However, this raises a UnicodeDecodeError, and the correct approach looks something like:

with open(path, 'rb') as file:
  data = zlib.decompress(file.read())
  header, content = data.split(b'\0', 1)

This reads the file as binary data. In another related post, a commenter mentioned that 'rb' does not decode at all. That seems inaccurate to me, since the resulting bytes object is displayed in human-readable form. I would like clarification on this, as the documentation is fairly sparse.

I have found that data read with 'rb' comes back as bytes objects, which are written with a b prefix. My question is: why does decoding not work if git by default (and in this repository) uses UTF-8? And how does Python present the bytes in a human-readable format (i.e., b'This is a string') if it is unable to decode them?

user18348324
  • If the file has human-readable parts, then of course the string will have human-readable parts. You are not reading git source files. You are reading gzip archives. They most definitely are binary. If you open a file with `'r'`, it gets decoded through UTF-8 and becomes Python Unicode strings. If you open a file with `'rb'`, it is not decoded. A 4,000-byte file will become a 4,000-byte string. – Tim Roberts Mar 05 '23 at 06:01
  • These aren't "binary strings" or "binary data". We open the file in "binary mode", as opposed to "text mode", which was inherited from Windows. A file opened in "binary mode" produces "byte strings" that are not Unicode-encoded. They are raw; any byte sequence is possible. With a "text file", some byte sequences are not allowed, because they can't be converted to Unicode. – Tim Roberts Mar 05 '23 at 06:05
  • So how is it that after decompressing the gzip archive, the resulting representation can be partially presented as human-readable via binary string format? What decoding method does it use for that? In addition, if the uncompressed file was originally encoded in `UTF-8`, as I've been led to believe, why does `'r'` not work? – user18348324 Mar 05 '23 at 06:24
  • Do a hex dump of the file, see if that makes it more clear. The file YOU are opening is a gzip archive. NOTHING you read from that is human-readable. You are passing that file to the `zlib` decompressor. That library returns byte strings to you, because the contents COULD be other binary files, but if the contents are text, you can decode them to Unicode strings. – Tim Roberts Mar 05 '23 at 07:37
  • What is the relation to git and zlib in the tags? Please extract a [mcve], which will also tell you whether these are actually relevant or just worth a side note. In any case, there is extensive documentation for `open()` (maybe even too much?). What part is unclear? You could also read the source code or study the history of commits. – Ulrich Eckhardt Mar 05 '23 at 08:22

2 Answers


In the example I looked at, the "content" contained data that could not be decoded as UTF-8.

Here is the test code I used:

from pathlib import Path
import zlib

git_file = Path.home().joinpath(
    "bluez", ".git", "objects", "c9",
    "4fdc6335829ab797dd06a6f0ac3fd123dd55a8")

data = zlib.decompress(git_file.read_bytes())
print(f"Raw {data}")
print(f"Raw as hex: {data.hex(' ')}")
# Decode on all data gives
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 27: invalid start byte
# data.decode()
header, content = data.split(b'\0', 1)
print(f"Header: {header} or as UTF8: {header.decode('UTF8')}")
print(f"Content decode replace errors:\n {content.decode('UTF8', errors='replace')}")

And here is the output it gave:

Raw b'tree 1854\x00100755 agent.py\x00W\xa7A\x83\xdf%\x96\x1f\xef\xac\xbf\xd4X\x11\xa45\xf2&e\xcb100644 bluezutils.py\x00 D\xe43)\x16\xccfy\xdf\xa4+\xe5\xae\xf7\x06\x8b\xd2\x18+100644 dbusdef.py\x00\xd3\x17\xc1\x8d\xe2\x82\xdd\x81\xdc\xef\x82\xa3\xac\x10\x08.\xbfV\x8e\xf2100644 example-adv-monitor\x00\xa4\x05\xfc{\x0e\x11\xfai\xa33\xcdn^\xce\xc7o\x14\xcfkO100755 example-advertisement\x00_\x02.\xe6v\x97\x0f\xf0\xac\xeb)\xc5\x85L\x8d`O\xc7\x97\x03100755 example-battery-provider\x00\x15"\xa5\xe0u\xca/\x04A\xcfu,U:]K\x10\xd8\xf7\xf4100644 example-endpoint\x00\x16e\x1ch:\x7f\xf2\xef\xba6\n\xff\x1b\xdd\x15\xda\xde\x85A\x9d100755 example-gatt-client\x00^k\xef\x9d{\x92\xb3\xb9\xc1\xe3\x8d7\xb2B\x1b\x13s\xaf\xd2\x9b100755 example-gatt-server\x00w#\x1c:\xd1\x02\xa2[AQ\x7f\x9bA|7\x1b{\xa04\xb0100644 example-player\x00\x14\x97\xd1\x10z\x16\x81H5/0\x08\xfc\t\xa1\x10\xd6\xba\x0c\x0b100755 exchange-business-cards\x00\x9a:\xa2\x9f\xb4v&\xa5\x156-B\xb2\x0cAB\x067C\x16100755 ftp-client\x00\xefuj\xb2\xb3\r\x92s*\x92\xeb(B\x8c\x02\xb7L\xaa\xfbc100755 get-managed-objects\x00Q%\xeeRG\xd8\x87\xc7\xb3\ntaQ\xafS\xee\xc1={\x95100755 get-obex-capabilities\x00\xa7\x98\nD%\x95i\xd4 \x10\xfc\x86aQ\xdc4\x02\xcbG\xe5100755 list-devices\x00\xb1\x12Ul0\xb2\n\xb6\xa6R7\x93\xda\x84\xda%]\xa87\xb2100755 list-folders\x00\xb4\xe3\xf1\x00\xb0\x96&\xda\x7f\xa5{\xbb\x1dY\xadn4l\xe3c100755 map-client\x00\xa2\xd9j\xe5\xf0\xea4\xf0\x16-\xe6Wk\xa6\x9f\xffm\xff"\xa8100755 monitor-bluetooth\x00\xa3\x97~ n\xec\xce\xd9\x1c\x1d\xa5\xafZHK\x1a\xcd\xda\xc0\x91100755 opp-client\x00O\x00\xa4\x1c\x01)\xea?\x14HD\x0e\xf1\xb1\xbf\r\xf6\xbe,\xc6100755 pbap-client\x00\xe6\xca\xfd\xd3\x01B!^\xf6\x16\xad?\xde\xfbPO\xe3\x03)0100644 sap_client.py\x00\xfe\xd1:\xed\xc8@\x16\x91Y\xf6\xfb\xab\x18d\xdfa[\xa2\x00\xb4100644 service-did.xml\x00R\xebh\xc0 \xabem\xf0f\xb6@E)\xff7u\xe4\x9dc100644 service-ftp.xml\x00\x1b\xda\x88W\xf5\xa8\x8e\x99p]\n\x05\xe7\xac\xb8\xd3\xb95L)100644 
service-opp.xml\x005\x1bJA\n\xdf\x97s\xc4\xe7\xcb\xc4\x81\xb1\xc3Y\xd9\xf3\nF100644 service-record.dtd\x00\xf5;\xe5\xd0R\xd2Un\xc4\x07)\x1a\xf1\xc3vd+#\xaa;100644 service-spp.xml\x00+\x15l?\x03\x81\x88]\\W\xda\xad\x91\x81A\xcf\xf2k&"100755 simple-agent\x00O\xda\xff\x1e\xb7e\xa4\x96k\xa5\r0\xc2is>D\x06!\x8c100755 simple-endpoint\x00Y\xca\x18\x9c\xe5\x0eF\xd0\xfc?\xfb\xad\x0bKm\xe1\xf6\xa5B2100755 simple-obex-agent\x00\x06Om0\xb9\xeb\xb2\x84P\x91\xae\xc9\xcc\xbe\xad\xc2!\x98\x08\xc7100755 simple-player\x00\x92h(D\xd0\xf4\xef\\\xef\xef__6\xbe\x8b\x8aq\xd1\x8dF100755 test-adapter\x00\x961\xd9O\xe3\x11\xb4\xbb\xc3\xf7\x07&\xfb\x1e\xf5\xf8\xcd3\xfa~100755 test-device\x00\xa1\xe5\x08\x16gO\xda6\xf4\x82M\x8d\x88\xfb\x89\x82\x04iZ\xe1100755 test-discovery\x00\xec\xcc|~1\xf0p\xb4\x91\xe7\xd1\xa0\xe0u\x9e\xc2\x9f\xc0 \'100755 test-gatt-profile\x00\xa9s\xae\x14\xed1\x81wV\xe7\x0b\xd4[\x9c;\xaaKN\xa90100755 test-health\x00\xd6\xb47\xed\x88\xc5/\xc6(\xbd\x08\x14(\x9b<\x1d]v\xaf\x1a100755 test-health-sink\x00Wf]+\xa6I\xaf\xa1\x1b%\xeed\x1c\x0cI\xa7\x868\xa1b100755 test-hfp\x00\x11\xe3(\xe5L\xc8h"\x04UQ\xce8)\xa7\xc3\xc7\xa4-\xf6100644 test-join\x00\x96\x97\x95\tG@6\x8bD\xcd\xda6\r\xa44[\x8f\xd9\x9b\xde100755 test-manager\x00?\xa7 Z\x04\xb6\xa1\xdc\xd40\xab\xd1\xfd\xbc\xab!\xdd\x8bN\x96100755 test-mesh\x00\xfb\xf2Gk\xfd6\x15\x8f\xf2\xec\xb8\xd5\xee\xc2\xe1oYgR\x02100755 test-nap\x00\xd5\xc7W\xb7\x9d\xe1\x1e\xc7s0\xcb\xf8\x1d\xe8\x07\xaf\xae\x11.E100755 test-network\x00\xac\xc7\xdf\xf6^HVt\xf0\x085\xd6\x93\x84\x88@&\xe7\xb0k100755 test-profile\x00\xaf\x1e#\xf7e\xdd\xef8\x16\xe6(\xb4\x06\xaa\x91\x05\x93\xde\xc3\xed100755 test-sap-server\x00\xdd\xb1\xef\xe9\xbc\x8c\xb6\x84\xc1>\xa0VO&\x10\x11\xc7\xb3-\x86'
Raw as hex: 74 72 65 65 20 31 38 35 34 00 31 30 30 37 35 35 20 61 67 65 6e 74 2e 70 79 00 57 a7 41 83 df 25 96 1f ef ac bf d4 58 11 a4 35 f2 26 65 cb 31 30 30 36 34 34 20 62 6c 75 65 7a 75 74 69 6c 73 2e 70 79 00 20 44 e4 33 29 16 cc 66 79 df a4 2b e5 ae f7 06 8b d2 18 2b 31 30 30 36 34 34 20 64 62 75 73 64 65 66 2e 70 79 00 d3 17 c1 8d e2 82 dd 81 dc ef 82 a3 ac 10 08 2e bf 56 8e f2 31 30 30 36 34 34 20 65 78 61 6d 70 6c 65 2d 61 64 76 2d 6d 6f 6e 69 74 6f 72 00 a4 05 fc 7b 0e 11 fa 69 a3 33 cd 6e 5e ce c7 6f 14 cf 6b 4f 31 30 30 37 35 35 20 65 78 61 6d 70 6c 65 2d 61 64 76 65 72 74 69 73 65 6d 65 6e 74 00 5f 02 2e e6 76 97 0f f0 ac eb 29 c5 85 4c 8d 60 4f c7 97 03 31 30 30 37 35 35 20 65 78 61 6d 70 6c 65 2d 62 61 74 74 65 72 79 2d 70 72 6f 76 69 64 65 72 00 15 22 a5 e0 75 ca 2f 04 41 cf 75 2c 55 3a 5d 4b 10 d8 f7 f4 31 30 30 36 34 34 20 65 78 61 6d 70 6c 65 2d 65 6e 64 70 6f 69 6e 74 00 16 65 1c 68 3a 7f f2 ef ba 36 0a ff 1b dd 15 da de 85 41 9d 31 30 30 37 35 35 20 65 78 61 6d 70 6c 65 2d 67 61 74 74 2d 63 6c 69 65 6e 74 00 5e 6b ef 9d 7b 92 b3 b9 c1 e3 8d 37 b2 42 1b 13 73 af d2 9b 31 30 30 37 35 35 20 65 78 61 6d 70 6c 65 2d 67 61 74 74 2d 73 65 72 76 65 72 00 77 23 1c 3a d1 02 a2 5b 41 51 7f 9b 41 7c 37 1b 7b a0 34 b0 31 30 30 36 34 34 20 65 78 61 6d 70 6c 65 2d 70 6c 61 79 65 72 00 14 97 d1 10 7a 16 81 48 35 2f 30 08 fc 09 a1 10 d6 ba 0c 0b 31 30 30 37 35 35 20 65 78 63 68 61 6e 67 65 2d 62 75 73 69 6e 65 73 73 2d 63 61 72 64 73 00 9a 3a a2 9f b4 76 26 a5 15 36 2d 42 b2 0c 41 42 06 37 43 16 31 30 30 37 35 35 20 66 74 70 2d 63 6c 69 65 6e 74 00 ef 75 6a b2 b3 0d 92 73 2a 92 eb 28 42 8c 02 b7 4c aa fb 63 31 30 30 37 35 35 20 67 65 74 2d 6d 61 6e 61 67 65 64 2d 6f 62 6a 65 63 74 73 00 51 25 ee 52 47 d8 87 c7 b3 0a 74 61 51 af 53 ee c1 3d 7b 95 31 30 30 37 35 35 20 67 65 74 2d 6f 62 65 78 2d 63 61 70 61 62 69 6c 69 74 69 65 73 00 a7 98 0a 44 25 95 69 d4 20 10 fc 86 61 51 dc 34 02 cb 47 e5 31 30 30 37 35 35 20 6c 69 73 74 2d 64 65 76 69 63 65 73 00 b1 12 55 6c 
30 b2 0a b6 a6 52 37 93 da 84 da 25 5d a8 37 b2 31 30 30 37 35 35 20 6c 69 73 74 2d 66 6f 6c 64 65 72 73 00 b4 e3 f1 00 b0 96 26 da 7f a5 7b bb 1d 59 ad 6e 34 6c e3 63 31 30 30 37 35 35 20 6d 61 70 2d 63 6c 69 65 6e 74 00 a2 d9 6a e5 f0 ea 34 f0 16 2d e6 57 6b a6 9f ff 6d ff 22 a8 31 30 30 37 35 35 20 6d 6f 6e 69 74 6f 72 2d 62 6c 75 65 74 6f 6f 74 68 00 a3 97 7e 20 6e ec ce d9 1c 1d a5 af 5a 48 4b 1a cd da c0 91 31 30 30 37 35 35 20 6f 70 70 2d 63 6c 69 65 6e 74 00 4f 00 a4 1c 01 29 ea 3f 14 48 44 0e f1 b1 bf 0d f6 be 2c c6 31 30 30 37 35 35 20 70 62 61 70 2d 63 6c 69 65 6e 74 00 e6 ca fd d3 01 42 21 5e f6 16 ad 3f de fb 50 4f e3 03 29 30 31 30 30 36 34 34 20 73 61 70 5f 63 6c 69 65 6e 74 2e 70 79 00 fe d1 3a ed c8 40 16 91 59 f6 fb ab 18 64 df 61 5b a2 00 b4 31 30 30 36 34 34 20 73 65 72 76 69 63 65 2d 64 69 64 2e 78 6d 6c 00 52 eb 68 c0 20 ab 65 6d f0 66 b6 40 45 29 ff 37 75 e4 9d 63 31 30 30 36 34 34 20 73 65 72 76 69 63 65 2d 66 74 70 2e 78 6d 6c 00 1b da 88 57 f5 a8 8e 99 70 5d 0a 05 e7 ac b8 d3 b9 35 4c 29 31 30 30 36 34 34 20 73 65 72 76 69 63 65 2d 6f 70 70 2e 78 6d 6c 00 35 1b 4a 41 0a df 97 73 c4 e7 cb c4 81 b1 c3 59 d9 f3 0a 46 31 30 30 36 34 34 20 73 65 72 76 69 63 65 2d 72 65 63 6f 72 64 2e 64 74 64 00 f5 3b e5 d0 52 d2 55 6e c4 07 29 1a f1 c3 76 64 2b 23 aa 3b 31 30 30 36 34 34 20 73 65 72 76 69 63 65 2d 73 70 70 2e 78 6d 6c 00 2b 15 6c 3f 03 81 88 5d 5c 57 da ad 91 81 41 cf f2 6b 26 22 31 30 30 37 35 35 20 73 69 6d 70 6c 65 2d 61 67 65 6e 74 00 4f da ff 1e b7 65 a4 96 6b a5 0d 30 c2 69 73 3e 44 06 21 8c 31 30 30 37 35 35 20 73 69 6d 70 6c 65 2d 65 6e 64 70 6f 69 6e 74 00 59 ca 18 9c e5 0e 46 d0 fc 3f fb ad 0b 4b 6d e1 f6 a5 42 32 31 30 30 37 35 35 20 73 69 6d 70 6c 65 2d 6f 62 65 78 2d 61 67 65 6e 74 00 06 4f 6d 30 b9 eb b2 84 50 91 ae c9 cc be ad c2 21 98 08 c7 31 30 30 37 35 35 20 73 69 6d 70 6c 65 2d 70 6c 61 79 65 72 00 92 68 28 44 d0 f4 ef 5c ef ef 5f 5f 36 be 8b 8a 71 d1 8d 46 31 30 30 37 35 35 20 74 65 73 74 2d 61 64 61 70 74 65 72 00 96 31 
d9 4f e3 11 b4 bb c3 f7 07 26 fb 1e f5 f8 cd 33 fa 7e 31 30 30 37 35 35 20 74 65 73 74 2d 64 65 76 69 63 65 00 a1 e5 08 16 67 4f da 36 f4 82 4d 8d 88 fb 89 82 04 69 5a e1 31 30 30 37 35 35 20 74 65 73 74 2d 64 69 73 63 6f 76 65 72 79 00 ec cc 7c 7e 31 f0 70 b4 91 e7 d1 a0 e0 75 9e c2 9f c0 20 27 31 30 30 37 35 35 20 74 65 73 74 2d 67 61 74 74 2d 70 72 6f 66 69 6c 65 00 a9 73 ae 14 ed 31 81 77 56 e7 0b d4 5b 9c 3b aa 4b 4e a9 30 31 30 30 37 35 35 20 74 65 73 74 2d 68 65 61 6c 74 68 00 d6 b4 37 ed 88 c5 2f c6 28 bd 08 14 28 9b 3c 1d 5d 76 af 1a 31 30 30 37 35 35 20 74 65 73 74 2d 68 65 61 6c 74 68 2d 73 69 6e 6b 00 57 66 5d 2b a6 49 af a1 1b 25 ee 64 1c 0c 49 a7 86 38 a1 62 31 30 30 37 35 35 20 74 65 73 74 2d 68 66 70 00 11 e3 28 e5 4c c8 68 22 04 55 51 ce 38 29 a7 c3 c7 a4 2d f6 31 30 30 36 34 34 20 74 65 73 74 2d 6a 6f 69 6e 00 96 97 95 09 47 40 36 8b 44 cd da 36 0d a4 34 5b 8f d9 9b de 31 30 30 37 35 35 20 74 65 73 74 2d 6d 61 6e 61 67 65 72 00 3f a7 20 5a 04 b6 a1 dc d4 30 ab d1 fd bc ab 21 dd 8b 4e 96 31 30 30 37 35 35 20 74 65 73 74 2d 6d 65 73 68 00 fb f2 47 6b fd 36 15 8f f2 ec b8 d5 ee c2 e1 6f 59 67 52 02 31 30 30 37 35 35 20 74 65 73 74 2d 6e 61 70 00 d5 c7 57 b7 9d e1 1e c7 73 30 cb f8 1d e8 07 af ae 11 2e 45 31 30 30 37 35 35 20 74 65 73 74 2d 6e 65 74 77 6f 72 6b 00 ac c7 df f6 5e 48 56 74 f0 08 35 d6 93 84 88 40 26 e7 b0 6b 31 30 30 37 35 35 20 74 65 73 74 2d 70 72 6f 66 69 6c 65 00 af 1e 23 f7 65 dd ef 38 16 e6 28 b4 06 aa 91 05 93 de c3 ed 31 30 30 37 35 35 20 74 65 73 74 2d 73 61 70 2d 73 65 72 76 65 72 00 dd b1 ef e9 bc 8c b6 84 c1 3e a0 56 4f 26 10 11 c7 b3 2d 86

Header: b'tree 1854' or as UTF8: tree 1854
Content decode replace errors:
 100755 agent.py W�A��%�﬿�X�5�&e�100644 bluezutils.py  D�3)�fyߤ+����+100644 dbusdef.py ����݁��.�V��100644 example-adv-monitor ��{�i�3�n^��o�kO100755 example-advertisement _.�v���)ŅL�`OǗ100755 example-battery-provider "��u�/A�u,U:]K���100644 example-endpoint eh:��6
�s*��(B��L��c100755 get-managed-objects Q%�RG؇dz
taQ�S��={�100755 get-obex-capabilities ��
D%�i� ��aQ�4�G�100755 list-devices �Ul0�
��,�100755 pbap-client ����B!^��?��PO�)0100644 sap_client.py ��:��@�Y���d�a[� �100644 service-did.xml R�h� �em�f�@E)�7u�c100644 service-ftp.xml ڈW����p]
笸ӹ5L)100644 service-opp.xml 5JA
ߗs���ā��Y��
�4[�ٛ�100755 test-manager ?� Z����0�����!݋N�100755 test-mesh ��Gk�6�������oYgR100755 test-nap ��W����s0�����.E100755 test-network ����^HVt5֓��@&�k100755 test-profile �#�e��8�(�������100755 test-sap-server ݱ�鼌���>�VO&dz-�

As you can see, the "content" section contains some bytes that decode as UTF-8 text (the modes and file names) and some that do not (the binary hashes), which is why those show up as replacement characters.

What often causes confusion in Python is that, when displaying a bytes object, Python shows every byte in the printable ASCII range as the corresponding character.

For example, consider the bytes b'\x42\x61\x64':

>>> print(b'\x42\x61\x64')
b'Bad'

This is an artefact of how Python displays the value; the underlying bytes are unchanged. When debugging binary data in Python, it is often worth printing it as a hex string:

>>> print(b'\x42\x61\x64'.hex(' '))
42 61 64

In the output above, some byte values fall in the printable ASCII range, so it looks as though there are strings. Bytes outside that range are shown as hex escapes. For example, if we append \xff, it is still printed as an escape because it cannot be represented by a printable ASCII character:

>>> print(b'\x42\x61\x64\xff')
b'Bad\xff'

There is a good guide on Unicode & Character Encodings at: https://realpython.com/python-encodings-guide/

ukBaz

My question is: why does decoding it not work if git by default (and in this repository) uses UTF-8?

There is your incorrect assumption. The objects compressed and stored in .git/objects are arbitrary data. Often for source directories many of the objects are readable text, but not all of them are.
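
To make the two layers concrete, here is a minimal sketch. It assumes a made-up stand-in for a loose object (a zlib-compressed `blob 5\0test\n` payload written to a temporary file, not a real repository). It shows that text mode fails on the compressed bytes before zlib is ever involved, while binary mode returns the raw bytes that zlib can decompress:

```python
import os
import tempfile
import zlib

# Hypothetical stand-in for a loose object: "blob <size>\0<content>",
# zlib-compressed. The payload is invented for illustration.
payload = b"blob 5\x00test\n"
compressed = zlib.compress(payload)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "object")
    with open(path, "wb") as f:
        f.write(compressed)

    # Text mode tries to decode the *compressed* bytes as UTF-8,
    # which fails before zlib ever sees them.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode hands back the raw bytes untouched.
    with open(path, "rb") as f:
        raw = f.read()

data = zlib.decompress(raw)
header, content = data.split(b"\x00", 1)
print(type(text_mode_error).__name__)
print(header.decode(), content)
```

Whether the decompressed content can then be decoded as UTF-8 is a separate question that depends on the object type.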

I initialized an empty directory with git init. I created a file foo with the content test\n. I committed foo with the commit message commit 1. So far all just readable text, as well as valid UTF-8. I now have four objects in .git/objects. One of them, for example, is a tree object, which contains a SHA-1 hash in binary, not in hex. 20 random binary bytes, which is essentially what a SHA-1 hash is, will almost never be valid UTF-8. I tried it, and found that the probability of being valid UTF-8 is about 10^-5.
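
One way to run that kind of experiment is a simple Monte Carlo estimate. This is only a sketch of the idea, not necessarily how the figure above was obtained; the helper names `is_valid_utf8` and `estimate_valid_fraction` are made up for illustration:

```python
import os


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def estimate_valid_fraction(trials: int, length: int = 20) -> float:
    """Estimate the fraction of random byte strings that are valid UTF-8."""
    hits = sum(is_valid_utf8(os.urandom(length)) for _ in range(trials))
    return hits / trials


# The event is rare, so a small number of trials gives a very noisy
# estimate; increase the trial count for a more stable figure.
print(estimate_valid_fraction(200_000))
```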

Mark Adler
  • I see that commit objects are always encoded in UTF-8, but the hash for tree objects is not. Do you know why? – user18348324 Mar 05 '23 at 21:29
  • Hashes in tree objects are binary because they can be. It's fewer bytes. More importantly, because Linus decided that that's the format he wanted. – Mark Adler Mar 06 '23 at 01:24
  • Are you saying that the tree hashes are not the same 160-bit sha1 hash? – user18348324 Mar 06 '23 at 01:52
  • No. They most certainly are 160-bit SHA-1 hashes. – Mark Adler Mar 06 '23 at 01:55
  • By the way, a commit _can_ in fact contain non UTF-8 data. I just made a commit with the comment being all of the byte values 1..255. I did get a nastygram from git though: "Warning: commit message did not conform to UTF-8." – Mark Adler Mar 06 '23 at 01:57
  • I see. Are you saying it's fewer bytes to store since there's no encoding? If so, how can `git cat-file` grab the contents of a `tree`, but a `zlib` decompression and subsequent decode cannot? – user18348324 Mar 06 '23 at 01:58
  • Yes. The SHA-1 is not encoded as hexadecimal, which would take 40 bytes. In binary it's 20 bytes. (Hexadecimal encoding has nothing whatsoever to do with UTF-8 encoding.) `git cat-file` is _interpreting_ the contents, not "decoding" it, in the sense of recasting a string of bytes as UTF-8. git knows the structure of one of its own trees, and it is electing to show it to you in a human-readable form. – Mark Adler Mar 06 '23 at 02:03
  • So you're saying that `git cat-file` is not possible through Python unless the object is `UTF-8` encoded, which git commits typically adhere to, but not trees? – user18348324 Mar 06 '23 at 02:04
  • What? No. You can trivially write your own `git cat-file` in Python by interpreting the contents yourself. You need to read about the structure of git objects if you want to work with them: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects – Mark Adler Mar 06 '23 at 02:05
  • I read that. You'll see I actually implemented the equivalent of the bottom of that page in Python. – user18348324 Mar 06 '23 at 02:06
  • it is in Ruby.. – user18348324 Mar 06 '23 at 02:11
  • Ah, sorry...... – Mark Adler Mar 06 '23 at 02:13
  • Besides the point. I was just trying to say that it seems like you cannot convert tree object contents into a human-readable one in Python, like `git cat-file` does, but you can with commits, and I wanted to know if you concur, as that seems to be what you've concluded as well. – user18348324 Mar 06 '23 at 02:15
  • I already said NO, I do not concur. You absolutely _can_ convert a tree object to a readable form using Python. – Mark Adler Mar 06 '23 at 02:19
  • Could you please suggest how? – user18348324 Mar 06 '23 at 02:34
  • Interpret the bytes. Display the SHA-1 as hex. It's `tree nnn\0`, followed by a series of these: `mode name\0` and then 20 bytes of SHA-1. – Mark Adler Mar 06 '23 at 03:25
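
The layout described in that last comment can be sketched roughly as follows. This is a minimal illustration, not git's own implementation: the function name `parse_tree` is made up, it assumes 20-byte SHA-1 hashes (a SHA-256 repository would use 32 bytes), it assumes file names are UTF-8, and the demo data is synthetic rather than read from a real repository:

```python
def parse_tree(data: bytes):
    """Parse a decompressed tree object into (mode, name, sha1_hex) tuples."""
    header, body = data.split(b"\x00", 1)
    kind, size = header.split(b" ")
    if kind != b"tree" or int(size) != len(body):
        raise ValueError("not a well-formed tree object")

    entries = []
    pos = 0
    while pos < len(body):
        # Each entry is "mode name\0" followed by 20 raw hash bytes.
        nul = body.index(b"\x00", pos)
        mode, name = body[pos:nul].split(b" ", 1)
        sha = body[nul + 1:nul + 21]
        entries.append((
            mode.decode("ascii"),
            name.decode("utf-8", errors="replace"),  # assumption: UTF-8 names
            sha.hex(),  # show the binary hash as 40 hex digits
        ))
        pos = nul + 21
    return entries


# Synthetic tree with one entry and a made-up hash (bytes 0x00..0x13).
fake_sha = bytes(range(20))
body = b"100644 foo\x00" + fake_sha
data = b"tree %d\x00" % len(body) + body

for mode, name, sha_hex in parse_tree(data):
    print(mode, name, sha_hex)
```

For a real loose object, `data` would be the result of `zlib.decompress()` applied to the file read in binary mode, as in the answers above. Only the text parts (mode and name) are decoded; the hash is never passed through a text codec.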