
I'm writing some Java classes to read information from Git objects. Every class works the same way: the file is located using the repo path and the hash, then it is opened, inflated and read a line at a time. This works very well for blobs and commits, but somehow the inflating doesn't work for tree objects.

The code I use to read the files is the same everywhere:

FileInputStream fis = new FileInputStream(path);
InflaterInputStream inStream = new InflaterInputStream(fis);
BufferedReader bf = new BufferedReader(new InputStreamReader(inStream));

and it works without issues for every object except trees. When I try to read a tree this way I get this:

tree 167100644 README.mdDRwJiU��#�%?^>n��40000 dir1*�j4ކ��K-�������100644 file1�⛲��CK�)�wZ���S�100644 file2�⛲��CK�)�wZ���S�100644 file4�⛲��CK�)�wZ���S�

It seems that the file names and the octal mode are decoded the right way, while the hashes aren't (and I didn't have any problem decoding the other hashes with the above code). Is there some difference between the encoding of the hashes in tree objects and in the other git objects?

frollo

2 Answers


The core of the problem is that there are two encodings inside a Git tree file (and this isn't very clear from the documentation). Most of the file is ASCII text, which can be read with whatever you like, but the hashes are not text-encoded: they are simply raw bytes.

Since there are two different encodings, the best solution is to read the file byte by byte, keeping in mind what's where.

My solution (I'm only interested in the name and hashes of the contents, so the rest is simply thrown away):

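  //Needs java.io.FileInputStream and java.util.zip.InflaterInputStream
  FileInputStream fis = new FileInputStream(this.filepath);
  InflaterInputStream inStream = new InflaterInputStream(fis);
  int i = -1;
  //First line ("tree <size>"): 0-terminated; "> 0" also stops at end of stream (-1)
  while((i = inStream.read()) > 0){
  }

  //Content data: one entry per iteration
  while((i = inStream.read()) != -1){
    //Permission bytes: the read above already consumed the first one
    while((i = inStream.read()) != 0x20 && i != -1){ //0x20 is the space char
    }

    //Filename: 0-terminated
    StringBuilder filename = new StringBuilder();
    while((i = inStream.read()) > 0){
      filename.append((char) i); //assumes ASCII file names
    }

    //Hash: 20 bytes long, can contain any value, the only way
    // to be sure is to count the bytes
    StringBuilder hash = new StringBuilder();
    for(int count = 0; count < 20 ; count++){
      i = inStream.read();
      if(i == -1) break; //truncated file
      //Two hex digits per byte: Integer.toHexString would drop leading zeros
      hash.append(String.format("%02x", i));
    }
  }
  inStream.close();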
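
The same idea can also be written so that it keeps the mode and fails loudly on a truncated file. This is only a sketch: `TreeEntries`, `readUntil` and `parse` are names made up for this example, and decoding the file names as UTF-8 is an assumption (Git stores them as raw bytes without declaring an encoding).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Git library.
final class TreeEntries {
    // Reads bytes up to (not including) the delimiter.
    // Returns null if the stream is already at its end.
    static byte[] readUntil(InputStream in, int delimiter) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b = in.read();
        if (b == -1) return null;
        while (b != -1 && b != delimiter) {
            buf.write(b);
            b = in.read();
        }
        if (b == -1) throw new IOException("truncated tree object");
        return buf.toByteArray();
    }

    // Parses an already inflated tree object into "mode name hash" strings.
    static List<String> parse(InputStream in) throws IOException {
        List<String> entries = new ArrayList<>();
        // Skip the "tree <size>" header, which ends with a 0 byte.
        if (readUntil(in, 0) == null) throw new IOException("empty object");
        byte[] mode;
        while ((mode = readUntil(in, ' ')) != null) {
            byte[] name = readUntil(in, 0);
            if (name == null) throw new IOException("truncated tree object");
            // The hash is 20 raw bytes: count them, never search for a terminator.
            StringBuilder hash = new StringBuilder(40);
            for (int k = 0; k < 20; k++) {
                int b = in.read();
                if (b == -1) throw new IOException("truncated tree object");
                hash.append(String.format("%02x", b)); // keeps leading zeros
            }
            entries.add(new String(mode, StandardCharsets.US_ASCII) + " "
                    + new String(name, StandardCharsets.UTF_8) + " " + hash);
        }
        return entries;
    }
}
```

It would be called with the inflated stream, for example `TreeEntries.parse(new InflaterInputStream(new FileInputStream(path)))`.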
frollo

OIDs are stored raw in trees, not as text, so the answer to your question as asked in the title is "you're already doing it", and the answer to your question in the text is "yes."

To answer a "why do it that way?" follow-up: it has its upsides and downsides, and you hit a downside. There's not much point talking about it; the pain/gain ratio on any change to that decision would be horrendous.

You wrote: "and read a line at a time."

Don't Do That. One upside of the store-as-binary call is it breaks code that relies on never encountering an embedded newline much, much faster than would otherwise be the case. I recommend "if you misuse it or misunderstand it, it should break as fast as possible" as an excellent design rule to follow, right along with "be conservative in what you send, and liberal in what you accept".
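
To make the failure concrete, here is a small sketch of why raw hash bytes don't survive a `Reader`: decoding arbitrary bytes as text replaces anything that isn't valid in the charset, and a line-based reader additionally splits on any byte that happens to be a newline. `RoundTrip` and `survivesTextDecoding` are names made up for this example, and UTF-8 is only an assumed charset (an `InputStreamReader` without an explicit charset uses the platform default).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any Git library.
final class RoundTrip {
    // Returns true if the bytes come back unchanged after being
    // decoded to a String and encoded again.
    static boolean survivesTextDecoding(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, asText.getBytes(StandardCharsets.UTF_8));
    }
}
```

ASCII parts such as `100644 file1` pass this check, while a byte like `0xff`, which is never valid UTF-8, does not, and an object ID can contain any byte value.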

jthill
  • So I should read a character at a time? Your answer isn't exactly clear on what the problem is... – frollo Feb 03 '17 at 10:16