
I have been trying for ages to get BitTorrent info hashing to work in Java, but the result is always wrong.

I have narrowed it down to a few lines of code, and I'm 99% sure the problem is in them:

Bencode bencode = new Bencode(Charset.forName("UTF-8"));
byte[] fileBytes = new byte[33237];
Map<String, Object> dict = bencode.decode(fileBytes, Type.DICTIONARY);
Map infoMap = (Map) dict.get("info");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BencodeOutputStream bos = new BencodeOutputStream(baos);
bos.writeDictionary(infoMap);
byte[] hash = DigestUtils.sha1(baos.toByteArray());

I have hardcoded the size of the array to make sure the issue is not caused by trailing zero bytes.

I have tried with both UTF-8 and US-ASCII.

I have tried two different bencoding libraries, so the problem is probably not in the library itself.

Edit: From the spec, it seems that the info_hash is the URL-encoded hash of the info dict. So I tried writing the dictionary into a ByteArrayOutputStream and then SHA-1 hashing the byte[] that the ByteArrayOutputStream holds.

Does the DigestUtils.sha1 method also URL-encode its result? I can't find any information on that.

Jesper
  • When in doubt, debug – Rogue Jul 11 '17 at 19:47
  • I have no idea what you're doing, but I looked up [the spec](https://wiki.theory.org/index.php/BitTorrentSpecification). Is it the `info_hash` you're trying to calculate? It says that should be the value of `info`, not of `pieces` – that other guy Jul 11 '17 at 19:57
  • @thatotherguy It seems that way when reading it, yes. Hasn't worked when I tried though. In your opinion would you assume the info map is already sha1 encoded? Seems that way from reading it but it's all quite vague at the same time. – Jesper Jul 11 '17 at 20:06
  • The info map is bencoded: "urlencoded 20-byte SHA1 hash of the value of the info key from the Metainfo file. Note that the value will be a bencoded dictionary". How do I run your example? You don't provide any input or output data – that other guy Jul 11 '17 at 20:43
  • Thanks. It seems that my code now is decoding the big dictionary, grabbing the decoded info dictionary, encoding it and then doing the sha1 hashing. Doesn't that seem like a reasonable flow to get it working? – Jesper Jul 11 '17 at 20:49
  • Save the string: `d4:infod6:lengthi1e4:name5:a.txt12:piece lengthi32768e6:pieces20:1234567890abcdefghijee` as *mini.txt* * Rename to *mini.torrent* * You have now created a bogus but valid torrent with the info_hash: `831F79C1C8358FCEB75496C3A81E113EA8147F13` * Add a line in your code that, instead of hashing, prints the string * Run the code and compare with the above values – Encombe Jul 12 '17 at 00:03
  • A normal info dict contains binary data, so UTF-8 or ASCII won't do. It must be a type that can handle binary strings. The info_hash should only be URL encoded when sent to a tracker as a HTTP-get announce, NOT when the info_hash is calculated. – Encombe Jul 12 '17 at 00:18
  • @Encombe Thanks a lot for the replies! I created that file and used the above code except for changing the hash type to a `String` and it printed the correct hash. So, as you say, there must be something wrong with the encoding. Because now the file can be encoded in _UTF-8_ but when it has special symbols that's not possible. But how can I possibly do this with a binary string when the library requires an encoding? – Jesper Jul 12 '17 at 18:09
  • Tried it with both the libraries now that I have been using for _bencoding_. The simple example works with both but once I try a real file it's no longer working. The second library only uses _UTF-8_. Should I first try and convert the `byte[]` from the file to _UTF-8_ and then send it to the library or is there any standard _charset_ available that can handle this? – Jesper Jul 12 '17 at 20:18
  • Also tried removing everything except for the info dict in a torrent file and hashing it directly without bencoding but it is still not working. – Jesper Jul 12 '17 at 20:38
  • The value for the obligatory *pieces* key in the *info* dict will contain binary data, so it's a must that the code can handle that. – Encombe Jul 13 '17 at 00:56
  • Possible duplicate of [The torrent info\_hash parameter](https://stackoverflow.com/questions/10191480/the-torrent-info-hash-parameter) – the8472 Jul 20 '17 at 22:56
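
The comments above separate two steps: computing the info_hash (SHA-1 over the bencoded info value) and URL-encoding that 20-byte result only when it goes into a tracker announce. The following is a minimal sketch of the second step, not code from the question. The class name `InfoHashUrlEncoding` and the helper `percentEncode` are hypothetical, the JDK's `MessageDigest` stands in for `DigestUtils`, and the set of characters left unescaped follows the specification linked in the comments.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class InfoHashUrlEncoding {

    // Hypothetical helper: percent-encode raw bytes for a tracker announce.
    // Bytes for 0-9, a-z, A-Z, '.', '-', '_' and '~' stay literal; all others become %XX.
    static String percentEncode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The bencoded info value from the mini.torrent test string in the comments.
        byte[] bencodedInfo =
                "d6:lengthi1e4:name5:a.txt12:piece lengthi32768e6:pieces20:1234567890abcdefghije"
                        .getBytes(StandardCharsets.ISO_8859_1);

        // Step 1: the info_hash is the raw 20-byte SHA-1 digest of those bytes.
        byte[] infoHash = MessageDigest.getInstance("SHA-1").digest(bencodedInfo);

        // Step 2: only when building the announce URL is the digest percent-encoded.
        System.out.println("info_hash=" + percentEncode(infoHash));
    }
}
```

The encoding works on the raw digest bytes, not on the hex string, so the hex form printed by something like `DigestUtils.sha1Hex` is for display and comparison only.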

1 Answer


The problem, as Encombe pointed out, was with the encoding. The Bencode specification talks about byte strings, which suggests that the values are just raw bytes with no character encoding attached.

Both of the libraries I looked at converted all byte strings to some encoding so I wrote a Bencode library that only did the conversion when specifically asked to.

The code above is basically correct but here is the client code I am using now:

public void readManifest() throws Exception {
    // Read the whole .torrent file as raw bytes; no charset is applied here.
    byte[] fileBytes = FileUtils.readFileToByteArray(file);
    ByteArrayInputStream bis = new ByteArrayInputStream(fileBytes);
    BDecoder decoder = new BDecoder(bis, "UTF-8");
    BDict dict = decoder.decodeDict();
    Map<String, Object> valueMap = dict.getValue();
    // Keep the decoded info dictionary so hash() can re-encode it later.
    infoMap = (Map<String, Object>) valueMap.get("info");
}

public String hash() throws Exception {
    // Compute the hash once and cache it.
    if (hash == null) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        BEncoder encoder = new BEncoder(baos, "UTF-8");
        // Re-encode only the info dictionary and hash the resulting bytes.
        encoder.encodeDict(infoMap);
        hash = DigestUtils.sha1Hex(baos.toByteArray());
    }
    return hash;
}
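
For comparison, here is a minimal sketch of a different technique from the decode-then-re-encode flow above: it never converts anything to a `String`, but locates the raw bytes of the `info` value inside the file and hashes that slice directly. Everything in it is an assumption made for illustration. The class `RawInfoHash` and its methods are hypothetical, the JDK's `MessageDigest` replaces `DigestUtils`, and there is no error handling for malformed input. The input is the test string from Encombe's comment on the question, which gives `831F79C1C8358FCEB75496C3A81E113EA8147F13` as the expected info_hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class RawInfoHash {

    // Returns the index just past the bencoded value that starts at index i.
    static int skip(byte[] d, int i) {
        byte c = d[i];
        if (c == 'i') {                      // integer: i<digits>e
            while (d[i] != 'e') i++;
            return i + 1;
        }
        if (c == 'l' || c == 'd') {          // list or dict: values until the closing 'e'
            i++;
            while (d[i] != 'e') i = skip(d, i);
            return i + 1;
        }
        int len = 0;                         // byte string: <length>:<bytes>
        while (d[i] != ':') len = len * 10 + (d[i++] - '0');
        return i + 1 + len;
    }

    // Returns the untouched bytes of the value stored under the top-level "info" key.
    static byte[] rawInfo(byte[] torrent) {
        if (torrent[0] != 'd') throw new IllegalArgumentException("not a bencoded dictionary");
        int i = 1;
        while (torrent[i] != 'e') {
            int keyEnd = skip(torrent, i);
            int valueEnd = skip(torrent, keyEnd);
            String key = new String(torrent, i, keyEnd - i, StandardCharsets.ISO_8859_1);
            if (key.equals("4:info")) return Arrays.copyOfRange(torrent, keyEnd, valueEnd);
            i = valueEnd;
        }
        throw new IllegalArgumentException("no info key found");
    }

    // Upper-case hex SHA-1 of the given bytes.
    static String sha1Hex(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-1").digest(data)) {
                sb.append(String.format("%02X", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] torrent = ("d4:infod6:lengthi1e4:name5:a.txt12:piece lengthi32768e"
                + "6:pieces20:1234567890abcdefghijee").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sha1Hex(rawInfo(torrent)));
    }
}
```

For a real file, the byte array would come from something like `Files.readAllBytes(path)` instead of the literal. Because the slice is hashed exactly as stored, key order and binary `pieces` data cannot be altered on the way.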
Jesper
  • Good. You can verify that it can handle the [special case of an unsorted info-dict](https://stackoverflow.com/questions/19749085/calculating-the-info-hash-of-a-torrent-file/19800109#comment44844341_19800109) by saving `d4:infod4:name5:b.txt6:lengthi1e12:piece lengthi32768e6:pieces20:1234567890abcdefghijee` as *unsort_dict.torrent*. It should have the info_hash: `34FCC6C1ACC8C8A56DE3C2EF20924043CC51685E` – Encombe Jul 21 '17 at 12:20
  • @Encombe I receive that hash, yes. I'm using a LinkedHashMap so everything is ordered the same way as it is read. – Jesper Jul 21 '17 at 12:47
  • Small note: While it is 100% correct that bencoded "strings" actually are arbitrary byte sequences (BEP52 clarifies this) one can use ISO 8859-1 as workaround to treat them as string because it 1:1 maps raw bytes to unicode codepoints (i.e. `char` values in java). But this is a hack because those `String` instances will contain garbled data if it was actually supposed to be UTF8. But still a useful hack if one has to pass around byte data through APIs that expect strings. – the8472 Jul 26 '17 at 09:49
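
The charset behaviour described in the last comment can be checked with a short sketch. This is only an illustration, and the class `Latin1RoundTrip` and the helper `roundTrip` are hypothetical names. It pushes all 256 possible byte values through a `String` and back, once with ISO 8859-1 and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    // Decode bytes into a String with the given charset, then encode them back.
    static byte[] roundTrip(byte[] raw, Charset charset) {
        return new String(raw, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Every possible byte value, similar to what a binary "pieces" string may contain.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        // ISO 8859-1 maps each byte to exactly one char, so nothing is lost: prints true.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1)));

        // UTF-8 replaces malformed sequences with U+FFFD, so the bytes change: prints false.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));
    }
}
```

This is consistent with the behaviour reported in the question: a test torrent with only ASCII content hashes correctly through a UTF-8 based library, while a real file with binary `pieces` data does not.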