I am a bit confused about bencoding.

According to the specification, when I bencode a string I need to use the following format:

length:string

The string spam becomes 4:spam

My question: is 4 the number of characters in the string, or the number of UTF-8 bytes?

For instance, suppose I want to bencode the string gâteau

What number should be specified as the length of this string?

I think I have to specify 7, so the final form should be 7:gâteau

That is because the character â takes 2 bytes in UTF-8, while every other character in this string takes 1 byte.

I have also heard that it is not recommended to store bencoded data in a Java String instance.

In other words, when I bencode a block of data, I should keep it as a byte array and not convert it to a Java String value, to avoid encoding issues.

Are my assumptions correct?

  • From http://stackoverflow.com/tags/bencoding/info: *A byte string (a sequence of bytes, not necessarily characters) is encoded as `<length>:<contents>`. [...] The specification does not deal with encoding of characters outside the ASCII set*. What is unclear? – JB Nizet Jul 14 '15 at 14:30
  • @JBNizet Thank you. Please correct me if I am wrong. If I need to bencode a string with non-ASCII characters, the `length` will be the number of bytes, not characters. And for the string `gâteau` the bencoded form will look like `7:gâteau`, as I described in my question. Am I right? –  Jul 15 '15 at 07:42
  • The specification, according to the text quoted in my comment, doesn't support non-ASCII characters. So you shouldn't be encoding `â` in the first place. But if you do, given that it says that it's a **byte** string, the length should be the number of bytes: 7. That's how I read it. – JB Nizet Jul 15 '15 at 09:07

1 Answer

According to the specification, a bencoded string is a sequence of bytes, and the length you specify must be the number of bytes in that sequence.

And, from the specification: "All character string values are UTF-8 encoded".

So for your case with "gâteau" you should specify 7 as the length, because the character â takes 2 bytes in UTF-8.
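
As a rough sketch in Java of how the byte count comes out to 7: the class name `BencodeStringDemo` and the helper `bencode` are made up for this illustration and are not part of any library. It assumes â is the single precomposed code point U+00E2 (written as an escape so the source file's encoding does not matter), and it builds the result as a `byte[]` rather than a `String`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class BencodeStringDemo {

    // Hypothetical helper: bencodes a byte string as <length>:<contents>,
    // where <length> is the number of bytes, written in ASCII decimal.
    static byte[] bencode(byte[] contents) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prefix = (contents.length + ":").getBytes(StandardCharsets.US_ASCII);
        out.write(prefix, 0, prefix.length);
        out.write(contents, 0, contents.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "gâteau", assuming â is the precomposed code point U+00E2.
        String text = "g\u00e2teau";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] encoded = bencode(utf8);

        System.out.println(text.length());  // 6 (chars)
        System.out.println(utf8.length);    // 7 (UTF-8 bytes)
        System.out.println(encoded.length); // 9 ("7:" plus 7 content bytes)
    }
}
```

Note that `text.length()` counts Java chars (6 here), so it is not the right number for the prefix; the length has to come from the encoded byte array.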

  • Thank you for the detailed explanation :) –  Jul 24 '15 at 13:39
  • I would like to clarify that *"All character string values are UTF-8 encoded"* is a specific restriction only on *character strings* in .torrent (Metainfo) files. It does **not** apply to *bencoded byte strings* in general, which can contain arbitrary raw bytes. – Encombe Dec 30 '17 at 11:23
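
To illustrate why arbitrary byte strings are safer kept in a `byte[]` than in a Java `String`, here is a small sketch. The class name `RawBytesRoundTrip` and the helper `throughString` are invented for the example, and the three input bytes are arbitrary values chosen only because they are not valid UTF-8. In a .torrent file, the `pieces` value (concatenated SHA-1 hashes) is the kind of raw data this affects.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawBytesRoundTrip {

    // Decodes the bytes as UTF-8 into a String, then encodes them back.
    static byte[] throughString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Illustrative input: 0xFF and 0xFE never occur in valid UTF-8.
        byte[] raw = {(byte) 0xFF, (byte) 0xFE, 0x41};
        byte[] back = throughString(raw);

        // The String constructor substitutes a replacement character for
        // malformed input, so the original bytes are not recovered.
        System.out.println(Arrays.equals(raw, back)); // false
    }
}
```

Pure ASCII data survives this round trip unchanged, which is why the problem tends to show up only once real binary values are involved.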