I am a bit confused about bencoding.
According to the specification, when I bencode a string I need to use the following format:
length:string
For example, the string spam becomes 4:spam
My question: is 4 the number of characters in the string, or the number of UTF-8 bytes?
For instance, if I bencode the string gâteau, what number should be specified as its length?
I think I have to specify 7, so the final form should be 7:gâteau
This is because the character â takes 2 bytes in UTF-8, while every other character in this string takes 1 byte.
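To make the difference concrete, here is a small sketch of how I am counting. It assumes the string is encoded as UTF-8 and that â is the single precomposed code point U+00E2; the class and method names are just mine for illustration.

```java
import java.nio.charset.StandardCharsets;

class BencodeLengthCheck {
    // Number of bytes the string occupies once encoded as UTF-8
    // (assumption: UTF-8 is the encoding being used).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "gâteau", written with a Unicode escape so the result does not
        // depend on the encoding of the source file.
        String word = "g\u00e2teau";

        System.out.println(word.length());    // 6 (UTF-16 code units)
        System.out.println(utf8Length(word)); // 7 (UTF-8 bytes)
    }
}
```

So the two candidate lengths for this string are 6 and 7, and my guess is that 7 is the one to use.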
Also, I have heard that it is not recommended to store bencoded data in a Java String instance.
In other words, when I bencode a block of data, I should keep the result as a byte array and not convert it to a Java String, to avoid encoding issues.
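This is roughly what I have in mind for keeping everything as bytes. It is only a sketch under my own assumptions (the length prefix is a byte count, and the text is encoded as UTF-8); bencodeString is a hypothetical helper of mine, not something taken from a library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class BencodeStringSketch {
    // Hypothetical helper: encodes one string as <byte length>:<bytes>
    // and returns the result as raw bytes, never as a String.
    static byte[] bencodeString(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        // The length prefix consists of ASCII digits followed by a colon.
        byte[] prefix = (payload.length + ":").getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(prefix, 0, prefix.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = bencodeString("g\u00e2teau"); // "gâteau"
        System.out.println(encoded.length); // 9: two bytes for "7:" plus 7 payload bytes
    }
}
```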
Are my assumptions correct?