Calculating size of String in "octets"

Question

I have issues calculating a string length. I have the following string:

30 ctime=1460687405.982514823\n

It is formatted according to the following definition:

An extended header shall consist of one or more records, each constructed as follows:

"%d %s=%s\n", length, keyword, value

The extended header records shall be encoded according to the ISO/IEC10646-1:2000 standard (UTF-8). The length field, blank, equals sign, and newline shown shall be limited to the portable character set, as encoded in UTF-8. The keyword and value fields can be any UTF-8 characters. The length field shall be the decimal length of the extended header record in octets, including the trailing newline.

The issue is the length field.

How do I calculate the size? Do I sum the byte values, or use String.length() on the whole string?

What are octets?

score 1 · Answer 1 · answered Apr 29 '16 at 23:33

An octet is an 8-bit byte.

A String in Java is a sequence of 16-bit UTF-16 code units (char values), so String.length() counts code units, not bytes. What is being asked for is the size of the string after you encode it into UTF-8 bytes (by then, most likely a byte[]). See this question for more details on encoding a Java string into UTF-8.
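To illustrate the difference, here is a minimal sketch comparing String.length() with the UTF-8 byte count. The class and method names (OctetCountDemo, utf8Length) are made up for this example, and the accented text is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

final class OctetCountDemo {
    /** Number of octets the text occupies once encoded as UTF-8 (hypothetical helper). */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Pure ASCII: every char becomes exactly one octet.
        String ascii = "30 ctime=1460687405.982514823\n";
        // "À LA DÉBANDADE": the two accented letters take two octets each in UTF-8.
        String accented = "\u00C0 LA D\u00C9BANDADE";

        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " octets");
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " octets");
        // prints:
        // 30 chars, 30 octets
        // 14 chars, 16 octets
    }
}
```

For ASCII-only records the two numbers agree, which is why String.length() appears to work until a keyword or value contains a non-ASCII character.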

answered Apr 29 '16 at 23:33

mpontillo

Love it when the length value has variable length, and must include the length of itself. If keyword and value is `test` and `X`, should the line be `9 test=X\n`, or `10 test=X\n`? – Andreas Apr 29 '16 at 23:40
That would be `10 test=X\n` by my reading. That will make a good unit test. ;-) – mpontillo Apr 29 '16 at 23:44
The whole tar standard I'm implementing is just a huge mess. Hell, there are at least 5 different tar variations available. – Gala Apr 29 '16 at 23:44
Heh, so I guess when you parse it you just assume it's a one-off and append the newline just because? – mpontillo Apr 29 '16 at 23:47
@Mike No, `9 test=X\n` is 9 ASCII bytes long (`\n` is one byte). `10 test=X\n` is 10 ASCII bytes long. The length value is correct in both cases. – Andreas Apr 29 '16 at 23:49
Ah, now I feel dense. Commented too quickly after just running `echo "10 test=X" | wc -c`. That *is* a fun aspect. ;-) – mpontillo Apr 29 '16 at 23:53
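The point made in these comments, that both `9 test=X\n` and `10 test=X\n` carry a correct length, can be checked with a small sketch. The helper name isSelfConsistent is hypothetical; it only compares the leading decimal number with the UTF-8 size of the whole record and does no other validation.

```java
import java.nio.charset.StandardCharsets;

final class RecordLengthCheck {
    /**
     * Returns true if the decimal number before the first space equals
     * the size of the whole record in octets, newline included.
     * Hypothetical helper for illustration only.
     */
    static boolean isSelfConsistent(String record) {
        int space = record.indexOf(' ');
        if (space <= 0) {
            return false; // no length field before a space
        }
        int octets = record.getBytes(StandardCharsets.UTF_8).length;
        try {
            return Integer.parseInt(record.substring(0, space)) == octets;
        } catch (NumberFormatException e) {
            return false; // length field is not a decimal number
        }
    }

    public static void main(String[] args) {
        System.out.println(isSelfConsistent("9 test=X\n"));  // true
        System.out.println(isSelfConsistent("10 test=X\n")); // true
        System.out.println(isSelfConsistent("11 test=X\n")); // false
    }
}
```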

Andreas · Answer 2 · 2016-04-30T00:41:38.763

An octet is a byte, so you first have to convert the header text to bytes in UTF-8 encoding, and then count the bytes.

You can do that by calling String.getBytes(Charset charset), passing StandardCharsets.UTF_8 as the charset.

Of course, the catch is that the length of the entire header depends on the number of digits needed to write the length itself. The following code starts by assuming that the header length is a 2-digit number, and retries with the recomputed length if that is not the case.

This means that if keyword and value are test and X, the result will be 10 test=X\n, even though 9 test=X\n is also self-consistent and would seem more appropriate.
If keyword and value are A and B, the result will be 6 A=B\n, as it should be, and the length field will grow to 3, 4, 5, ... digits as needed.

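import java.nio.charset.StandardCharsets;

private static byte[] buildExtendedHeader(String keyword, String value) {
    // Everything after the length field: blank, keyword, '=', value, newline.
    byte[] bytes = (' ' + keyword + '=' + value + '\n').getBytes(StandardCharsets.UTF_8);
    int len = bytes.length + 2; // let's assume 2-digit length
    for (;;) {
        // Digits of the current guess, and the total size that guess would produce.
        byte[] lenBytes = Integer.toString(len).getBytes(StandardCharsets.US_ASCII);
        int realLen = lenBytes.length + bytes.length;
        if (len == realLen) {
            // The guess describes its own record: assemble length field + rest.
            byte[] header = new byte[len];
            System.arraycopy(lenBytes, 0, header, 0, lenBytes.length);
            System.arraycopy(bytes, 0, header, lenBytes.length, bytes.length);
            return header;
        }
        len = realLen; // wrong digit count, retry with the recomputed length
    }
}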

TEST

byte[] header = buildExtendedHeader("LIKE A STAMPEDE", "À LA DÉBANDADE");
System.out.printf("%s%n%d octets:", new String(header, StandardCharsets.UTF_8).replace("\n", "\\n"), header.length);
for (byte b : header)
    System.out.printf(" %02x", b);

OUTPUT

36 LIKE A STAMPEDE=À LA DÉBANDADE\n
36 octets: 33 36 20 4c 49 4b 45 20 41 20 53 54 41 4d 50 45 44 45 3d c3 80 20 4c 41 20 44 c3 89 42 41 4e 44 41 44 45 0a
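If the shorter form is preferred for the ambiguous case (9 test=X\n rather than 10 test=X\n), one possible variation is to start from a 1-digit guess instead of a 2-digit one. This is only a sketch of that idea, not part of the code above; the class and method names are made up. Because the guess starts at the smallest conceivable total and only ever grows, the first length that matches is the smallest self-consistent one.

```java
import java.nio.charset.StandardCharsets;

final class ShortestHeaderSketch {
    /** Hypothetical variant: builds the record with the smallest self-consistent length. */
    static byte[] buildShortestExtendedHeader(String keyword, String value) {
        // Everything after the length field: blank, keyword, '=', value, newline.
        byte[] body = (" " + keyword + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);
        int len = body.length + 1; // assume a 1-digit length first
        for (;;) {
            byte[] lenBytes = Integer.toString(len).getBytes(StandardCharsets.US_ASCII);
            int realLen = lenBytes.length + body.length;
            if (realLen == len) {
                byte[] header = new byte[len];
                System.arraycopy(lenBytes, 0, header, 0, lenBytes.length);
                System.arraycopy(body, 0, header, lenBytes.length, body.length);
                return header;
            }
            len = realLen; // guess was too small, retry with the recomputed length
        }
    }

    public static void main(String[] args) {
        byte[] header = buildShortestExtendedHeader("test", "X");
        // prints: 9 test=X\n
        System.out.println(new String(header, StandardCharsets.UTF_8).replace("\n", "\\n"));
    }
}
```

For inputs without this ambiguity, such as the LIKE A STAMPEDE example, both approaches give the same 36-octet record.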

Nice! One more thing. What do you think, could I solve that problem with regular expressions? – Gala Apr 30 '16 at 00:40
Regular expressions have nothing to do with character encoding, so no. – Andreas Apr 30 '16 at 00:42
