The length of a compressed Java String is not equal to the content-length when it is sent as a WebSocket message

Question

I am trying to reduce bandwidth consumption by compressing the JSON String I send through the WebSocket from my Spring Boot application to the browser client (this is on top of the permessage-deflate WebSocket extension). This scenario uses the following JSON String, which has a length of 383 characters:

{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/signup"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}

To benchmark, I send both the compressed and the uncompressed String from the server like so:

Object response = …;

SimpMessageHeaderAccessor simpHeaderAccessor =
    SimpMessageHeaderAccessor.create(SimpMessageType.MESSAGE);
simpHeaderAccessor.setSessionId(sessionId);
simpHeaderAccessor.setContentType(new MimeType("application", "json",
    StandardCharsets.UTF_8));
simpHeaderAccessor.setLeaveMutable(true);
// Sends the uncompressed message.
messagingTemplate.convertAndSendToUser(sessionId, uri, response,
    simpHeaderAccessor.getMessageHeaders());

ObjectMapper mapper = new ObjectMapper();
String jsonString;

try {
    jsonString = mapper.writeValueAsString(response);
}
catch(JsonProcessingException e) {
    jsonString = response.toString();
}

log.info("The payload is application/json.");
log.info("uncompressed payload (" + jsonString.length() + " character):");
log.info(jsonString);

String lzStringCompressed = LZString.compress(jsonString);
simpHeaderAccessor = SimpMessageHeaderAccessor.create(SimpMessageType.MESSAGE);
simpHeaderAccessor.setSessionId(sessionId);
simpHeaderAccessor.setContentType(new MimeType("text", "plain",
    StandardCharsets.UTF_8));
simpHeaderAccessor.setLeaveMutable(true);
// Sends the compressed message.
messagingTemplate.convertAndSendToUser(sessionId, uri, lzStringCompressed,
    simpHeaderAccessor.getMessageHeaders());

log.info("The payload is text/plain.");
log.info("compressed payload (" + lzStringCompressed.length() + " character):");
log.info(lzStringCompressed);

Which logs the following lines in the Java console:

The payload is application/json.
uncompressed payload (383 character):
{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/signup"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}
The payload is text/plain.
compressed payload (157 character):
??????????¼??????????????p??!-??7??????????????????????????????????u??????????????????????·}???????????????????????????????????????/?┬R??b,??????m??????????

The browser then receives the two messages sent by the server, and they are captured by this JavaScript:

stompClient.connect({}, function(frame) {
    stompClient.subscribe(stompClientUri, function(payload) {
        try {
            JSON.parse(payload.body);
            console.log("The payload is application/json.");
            console.log("uncompressed payload (" + payload.body.length + " character):");
            console.log(payload.body);

            payload = JSON.parse(payload.body);
        } catch (e) {
            try {
                payload = payload.body;
                console.log("The payload is text/plain.");
                console.log("compressed payload (" + payload.length + " character):");
                console.log(payload);

                var decompressPayload = LZString.decompress(payload);
                console.log("decompressed payload (" + decompressPayload.length + " character):");
                console.log(decompressPayload);

                payload = JSON.parse(decompressPayload);
            } catch (e) {
            } finally {
            }
        } finally {
        }
    });
});

Which displays the following lines in the browser's debug console:

The payload is application/json.
uncompressed payload (383 character):
{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}
The payload is text/plain.
compressed payload (157 character):
ᯡࠥ䅬ࢀጨᎡ乀ஸ̘͢¬ߑ䁇啰˸⑱ᐣ䱁ሢ礒⽠݉ᐮ皆⩀p瑭漦!-䈠ᷕ7ᡑ刡⺨狤灣મ啃嵠ܸ䂃ᡈ硱䜄ቀρۯĮニᴴဠ䫯⻖֑点⇅劘畭ᣔ奢⅏㛥⡃Ⓛ撜u≂㥋╋ၲ⫋䋕᪒丨ಸ䀭䙇Ꮴ吠塬昶⬻㶶Т㚰ͻၰú}㙂᥸沁⠈ƹ⁄᧸㦓ⴼ䶨≋愐㢡ᱼ溜涤簲╋㺮橿䃍砡瑧ᮬ敇⼺ℙ滆䠢榵ⱀ盕ີ‣Ш眨રą籯/ሤÂR儰Ȩb,帰Ћ愰䀥․䰂m㛠ளǀ䀭❖⧼㪠Ө柀䀠 
decompressed payload (383 character):
{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}

At this point I can verify that whatever String value my Spring Boot application compresses, the browser is able to decompress it and get the original String back. There is a problem though. When I checked in the browser debugger whether the size of the transferred message was actually reduced, it indicated that it was not.

Here is the raw uncompressed message (598B):

a["MESSAGE destination:/user/session/broadcast
content-type:application/json;charset=UTF-8
subscription:sub-0
message-id:5lrv4kl1-1
content-length:383

{"headers":{},"body":{"message":{"errors":{"password":"Password length must be at least 8 characters.","retype":"Retype Password cannot be null.","username":"Username length must be between 6 to 64 characters."},"links":[],"success":false,"target":{"password":"","retype":"","username":""}},"target":"/user/session/sign-up"},"statusCode":"UNPROCESSABLE_ENTITY","statusCodeValue":422}

While this is the raw compressed message (589B):

a["MESSAGE destination:/user/session/broadcast
content-type:text/plain;charset=UTF-8
subscription:sub-0
message-id:5lrv4kl1-2
content-length:425

á¯¡à ¥ä¬à¢á¨á¡ä¹à®¸ÌÍ¢Â¬ßäå°Ë¸â±á£ä±á¢ç¤â½Ýá®çâ©pçæ¼¦!-ä á·7á¡å¡âº¨ç¤ç£àª®ååµÜ¸äá¡ç¡±äáÏÛ¯Ä®ãá´´áä«¯â»Öç¹âåçá£å¥¢âã¥â¡âæuâã¥âá²â«äáªä¸¨à²¸ääá¤åå¡¬æ¶â¬»ã¶¶Ð¢\u2029ã°Í»á°Ãº}ãá¥¸æ²âÆ¹âá§¸ã¦â´¼ä¶¨âæã¢¡á±¼æºæ¶¤ç°²âãº®æ©¿äç¡ç§á®¬æâ¼ºâæ»ä¢æ¦µâ±çàºµâ£Ð¨ç¨àª°Äç±¯/á¤ÃRå°È¨b,å¸°Ðæ°ä¥â¤ä°mãà®³Çäââ§¼ãª Ó¨æä  \u0000"]

The debug console indicates that the uncompressed message was transferred with a size of 598B, with 383 characters as the message payload's size (indicated by the content-length header). The compressed message, on the other hand, was transferred with a total size of 589B, 9B smaller than the uncompressed one, with 425 characters as the message payload's size. I have several questions:

Is the content-length of the STOMP message indicated in bytes, or in characters?
Why is the content-length of the uncompressed message, which is 383, smaller than that of the compressed message, which is 425?
Does this mean that reducing the character length does not necessarily reduce the size?
Why is the content-length of the compressed message, which is 425, not the same as the value returned in the Java console (using lzStringCompressed.length()), which is 157, considering that the uncompressed message was transferred with a content-length of 383, which is the same length as in the Java console? Both are also transferred with charset=UTF-8 encoding.
Why is the content-length of the compressed message, which is 425, not the same as the value returned in the Java console (using lzStringCompressed.length()), which is 157, while the JavaScript code payload.length returns 157, not 425?
If it really gets bloated during the transfer, why does the message with application/json remain unaffected while only the text/plain one gets bloated?

While the 9B difference is still a difference, I am reconsidering whether the overhead of compressing/decompressing the message is worth keeping. I have to test other String values for that.

Can you please indicate which Java library your `LZString` comes from? — jccampanero, Sep 21 '20 at 09:58
Compression algorithms do not always reduce the size of their input. A very simple example would be deflate-ing the string "ABCDEF", which turns its 6 bytes into 14 bytes. Also, smaller inputs do not imply smaller outputs: compare the former example with deflate-ing the string "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", which is way longer at 50 bytes but can be compressed to just 12 bytes. If your payloads are going to be short, heterogeneous strings (rather than long ones with lots of repetition), compression might not be worth it. — walen, Sep 21 '20 at 10:26
@jccampanero It is from a project that implements LZW compression. The JavaScript version is here: https://github.com/pieroxy/lz-string. The Java version is here: https://github.com/rufushuang/lz-string4java — Gideon, Sep 21 '20 at 10:40
Where does it establish that the data and connection is to be "binary", not JSON nor plain text nor utf-8? — Rick James, Sep 23 '20 at 00:01
@RickJames I have not configured the WebSocket to convert the String to binary before sending. IIRC the WebSocket will internally convert the POJO into a valid JSON string if the `content-type` is `application/json`; otherwise it will be sent to the client as a plain string (I am not sure, but most probably via the object's `.toString()` method). I also tried sending a plain string (with a `content-type` of `text/plain;charset=UTF-8`) to check whether that too would be bloated, but it was sent as is. I wonder if it's the Japanese and Chinese glyphs that made it bloated, hmmmm. — Gideon, Sep 23 '20 at 04:40
Just a side note: if you send gzipped content together with `"Content-Encoding: gzip"`, I believe you can omit the decompression in the browser. — deblocker, Sep 23 '20 at 05:29
Do _not_ use "string" or "utf-8" _anywhere_ for an index. It _must_ be binary throughout. But JSON has no "binary", so you _must_ convert to something. BASE64 is useful for such. — Rick James, Sep 23 '20 at 22:19

jccampanero · Answer 1 · 2020-09-26T21:48:44.777

All the questions are closely related.

Is the content-length of the STOMP message indicated in bytes, or in characters?

As you can see in the STOMP specification:

All frames MAY include a content-length header. This header is an octet count for the length of the message body....

From a STOMP perspective the body is a byte array and the headers content-type and content-length determine what the body contains and how it should be interpreted.
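
To make the octet count concrete, here is a minimal sketch in plain Java. The class and method names (StompFrameSketch, contentLength, messageFrame) are hypothetical and only for illustration; this is not how Spring builds its frames, and the frame layout is simplified (a real MESSAGE frame also carries subscription and message-id headers). It only shows that content-length is derived from the encoded bytes of the body, not from String.length().

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class StompFrameSketch {

    // Octet count of the body once it is encoded as UTF-8.
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    // Simplified MESSAGE frame: command, headers, blank line, body, NUL terminator.
    // Assumption: headers are reduced to the three that matter for this example.
    static byte[] messageFrame(String destination, String contentType, String body) {
        byte[] bodyBytes = body.getBytes(StandardCharsets.UTF_8);
        String head = "MESSAGE\n"
                + "destination:" + destination + "\n"
                + "content-type:" + contentType + "\n"
                + "content-length:" + bodyBytes.length + "\n" // octets, not chars
                + "\n";
        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(bodyBytes, 0, bodyBytes.length);
        out.write(0); // frame terminator
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String ascii = "{\"success\":false}"; // 17 chars, all ASCII
        String nonAscii = "\u1BE1\u0825p";      // 3 chars: two 3-byte chars plus one ASCII char
        System.out.println(ascii.length() + " chars -> content-length " + contentLength(ascii));
        System.out.println(nonAscii.length() + " chars -> content-length " + contentLength(nonAscii));
    }
}
```

The ASCII body reports 17 for both values, while the three-char non-ASCII body reports a content-length of 7.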

Why is the content-length of the uncompressed message, which is 383, smaller than that of the compressed message, which is 425?

Because of the conversion to UTF-8 which is carried out when you send the information to the client in your STOMP server.

You have a message, a String, and this message is composed of a series of characters.

Without going into great detail (there are excellent answers elsewhere if you need further information), internally every char in Java is a UTF-16 code unit.

To represent these Unicode code units in a certain character set, UTF-8 in your case, a variable number of bytes may be required, from one to four in your specific case.

In the case of the uncompressed message, you have 383 chars, pure ASCII, which will be encoded to UTF-8 with one byte per char. This is why you obtain the same value in the content-length header.

But this is not the case for the compressed message: when you compress your message, you get an arbitrary sequence of data, represented here as 157 chars (UTF-16 code units) with arbitrary values. That is less data than the original message (157 code units of 16 bits each, so 314 bytes). But then you encode it in UTF-8. Some of these 157 chars will be represented with one byte, as was the case with the original message, but because the values are arbitrary, many of them need two or three bytes (a surrogate pair, which counts as two chars, needs four). This is why you end up with more bytes than for the uncompressed message.
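
The following sketch shows how the UTF-8 width depends on the value of each character. The class and method names are made up for this example, and it assumes a well-formed String (unpaired surrogates, which a compressor emitting arbitrary 16-bit values may produce, are encoded differently by getBytes and are not handled here).

```java
import java.nio.charset.StandardCharsets;

class Utf8Widths {

    // counts[n] = number of code points that need n bytes in UTF-8 (n = 1..4).
    static int[] widthHistogram(String s) {
        int[] counts = new int[5];
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) counts[1]++;         // ASCII
            else if (cp < 0x800) counts[2]++;   // e.g. Latin-1 supplement, Greek, Cyrillic
            else if (cp < 0x10000) counts[3]++; // rest of the BMP, e.g. CJK
            else counts[4]++;                   // supplementary planes (surrogate pairs)
        });
        return counts;
    }

    // Total UTF-8 size derived from the histogram.
    static int utf8Bytes(String s) {
        int[] c = widthHistogram(s);
        return c[1] + 2 * c[2] + 3 * c[3] + 4 * c[4];
    }

    public static void main(String[] args) {
        String sample = "p\u00AC\u1BE1"; // one 1-byte, one 2-byte and one 3-byte char
        int[] c = widthHistogram(sample);
        System.out.println("1-byte: " + c[1] + ", 2-byte: " + c[2]
                + ", 3-byte: " + c[3] + ", 4-byte: " + c[4]);
        System.out.println(sample.length() + " chars, " + utf8Bytes(sample) + " bytes, check: "
                + sample.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Running the same histogram over the compressed String on the server would show how the 157 chars are distributed across the width classes.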

Does this mean that reducing the character length does not necessarily reduce the size?

In general, compressing your data gives you a smaller payload.

If there is enough information to make compression worthwhile, and you are able to send the compressed data as raw binary (similar to when a server sends a response with Content-Encoding: gzip or deflate), it could bring you a great benefit.

But if the client library can only handle text messages and not binary ones, as is the case with SockJS for instance, the encoding problem may, as you can see, actually give you worse results.

To mitigate the problem you can first try to encode the compressed bytes in an intermediate text encoding such as Base64, which yields 4 output characters for every 3 input bytes, that is, roughly 1.33 times the number of compressed bytes: if this value is less than the number of bytes without compression, compressing the message may be worth it.
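
A rough way to check that trade-off is sketched below. It deliberately swaps in java.util.zip.Deflater from the JDK instead of LZString, so the numbers will differ from what lz-string produces; the class and method names are hypothetical. Base64 output is pure ASCII, so its character count equals its byte count on the wire.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.Deflater;

class Base64Tradeoff {

    // Compresses the input with the JDK's Deflater (zlib format).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Length of padded Base64 output for n input bytes: 4 chars per 3-byte group.
    static int base64Length(int n) {
        return 4 * ((n + 2) / 3);
    }

    // True if compress-then-Base64 is smaller than the plain UTF-8 bytes.
    static boolean worthCompressing(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder().encodeToString(deflate(raw));
        return encoded.length() < raw.length;
    }

    public static void main(String[] args) {
        String json = "{\"errors\":{\"password\":\"Password length must be at least 8 characters.\"}}";
        byte[] raw = json.getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("Base64 chars:     " + base64Length(compressed.length));
        System.out.println("worth it:         " + worthCompressing(json));
    }
}
```

For short payloads the Base64 overhead can easily cancel out the savings, so it is worth measuring with representative messages before committing to this approach.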

In any case, as indicated in the specification, STOMP is text based but also allows for the transmission of binary messages. Also, it indicates that the default encoding for STOMP is UTF-8, but it supports the specification of alternative encodings for message bodies.

If you are using stomp-js, as your code suggests, it seems possible to process binary messages as well, as the documentation indicates (please be aware that I have not used this library).

Basically, your server must send the raw bytes with a content-type header set to application/octet-stream.

This information can be then processed in the client side by the library with something similar to this:

    // within message callback
    if (message.headers['content-type'] === 'application/octet-stream') {
      // message is binary
      // call message.binaryBody 
    } else {
      // message is text
      // call message.body
    }

If this works and you can send the compressed information this way, then, as indicated previously, the compression could bring you a great benefit.

Why is the content-length of the compressed message, which is 425, not the same as the value returned in the Java console (using lzStringCompressed.length()), which is 157, considering that the uncompressed message was transferred with a content-length of 383, which is the same length as in the Java console? Both are also transferred with charset=UTF-8 encoding.

Consider the Javadoc of the length method of the String class:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.

As you can see, the length method gives you the number of UTF-16 code units in the String, whereas the content-length header gives you the number of bytes required to represent it in UTF-8, as indicated previously.

In fact, calculating the length of the string could be a tricky task.
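
As an illustration of why it is tricky, the sketch below (hypothetical class name, JDK calls only) prints three different "lengths" for the same String: UTF-16 code units, Unicode code points, and UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

class StringLengths {

    static int codeUnits(String s) {
        return s.length(); // UTF-16 code units, what String.length() reports
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length()); // Unicode code points
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length; // bytes on the wire
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",          // ASCII only
            "\u00E9",       // U+00E9, Latin small letter e with acute
            "\u6F22",       // U+6F22, a CJK ideograph
            "\uD83D\uDE00"  // U+1F600, stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.println(codeUnits(s) + " code units, "
                    + codePoints(s) + " code points, "
                    + utf8Bytes(s) + " UTF-8 bytes");
        }
    }
}
```

Only for pure ASCII do all three numbers agree, which is exactly the situation of the uncompressed JSON message.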

Why is the content-length of the compressed message, which is 425, not the same as the value returned in the Java console (using lzStringCompressed.length()), which is 157, while the JavaScript code payload.length returns 157, not 425?

Because, as you can see in the documentation, length in JavaScript also indicates the length of the String object in UTF-16 code units:

The length property of a String object contains the length of the string, in UTF-16 code units. length is a read-only data property of string instances.

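If it really gets bloated during the transfer, why does the message with application/json remain unaffected while only the text/plain one gets bloated?

As mentioned above, it has nothing to do with the Content-Type but with the encoding of the information: the JSON message is pure ASCII, so every char takes exactly one byte in UTF-8, while the compressed text contains many non-ASCII chars that take more.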
