UTF-16 string compression implementation

Asked Aug 06 '20 at 12:39

Active Aug 06 '20 at 12:44

Viewed 107 times

C language/compression algorithm noob here, apologies in advance.

I am looking into a UTF-16 string compression algorithm based on Lempel-Ziv, as explained here: http://www.unicode.org/notes/tn31/

According to the performance figures for the implementation (https://www.unicode.org/notes/tn31/#Performance), a 1014-byte string should compress to about 560 bytes (about 60%).

However, I downloaded the sample C code (https://www.unicode.org/notes/tn31/utf16_compressor.tar.gz) and tested it by compressing a string of length 1290 (I added a print statement to show the input and output lengths), but the output length after compression is 3018. Is there something I am missing, or am I misinterpreting the output length? In the code, the output buffer of the compression function is an unsigned char (1-byte) array, so does the 3018 actually mean 3018 bytes?
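To make the units explicit, here is a minimal sketch of the arithmetic involved. The helper names `utf16_bytes` and `ratio_percent` are hypothetical and not part of the tn31 sample code, and whether 1290 counts UTF-16 code units or bytes is an assumption that has to be checked against the sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the tn31 sample): number of bytes
   occupied by `units` UTF-16 code units (2 bytes each). */
size_t utf16_bytes(size_t units)
{
    /* Guard against overflow in the multiplication below. */
    assert(units <= SIZE_MAX / sizeof(uint16_t));
    return units * sizeof(uint16_t);
}

/* Output size as a percentage of the input size, both in bytes.
   Values above 100 mean the "compressed" data is larger than the input. */
double ratio_percent(size_t in_bytes, size_t out_bytes)
{
    if (in_bytes == 0)
        return 0.0;
    return 100.0 * (double)out_bytes / (double)in_bytes;
}
```

If 1290 is a count of code units, `utf16_bytes(1290)` gives 2580 bytes, and `ratio_percent(2580, 3018)` is roughly 117, so the output would be larger than the input. If 1290 is already a byte count, the ratio is even worse. Either way, both sizes need to be in bytes before comparing them.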

edited Aug 06 '20 at 12:44

Jabberwocky


asked Aug 06 '20 at 12:39

user3689913


It's possible for the output to be larger than the input if the input is not very compressible. – Ian Abbott Aug 06 '20 at 14:39
Thanks @IanAbbott. I tried a string consisting of the same character, e.g. ttttttttttttttttttttttttt, and yes, the compressed output was indeed smaller than the input. Thanks for the pointer. – user3689913 Aug 06 '20 at 16:09
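The effect described in these comments can be reproduced with a much simpler scheme. The sketch below is a toy byte-level run-length encoder, not the Lempel-Ziv based algorithm from tn31; `rle_encode` is a hypothetical name used only for illustration. It emits a (count, byte) pair per run, so repetitive input shrinks while input without repeats doubles in size.

```c
#include <assert.h>
#include <stddef.h>

/* Toy run-length encoder (NOT the tn31 algorithm).
   Writes (count, byte) pairs to `out`, with each run capped at 255.
   Returns the number of bytes written, or 0 if `out` is too small
   (0 is also returned for empty input). */
size_t rle_encode(const unsigned char *in, size_t in_len,
                  unsigned char *out, size_t out_cap)
{
    size_t i = 0; /* read position in `in` */
    size_t o = 0; /* write position in `out` */

    assert(out != NULL);
    assert(in != NULL || in_len == 0);

    while (i < in_len) {
        unsigned char value = in[i];
        size_t run = 1;

        /* Extend the run while the next byte matches. */
        while (i + run < in_len && in[i + run] == value && run < 255)
            run++;

        if (o + 2 > out_cap)
            return 0; /* not enough room for another pair */

        out[o++] = (unsigned char)run;
        out[o++] = value;
        i += run;
    }
    return o;
}
```

With this toy scheme, a run of 25 identical bytes encodes to 2 bytes, while the 6 distinct bytes of "abcdef" encode to 12 bytes. Real compressors behave the same way in principle: input with little redundancy can come out larger than it went in.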


0 Answers