3

I have a JavaScript string that contains characters that have a charCode greater than 255.

I want to be able to encode/decode that string to and from another string whose characters all have a charCode less than or equal to 255.

There is no restriction on the characters (e.g. they can be non-printable).

I want a solution that is as fast as possible and that produces a string as small as possible.

It must also work for any UTF-8 character.

I found out that encodeURI does exactly that, but it seems to take a lot of space.

encodeURI('ĉ') === "%C4%89" // 6 bytes...

Is there anything better than encodeURI?

RainingChain
  • Do you have any other requirements on the encoding, other than that there is no charCode greater than 255? Is it allowed to have quotation marks, spaces, non-printable characters, NUL characters? – Paul Jun 16 '16 at 19:22
  • No other requirements. The data is sent as binary. – RainingChain Jun 16 '16 at 19:24
  • Fast and as small as possible are somewhat mutually exclusive. You could try LZW compression of the string. Just how large is the string you want to compress, and why do you need to compress it? E.g. if it is for a GET request, perhaps you could use a POST request instead, which would transmit the bytes quite effectively. – Andrew Morton Jun 16 '16 at 19:32
  • You could convert each character's charcode to base 255 and then delimit them with the one unused character. – Paul Jun 16 '16 at 19:32
  • @AndrewMorton I'm using a compression library that encodes an object into a binary buffer. That library assumes each character of the strings within the object fits in 1 byte. – RainingChain Jun 16 '16 at 19:35
  • @RainingChain Oh sorry, – Bálint Jun 16 '16 at 19:39
  • @RainingChain It may be time to consider using a different compression library. Or [How to convert a String to Bytearray](http://stackoverflow.com/questions/6226189/how-to-convert-a-string-to-bytearray). Or [String compression in JavaScript](http://stackoverflow.com/q/4570333/1115360). – Andrew Morton Jun 16 '16 at 19:39
  • @RemcoGerlich Is there a way to get the binary representation of a UTF-8 string in JS? – RainingChain Jun 16 '16 at 19:48
  • @RainingChain The current marked answer separates `á` (a valid ASCII character) into `á`, which is 4 bytes. I thought that was your problem – Bálint Jun 16 '16 at 20:43
  • @RainingChain Do you *have* to end up with a string, or would an array of bytes be usable? – Andrew Morton Jun 16 '16 at 20:52
  • "UTF-8 character": Your terminology is a bit off and could be standing in the way. UTF-8 is an encoding for the Unicode character set. UTF-16 is a different encoding for Unicode. It happens to be the one that JavaScript (and Java, .NET …) uses. UTF-16 could have 16 or 32 bits per character. UTF-8 could have 8, 16, 24 or, 32 bits per character. – Tom Blodget Jun 17 '16 at 05:28

3 Answers

2

What you want to do is encode your string as UTF-8. Googling for how to do that in JavaScript, I found http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html , which gives:

function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}

function decode_utf8( s ) {
  return decodeURIComponent( escape( s ) );
}

or in short, almost exactly what you found already, plus unescaping the '%xx' codes to a byte.
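
As a quick check against the 6-byte encodeURI output from the question, here is a small round-trip sketch using the two functions above:

encode_utf8('ĉ');                        // "Ä\x89", two characters, each with charCode <= 255
decode_utf8(encode_utf8('ĉ')) === 'ĉ';   // true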

RemcoGerlich
  • Hmm, I tried a long string and it output a character which is NOT in the extended ASCII table – Bálint Jun 16 '16 at 20:22
  • 1
    what is an "extended ASCII table"? – RemcoGerlich Jun 16 '16 at 21:16
  • The extended ASCII table is the ASCII table that contains the char codes from 0-255 instead of 0-127 (which is the normal ASCII table) – Bálint Jun 16 '16 at 21:17
  • 1
    There is no such thing as THE ascii table that contains the char codes from 0-255. There are many many different such tables (not called ASCII though), for different languages and such. ISO 8859 ones (like latin-1), Windows codepages, etc. There is only one ASCII table and that is the one from 0 to 127. – RemcoGerlich Jun 16 '16 at 21:20
  • @Bálint: that table is wrong. You could just as well link to http://cs.stanford.edu/people/miles/iso8859.html , or https://en.wikipedia.org/wiki/Windows-1252#Code_page_layout , or whatever. I don't actually know what encoding that page you linked to is, but it's not "extended ASCII". – RemcoGerlich Jun 17 '16 at 06:48
  • Ah, your table is in fact https://en.wikipedia.org/wiki/Code_page_437 . That's from the original DOS PCs, man! The only relevance they still have is that you can still use them to type in special characters in Windows (for compatibility with DOS), otherwise that encoding is obsolete. – RemcoGerlich Jun 17 '16 at 06:51
  • Note: UTF-8 is a dynamic length encoding which makes sense for Unicode with up to 4 bytes per character. JavaScript strings however have only 2 bytes per character, for which UTF-8 produces 1, 2 or 3 byte encodings. There might be shorter encodings depending on OP's strings. – le_m Jun 17 '16 at 14:05
1

You can get the character code of a character with .charCodeAt(position). Using this, you can split one character into multiple characters.

First, get the char code of every character by looping through the string. Create a temporary empty string and, while the current character's char code is higher than 255, subtract 255 from it and append a ÿ (char code 255, the 256th character of the extended ASCII table). Once it is at or below 255, use String.fromCharCode(charCode) to convert it back to a character, put it at the end of the temporary string, and finally replace the original character with that string.

function encode(string) {
    var result = [];
    for (var i = 0; i < string.length; i++) {
        var charCode = string.charCodeAt(i);
        var temp = "";
        while (charCode > 255) {
            temp += "ÿ";
            charCode -= 255;
        }
        result.push(temp + String.fromCharCode(charCode));
    }
    return result.join(",");
}

The above encoder puts a comma after every group. An encoded group can itself end with a comma character, which could cause problems at decode, so the decoder splits on the ,(?!,) regex, which only matches a comma that is not followed by another comma (i.e. the last comma in a run).

function decode(string) {
    var characters = string.split(/,(?!,)/g);
    var result = "";
    for (var i = 0; i < characters.length; i++) {
        var charCode = 0;
        for (var j = 0; j < characters[i].length; j++) {
            charCode += characters[i].charCodeAt(j);
        }
        result += String.fromCharCode(charCode);
    }
    return result;
}
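
As a rough sanity check of the pair above (a sketch only; note that input strings which themselves contain the delimiter character "," may not round-trip cleanly):

encode('ĉ');                  // "ÿ\n" (255 + 10 = 265, the char code of 'ĉ')
decode(encode('ĉ')) === 'ĉ';  // true
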
Bálint
  • That looks good but that's only half the answer. I'd need the `decode` function too. And I believe trying to encode `"ÿ"` could cause a problem. – RainingChain Jun 16 '16 at 19:41
  • Current code doesn't work. Try `decode(encode('ĉ')) == 'ĉ'`. I'll go with RemcoGerlich answer as it's also a lot faster. Thanks anyway. – RainingChain Jun 16 '16 at 20:09
  • @RainingChain Huh, I know what's the problem, lemme fix it – Bálint Jun 16 '16 at 20:11
  • @RainingChain Now it works, I don't know where you got the "other is faster" part from – Bálint Jun 16 '16 at 20:15
  • `encodeURIComponent` and `unescape` are native code, which is fast. Your solution uses string concatenation and `.split`, which are slow. – RainingChain Jun 16 '16 at 21:58
  • @RainingChain `encodeURIComponent` and `unescape` are actually pretty slow. @Bálint's code might even be faster. I added a performance comparison to my answer. – le_m Jun 17 '16 at 00:16
  • @RainingChain You see, native functions do more than you need, and they're optimized for longer inputs. I don't think you need to encode and decode a full page of text. – Bálint Jun 17 '16 at 00:21
1

UTF-8 is already an encoding for Unicode text that uses 8-bit code units. You can simply send the UTF-8 encoded string over the wire.

Generally, JavaScript strings consist of UTF-16 code units.

For such strings, you can either encode each UTF-16 code unit as two 8-bit characters or use a variable-length encoding such as UTF-8.

If you have many non-ASCII characters, the first might produce smaller results.

// See http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// Encode each UTF-16 code unit as two characters (high byte, then low byte).
function encode_fixed_length(s) {
  let length = s.length << 1,
      bytes = new Array(length);
  for (let i = 0; i < length; ++i) {
    let code = s.charCodeAt(i >> 1);
    bytes[i] = code >> 8;      // high byte
    bytes[++i] = code & 0xFF;  // low byte
  }
  return String.fromCharCode.apply(undefined, bytes);
}

// Reassemble each pair of bytes into one UTF-16 code unit.
function decode_fixed_length(s) {
  let length = s.length,
      chars = new Array(length >> 1);
  for (let i = 0; i < length; ++i) {
    chars[i >> 1] = (s.charCodeAt(i) << 8) + s.charCodeAt(++i);
  }
  return String.fromCharCode.apply(undefined, chars);
}

string_1 = "\u0000\u000F\u00FF";
string_2 = "\u00FF\u0FFF\uFFFF";

console.log(encode_fixed_length(string_1)); // "\x00\x00\x00\x0F\x00\xFF"
console.log(encode_fixed_length(string_2)); // "\x00\xFF\x0F\xFF\xFF\xFF"

console.log(encode_utf8(string_1));         // "\x00\x0F\xC3\xBF" 
console.log(encode_utf8(string_2));         // "\xC3\xBF\xE0\xBF\xBF\xEF\xBF\xBF"

Performance comparison: See https://jsfiddle.net/r0d9pm25/1/

Results for 500000 iterations in Firefox 47:

  • 6159.91ms encode_fixed_length()
  • 7177.35ms encode_utf8()
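
The fiddle is the actual benchmark; as a rough sketch of that kind of measurement (the test string below is arbitrary and not necessarily the one used in the fiddle), something along these lines can be run in a browser console:

let sample = "\u00FF\u0FFF\uFFFF".repeat(10); // arbitrary test string

let t0 = performance.now();
for (let i = 0; i < 500000; ++i) encode_fixed_length(sample);
console.log((performance.now() - t0).toFixed(2) + "ms encode_fixed_length()");

let t1 = performance.now();
for (let i = 0; i < 500000; ++i) encode_utf8(sample);
console.log((performance.now() - t1).toFixed(2) + "ms encode_utf8()");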
le_m