
When I receive a UTF-8 string over a socket (or from any other external source), I would like to get it as a properly decoded string object. The following code shows what I mean:

var str='21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed';

// Find CRLF
var i=str.indexOf('\r\n');

// Parse size up until CRLF
var x=parseInt(str.slice(0, i), 10);

// Read size bytes
var s=str.substr(i+2, x);

console.log(s);

This code should print:

Just a demo string äè

but because the UTF-8 data is not decoded properly, the output stops after the first non-ASCII character:

Just a demo string ä

Does anyone have an idea how to decode this properly?

user3847784
  • You may want to use [Punycode](https://en.wikipedia.org/wiki/Punycode), here is a library too: https://github.com/bestiejs/punycode.js/ – howderek Jul 17 '14 at 17:29
  • This might help: http://stackoverflow.com/questions/17057407/javascript-create-a-string-or-char-from-an-utf-8-value – Diodeus - James MacFarlane Jul 17 '14 at 17:29
  • @howderek Thanks, but how would a punycode library help in this case? – user3847784 Jul 17 '14 at 17:30
  • nvm, I thought you were doing this over http, use this string instead: '21\r\nJust a demo string \xE4\xE8\xC3\xA8-should not be anymore parsed' you simply used the wrong escapes – howderek Jul 17 '14 at 17:36

2 Answers

1

It seems you could use decodeURIComponent(escape(str)) for this:

var badstr='21\r\nJust a demo string \xC3\xA4\xC3\xA8-should not be anymore parsed';

var str=decodeURIComponent(escape(badstr));

// Find CRLF
var i=str.indexOf('\r\n');

// Parse size up until CRLF
var x=parseInt(str.slice(0, i), 10);

// Read size characters (str is already decoded at this point)
var s=str.substr(i+2, x);

console.log(s);

BTW, this kind of issue occurs when you mix UTF-8 with other encodings. You should check for that as well.
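To make the trick above less opaque, here is a minimal sketch of the two steps taken separately. The shortened input string and the variable names are illustrative assumptions, not part of the original question. Note that escape() is a legacy function, so treat this as an explanation of the mechanism rather than a recommendation.

```javascript
// Assumption: every character in this "binary string" holds one byte (0x00-0xFF).
var rawBytes = 'string \xC3\xA4\xC3\xA8';

// Step 1: escape() percent-encodes each non-ASCII byte, e.g. \xC3 becomes %C3.
var percentEncoded = escape(rawBytes);

// Step 2: decodeURIComponent() interprets the %XX sequences as UTF-8
// and returns the real characters.
var decoded = decodeURIComponent(percentEncoded);

console.log(percentEncoded);
console.log(decoded);
```

If the input is not valid UTF-8, decodeURIComponent throws a URIError, so it may be worth wrapping the call in try/catch when the data comes from an untrusted source.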

1

You should use utf8.js, which is available on npm.

var utf8 = require('utf8');
var encoded = '21\r\nJust a demo string \xC3\xA4\xC3\xA8-foo bar baz';
var decoded = utf8.decode(encoded);
console.log(decoded);
Mathias Bynens
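As a possible alternative to the approaches above that needs no extra package, the same conversion can be sketched with Node's built-in Buffer. This assumes the code runs in Node.js and that the string really contains one byte per character; the variable names are illustrative.

```javascript
// Assumption: Node.js, where Buffer is available as a global.
var encoded = '21\r\nJust a demo string \xC3\xA4\xC3\xA8-foo bar baz';

// 'latin1' maps each character code 0x00-0xFF back to a single byte.
var bytes = Buffer.from(encoded, 'latin1');

// Reinterpret those bytes as UTF-8 to get the real characters.
var decodedWithBuffer = bytes.toString('utf8');

console.log(decodedWithBuffer);
```

If the data arrives from a socket as a Buffer in the first place, calling toString('utf8') on it directly should avoid the intermediate binary string altogether.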