I received a broken string from another piece of software. I would like to fix its encoding in JavaScript, but I feel I am missing something.

Here's an example of a broken string: Détecté àlors ôù
And the expected output would be: Détecté àlors ôùi

I don't know the encoding used to send me the string.

My idea is to use the TextDecoder API: convert the string to bytes, then re-encode it as UTF-8 or UTF-16.

Here's the code I used to try to detect the original charset:

const str = 'Détecté àlors ôùi';
const str2 = 'Détecté àlors ôù';

const charsets = [
  'utf-8',
  "ibm866",
  "iso-8859-2",
  "iso-8859-3",
  "iso-8859-4",
  "iso-8859-5",
  "iso-8859-6",
  "iso-8859-7",
  "iso-8859-8",
  "iso-8859-8-i",
  "iso-8859-10",
  "iso-8859-13",
  "iso-8859-14",
  "iso-8859-15",
  "iso-8859-16",
  "koi8-r",
  "koi8-u",
  "macintosh",
  "windows-874",
  "windows-1250",
  "windows-1251",
  "windows-1252",
  "windows-1253",
  "windows-1254",
  "windows-1255",
  "windows-1256",
  "windows-1257",
  "windows-1258",
  "x-mac-cyrillic",
  "gbk",
  "gb18030",
  "hz-gb-2312",
  "big5",
  "euc-jp",
  "iso-2022-jp",
  "shift-jis",
  "euc-kr",
  "iso-2022-kr",
  "utf-16be",
  "utf-16le",
  "iso-2022-cn"
];

// Note: TextEncoder always encodes to UTF-8, whatever the string's history.
const encoder = new TextEncoder();
const view = encoder.encode(str2);

console.log('__________________')

charsets.forEach((charset) => {
  try {
    // `fatal` and `ignoreBOM` are TextDecoder constructor options, not decode() options.
    const decoder = new TextDecoder(charset, {
      fatal: false,
      ignoreBOM: true,
    });
    const fixedStr = decoder.decode(view);

    console.log(charset, fixedStr);
  } catch (e) {
    console.log(charset, 'invalid');
  }
})

(the code can be tested here: https://jsfiddle.net/tashebwj/ )

The output is the following:

__________________
?editor_console=true:57 utf-8 Détecté àlors ôù
?editor_console=true:57 ibm866 D├Г┬йtect├Г┬й ├Г┬аlors ├Г┬┤├Г┬╣
?editor_console=true:57 iso-8859-2 DĂŠtectĂŠ Ă lors Ă´Ăš
?editor_console=true:57 iso-8859-3 D�Âİtect�Âİ � lors �´�Âı
?editor_console=true:57 iso-8859-4 DÊtectÊ àlors ôÚ
?editor_console=true:57 iso-8859-5 DУТЉtectУТЉ УТ lors УТДУТЙ
?editor_console=true:57 iso-8859-6 Dأآ�tectأآ� أآ lors أآ�أآ�
?editor_console=true:57 iso-8859-7 DΓΒ©tectΓΒ© ΓΒ lors ΓΒ΄ΓΒΉ
?editor_console=true:57 iso-8859-8 D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-8-i D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-10 DÃÂĐtectÃÂĐ Ã lors ÃÂīÃÂđ
?editor_console=true:57 iso-8859-13 DĆĀ©tectĆĀ© ĆĀ lors Ć“ù
?editor_console=true:57 iso-8859-14 Détecté àlors ÃÂṀÃÂṗ
?editor_console=true:57 iso-8859-15 Détecté àlors ÃŽù
?editor_console=true:57 iso-8859-16 DĂ©tectĂ© Ă lors ĂÂŽĂÂč
?editor_console=true:57 koi8-r Dц┐б╘tectц┐б╘ ц┐б═lors ц┐б╢ц┐б╧
?editor_console=true:57 koi8-u Dц┐б╘tectц┐б╘ ц┐б═lors ц┐бЄц┐б╧
?editor_console=true:57 macintosh Détecté àlors ôù
?editor_console=true:57 windows-874 Dรยฉtectรยฉ รย lors รยดรยน
?editor_console=true:57 windows-1250 DĂ©tectĂ© Ă lors Ă´ĂÂą
?editor_console=true:57 windows-1251 DГѓВ©tectГѓВ© ГѓВ lors ГѓВґГѓВ№
?editor_console=true:57 windows-1252 Détecté àlors ôù
?editor_console=true:57 windows-1253 Détecté àlors ôù
?editor_console=true:57 windows-1254 Détecté àlors ôù
?editor_console=true:57 windows-1255 Dֳƒֲ©tectֳƒֲ© ֳƒֲ lors ֳƒֲ´ֳƒֲ¹
?editor_console=true:57 windows-1256 Dأƒآ©tectأƒآ© أƒآ lors أƒآ´أƒآ¹
?editor_console=true:57 windows-1257 DĆĀ©tectĆĀ© ĆĀ lors Ć´ù
?editor_console=true:57 windows-1258 DĂƒÂ©tectĂƒÂ© ĂƒÂ lors ĂƒÂ´ĂƒÂ¹
?editor_console=true:57 x-mac-cyrillic D√Г¬©tect√Г¬© √Г¬†lors √Г¬і√Г¬є
?editor_console=true:57 gbk D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 gb18030 D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 hz-gb-2312 invalid
?editor_console=true:57 big5 D�穢tect�穢 ��饊ors �織�繒
?editor_console=true:57 euc-jp D�息tect�息 ��lors �卒�孫
?editor_console=true:57 iso-2022-jp D����tect���� ����lors ��������
?editor_console=true:57 shift-jis Dテδゥtectテδゥ テδ�lors テδエテδケ
?editor_console=true:57 euc-kr D횄짤tect횄짤 횄혻lors 횄쨈횄쨔
?editor_console=true:57 iso-2022-kr invalid
?editor_console=true:57 utf-16be 䓃菂ꥴ散瓃菂ꤠ쎃슠汯牳⃃菂듃菂�
?editor_console=true:57 utf-16le 썄슃璩捥썴슃₩菃ꃂ潬獲쌠슃쎴슃�
?editor_console=true:57 iso-2022-cn invalid

Why does this method not work? Is it possible to fix the string with this method, or in another way?
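
A likely reason, offered as a minimal sketch rather than a definitive diagnosis: `TextEncoder` only ever produces UTF-8, so encoding the garbled string yields the UTF-8 bytes of the garbage characters, not the bytes the other software originally sent. Decoding those bytes as UTF-8 returns the same string, and any other charset can only add another layer of garbage. The `sample` value below is an assumed stand-in for a single garbled character.

```javascript
// "Ã©" is what "é" (UTF-8 bytes C3 A9) looks like after one wrong decode.
const sample = '\u00C3\u00A9';

// TextEncoder has no charset parameter: the result is always UTF-8.
const bytes = new TextEncoder().encode(sample);
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');

// Decoding the same bytes as UTF-8 is a plain round trip.
const roundTrip = new TextDecoder('utf-8').decode(bytes);

console.log(hex);                  // c3 83 c2 a9 (four bytes, not the original c3 a9)
console.log(roundTrip === sample); // true
```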

Alexis Delrieu
  • 1
    A problem here is that you're starting with the data *as a JavaScript string literal* of garbage characters. That is not really the original data; it's the original data already misinterpreted, with the resulting garbage expressed as a proper JavaScript string. You should start any encoding conversion from the original bytes, not from an already converted wrong result. Can you give more detail on how you got this string? – deceze Jun 08 '23 at 13:35
  • @deceze thanks for your answer. I see your point: we lose information in a way that makes it impossible to re-encode, don't we? I get the original string from the output of a piece of software from my employer. – Alexis Delrieu Jun 08 '23 at 13:53
  • 1
    Yes, you would at least have to undo whatever conversion has already been done up to the point of having a JS string literal; and it's possible the conversion was lossy. – deceze Jun 08 '23 at 14:03
  • Alright, it makes sense why I was not able to do it then. – Alexis Delrieu Jun 08 '23 at 14:16
  • I found this library, jschardet ( https://github.com/aadsm/jschardet ), to detect the encoding of a JS string. I have updated the jsfiddle to use it, and it looks like it properly detects the encoding of the strings. https://jsfiddle.net/entzy31p/ – Alexis Delrieu Jul 04 '23 at 11:28
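
Following up on the point above about undoing the earlier conversion: the sample looks like UTF-8 text that was decoded as windows-1252 twice. That is an assumption based only on the shape of the garbage, not something the thread confirms. Under that assumption, a rough sketch of the reverse operation is to map each character back to its windows-1252 byte and decode the bytes as UTF-8, repeating until it stops working. `encodeWindows1252`, `undoOnce`, and `repair` are hypothetical helper names, not part of any API.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F (WHATWG mapping;
// the five unassigned bytes map to the matching C1 control characters).
const CP1252_HIGH =
  '\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021' +
  '\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F' +
  '\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014' +
  '\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178';

// Hypothetical helper: recover the windows-1252 bytes a string was decoded from.
function encodeWindows1252(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const high = CP1252_HIGH.indexOf(text[i]);
    if (high !== -1) {
      bytes[i] = 0x80 + high;
    } else if (code < 0x80 || (code >= 0xa0 && code <= 0xff)) {
      bytes[i] = code; // ASCII and 0xA0..0xFF match Latin-1 one to one
    } else {
      throw new RangeError(`U+${code.toString(16)} is not in windows-1252`);
    }
  }
  return bytes;
}

// Hypothetical helper: undo one "UTF-8 bytes read as windows-1252" step.
// `fatal: true` makes invalid UTF-8 throw instead of silently producing U+FFFD.
function undoOnce(text) {
  return new TextDecoder('utf-8', { fatal: true }).decode(encodeWindows1252(text));
}

// Hypothetical helper: repeat the undo step until it fails or changes nothing.
function repair(text, maxRounds = 3) {
  let current = text;
  for (let round = 0; round < maxRounds; round++) {
    try {
      const next = undoOnce(current);
      if (next === current) break; // pure ASCII: nothing left to undo
      current = next;
    } catch {
      break; // not reversible any further; keep the last good result
    }
  }
  return current;
}

// "é" after two wrong decodes, written with escapes to avoid invisible characters.
console.log(repair('D\u00C3\u0192\u00C2\u00A9tect\u00C3\u0192\u00C2\u00A9')); // Détecté
```

This only works if no character was lost along the way. The sample in the question already lacks its final "i", and a non-breaking space replaced by a plain space would make the repair stop early. It can also over-correct text that legitimately contains sequences such as "Ã©".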

0 Answers