3

I am building a web application in NodeJS version 12. I have data from an old MySQL database. There are several fields that contain characters that are not displaying properly due to an encoding issue with the old database. There are some similar questions already but none of them have solved my issue. After trying, I'm a little closer to a solution, but still need help on this.

Current value in database to convert:

Rikuchi SokuryoÌ„bu [cartographer], 陸地測é‡éƒ¨

Desired new database value:

Rikuchi Sokuryōbu [cartographer], 陸地測量部

The issue is the same as described in this similar question. However, the accepted answer does not solve my issue. I need to write a NodeJS to convert the data in the database into a readable string.

I also tried following the answer in this similar question. I understand that the value needs to be first converted to binary and then to the desired encoding. However, it does not return the desired result. I tried this with the iconv and iconv-lite packages.

ATTEMPT 1:

let buf = new Buffer(body, 'binary');
let conv = new iconv.Iconv('windows-1252', 'utf8');
let str = conv.convert(buf).toString();
console.log(`original: ${body} output: ${str.toString()}`);

// original: Rikuchi SokuryoÌ„bu [cartographer], 陸地測é‡éƒ¨
// output: Rikuchi SokuryoМbu [cartographer], й"ёеS°жё¬й!Џй’Ё

ATTEMPT 2: iconv-lite

let buf = new Buffer(body, 'binary');
const str = iconvlite.decode(buf, 'windows-1252');
console.log(`original: ${body} output: ${str.toString()}`);

// original: Rikuchi SokuryoÌ„bu [cartographer], 陸地測é‡éƒ¨
// output: Rikuchi SokuryoМbu [cartographer], й"ёеS°жё¬й!Џй’Ё

ATTEMPT 3: iconv-lite

// This one *almost* works however there are still some undefined characters

let buf = new Buffer(body, 'utf-8');
const win = iconvlite.encode(buf, 'windows-1252');
console.log(`original: ${body} output: ${win.toString()}`);

// original: Rikuchi SokuryoÌ„bu [cartographer], 陸地測é‡éƒ¨
// output: Rikuchi Sokuryōbu [cartographer], 陸地測�?部

UPDATE:

This website string-functions.com can encode and decode strings.

The entire problematic string is decoded correctly with the settings: "Encode with: Windows-1252" and "Decode with: utf-8"

It also works perfectly for larger examples of this problem. I just need to replicate exactly how this site is doing the conversion. My code in attempt #3 is very close, but there must be a step missing.

stringfunctions.com

pengz
  • 2,279
  • 3
  • 48
  • 91

3 Answers3

4

I solved this by using the windows-1252 module to encode the original text and then decoded it using the iconv-lite module.

const win = windows1252.encode(body);
const str = iconvlite.decode(win, 'utf-8');
return str.toString();
pengz
  • 2,279
  • 3
  • 48
  • 91
  • 1
    I call this scheme "Base256": _encoding_ a byte string as if it was text by _decoding_ it with a character encoding that supports arbitrary sequences of any byte value 0-255. The specific character encoding should be documented. In this case, the byte sequence is the result of applying the UTF-8 character encoding to text. That too should be documented. You've discovered how to reverse both steps. – Tom Blodget Jul 06 '19 at 18:03
  • 1
    I had the same issue with some old `latin1` mysql tables, and this was the only solution that got all of my characters right! The Windows-1252 module properly encoded `Don’t worry.` and `Malört`, where all of the other solutions missed either the apostrohe or the ö. – broox Mar 11 '20 at 03:37
  • 1
    This is an old post but this might have just saved me from the biggest headache of my life. – Jaxon Crosmas Jul 26 '22 at 03:13
4

Great post, and thank you @pengz, that worked for me. I agree all the other solutions missed either the apostrophe and other characters. This is the only solution that worked for me. One thing to note however: I was able to accomplish the same, but all with iconvlite:

const win = iconvlite.encode(body, 'windows-1252');
const str = iconvlite.decode(win, 'utf-8');
return str.toString();
Kevin
  • 603
  • 5
  • 7
  • For anyone looking at this in the future, I tried using `Iconv` for both the encode to windows-1252 and the decode to utf-8, but this only worked in certain cases for me. Using the [windows-1252](https://www.npmjs.com/package/windows-1252) module seems to be more reliable for encoding to windows-1252. Make sure to try both methods to figure out what works for you. – Jaxon Crosmas Jul 26 '22 at 13:23
2

If you get in response something like questions mark (Unicode block) , it means your notepad (IDE) shows it in utf8 format, but it gets its data in another format.

To fix it I used iconv:

const { Iconv } = require('iconv');
const convertedXmlResponse = Iconv('windows-1251', 'utf8')
    .convert(dataBuffer)
    .toString();
Daniel_Knights
  • 7,940
  • 4
  • 21
  • 49
ahd3r
  • 21
  • 3