parameter from package.json script (Encoding problem)

Question

https://nodejs.org/docs/latest/api/process.html#processargv https://www.golinuxcloud.com/pass-arguments-to-npm-script/

I am passing a parameter when invoking a script from package.json, as follows:

--pathToFile=./ESMM/Parametrização_Dezembro_PS1_2022.xlsx

In the code, I retrieve that parameter as an argument:

const value = process.argv.find( element => element.startsWith( `--pathToFile=` ) );
const pathToFile=value.replace( `--pathToFile=` , '' );
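The lookup above throws if the argument is missing, because `find` returns `undefined`. A minimal sketch of a more defensive variant follows; the helper name `getArgValue` is hypothetical and not part of the original code.

```javascript
// Hypothetical helper: returns the value of a "--name=value" argument,
// or undefined when no such argument was passed.
function getArgValue(argv, name) {
  const prefix = `--${name}=`;
  const match = argv.find((element) => element.startsWith(prefix));
  return match === undefined ? undefined : match.slice(prefix.length);
}

const pathToFile = getArgValue(process.argv, "pathToFile");
if (pathToFile === undefined) {
  console.error("Missing --pathToFile=<path> argument");
} else {
  console.log(pathToFile);
}
```

Using `slice` with the prefix length removes only the leading `--pathToFile=`, so a value that happens to contain the same text later on is left untouched.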

The string obtained seems to be in the wrong format/encoding:

./ESMM/Parametriza├º├úo_Dezembro_PS1_2022.xlsx

I tried converting to latin1 (other past issues were fixed with this encoding)

const buffer = require("buffer");
const latin1Buffer = buffer.transcode(Buffer.from(pathToFile), "utf8", "latin1");
const latin1String = latin1Buffer.toString("latin1");

but I still don't get the string in the correct encoding:

./ESMM/Parametriza?º?úo_Dezembro_PS1_2022.xlsx

My package.json is in UTF-8.

My current locale is (chcp): Active code page: 850

OS: Windows

This seems to be related to:

will try those configurations

    const min = parseInt("0xD800",16), max = parseInt("0xDFFF",16);
    console.log(min);//55296
    console.log(max);//57343

    let textFiltered = "",specialChars = 0;
    for(let charAux of pathToFile){
        const hexChar = Buffer.from(charAux, 'utf8').toString('hex');
        console.log(hexChar)
        const intChar = parseInt(hexChar,16);
        if(hexChar.length > 2){
        //if(intChar>min && intChar<max){
            //console.log(Buffer.from(charAux, 'utf8').toString('hex'))
            specialChars++;
            console.log(`specialChars(${specialChars}): ${hexChar}`);
        }else{
            textFiltered += String.fromCharCode(intChar);
        }
    }

console.log(textFiltered); //normal characters

./ESMM/Parametrizao_Dezembro_PS1_2022.xlsx

console.log(`specialChars(${specialChars}): ${hexChar}`); //special characters

specialChars(1): e2949c  
specialChars(2): c2ba  
specialChars(3): e2949c  
specialChars(4): c3ba

It seems that the hex value e2949c indicates a special character, since it repeats. Ideally, 0xc2ba should convert to "ç" and 0xc3ba to "ã". I am still trying to figure that out.

Each Unicode codepoint can be written in a string with \u{xxxxxx} where xxxxxx represents 1–6 hex digits
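To make the hex values above easier to read, the sketch below (not from the original post) lists each character of the garbled fragment with its Unicode code point and its UTF-8 bytes. It uses the `\u{...}` notation just mentioned; the helper name `describeChars` is hypothetical. The output shows that e2949c, c2ba and c3ba are the UTF-8 bytes of the garbled characters ├, º and ú themselves.

```javascript
// Hypothetical helper: list each character with its code point and UTF-8 bytes.
function describeChars(text) {
  const rows = [];
  for (const ch of text) {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    rows.push({
      char: ch,
      codePoint: "U+" + hex,
      utf8Hex: Buffer.from(ch, "utf8").toString("hex"),
    });
  }
  return rows;
}

// "\u{251C}\u{BA}\u{251C}\u{FA}" is the garbled fragment "├º├ú".
console.log(describeChars("\u{251C}\u{BA}\u{251C}\u{FA}"));
```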

You face a [mojibake](https://en.wikipedia.org/wiki/Mojibake) case (*example in Python for its universal intelligibility*): `'├º├ú'.encode( 'cp850').decode( 'utf-8')` returns `'çã'` and vice versa: `'çã'.encode( 'utf-8').decode( 'cp437')` returns `'├º├ú'`… — JosefZ, Dec 07 '22 at 21:35
Perhaps see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors; the [tag:character-encoding] tag's [info page](/tags/character-encoding/info) has some additional hints for how to ask a well-formed question. What is your current locale and encoding, and which OS are you on? — tripleee, Dec 12 '22 at 14:47
my current locale and encoding using chcp is Active code page: 850, the OS is Windows — H.C, Dec 12 '22 at 14:53
I am getting the array of hex values with `const TextBuffer = buffer.transcode(Buffer.from(pathToFile), "utf8", "utf16le")` and will try to convert each character, knowing that "each extra characters are stored in UTF-16 as surrogate pairs" (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#utf-16_characters_unicode_codepoints_and_grapheme_clusters) — H.C, Dec 12 '22 at 14:59
If the file name is correct on the command line, you should be fine. My suspicion is that you got thrown off by the gunk when you tried to print it, but if you have a file with that name, you should just pass that as an opaque token to the OS without further mangling. — tripleee, Dec 12 '22 at 17:18
If I pass that parameter directly to `const reader = require('xlsx'); const file = reader.readFile(pathToFile);` it ends with an error. But replacing it directly with `const file = reader.readFile('./ESMM/Parametrização_Dezembro_PS1_2022.xlsx');` works as expected. — H.C, Dec 12 '22 at 17:27
the character encoding of the script is UTF-8, also added more information into the question — H.C, Dec 12 '22 at 17:43
So if you have a file whose name is valid UTF-8 but your console only supports CP850, you would use `--pathToFile=./ESMM/Parametriza├º├úo_Dezembro_PS1_2022.xlsx` (or switch to a console which supports Unicode, or even a better OS). But that doesn't seem to agree with what you originally posted. If you tab-complete the file name at the CMD prompt, what do you get? — tripleee, Dec 12 '22 at 17:46
I get the file name correctly: .\Parametrização_Dezembro_PS1_2022.xlsx — H.C, Dec 12 '22 at 17:53

H.C · Accepted Answer · 2022-12-16T12:15:43.657

As @JosefZ indicated (though with a Python example), in my case I am going to use a direct conversion, since the parameters will all have the keyword "Parametrização" as part of them.

The problem I encountered in this case is that my package.json and my script are in the correct format, UTF-8, as stated by @tripleee (thanks for the help provided), but process.argv returns a <string[]> that is basically UTF-16. So my solution is to deal with the ├, which in hex is "e2949c", and retrieve the correct characters:

const UTF8_Character = "e2949c"; // UTF-8 bytes (hex) of "├"
// For these cases, use this lookup object, which holds the correctly encoded characters
const personalized_encoding = {
    "c2ba": "ç",
    "c3ba": "ã"
}

let textFiltered = "",specialChars = 0;
for(let charAux of pathToFile){
    const hexChar = Buffer.from(charAux, 'utf8').toString('hex');
    //console.log(hexChar)
    const intChar = parseInt(hexChar,16);
    if(hexChar.length > 2){
        if(hexChar === UTF8_Character) continue;
        specialChars++;
        //console.log(`specialChars(${specialChars}): ${hexChar}`);
        // Keep unmapped characters as-is instead of appending "undefined"
        textFiltered += personalized_encoding[hexChar] ?? charAux;
    }else{
        textFiltered += String.fromCharCode(intChar);
    }
}

console.log(textFiltered);
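For file names containing other accented characters, a more general variant is possible. The sketch below is not part of the original answer. It assumes, as JosefZ's comment suggests, that the argument holds UTF-8 bytes that were decoded as code page 850 (the `chcp` value from the question), and it reverses that step. Node's `Buffer` encodings do not include CP850, so the table `CP850_HIGH` is typed by hand here and should be checked against an authoritative CP850 chart. The names `encodeCp850` and `fixMojibake` are hypothetical.

```javascript
// Upper half (bytes 0x80-0xFF) of code page 850, 16 characters per row.
// Assumption: hand-typed table; verify against a CP850 reference before relying on it.
const CP850_HIGH =
  "ÇüéâäàåçêëèïîìÄÅ" +
  "ÉæÆôöòûùÿÖÜø£Ø×ƒ" +
  "áíóúñÑªº¿®¬½¼¡«»" +
  "░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐" +
  "└┴┬├─┼ãÃ╚╔╩╦╠═╬¤" +
  "ðÐÊËÈıÍÎÏ┘┌█▄¦Ì▀" +
  "ÓßÔÒõÕµþÞÚÛÙýÝ¯´" +
  "\u00AD±‗¾¶§÷¸°¨·¹³²■\u00A0";

// Hypothetical helper: turn text back into the CP850 bytes it was decoded from.
function encodeCp850(text) {
  const bytes = [];
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code < 0x80) {
      bytes.push(code); // ASCII maps to itself
      continue;
    }
    const index = CP850_HIGH.indexOf(ch);
    if (index === -1) {
      throw new Error(`Character ${JSON.stringify(ch)} is not in CP850`);
    }
    bytes.push(0x80 + index);
  }
  return Buffer.from(bytes);
}

// Hypothetical helper: reinterpret the recovered bytes as UTF-8.
function fixMojibake(text) {
  return encodeCp850(text).toString("utf8");
}

console.log(fixMojibake("./ESMM/Parametriza├º├úo_Dezembro_PS1_2022.xlsx"));
```

Compared with the fixed two-entry lookup, this handles any character whose UTF-8 bytes were shown as CP850, but it throws on input that contains characters outside CP850, which is a sign that the string was not garbled in this particular way.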
