
Two questions in one; I'm not sure if that's allowed, but they're directly related to the same code. I retrieve a CSV string as an HTTP response in Javascript. This string seems to come in UTF-16 encoding, as it has for example 'â¬' instead of '€'.

a) How can I convert this to UTF-8 in vanilla Javascript?

b) Once that's done, how do I transform the multi-line CSV into a 2D array in vanilla Javascript?

Thanks!


[UPDATE]

Based on anqooqie's pointers, I went a slightly different way to re-encode the string, because the reencode function didn't work for me (it threw a generic error code). This is what I do now:

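var O = new ActiveXObject('ADODB.Stream');
O.Type = 2;               // 2 = adTypeText
O.Open();
O.Charset = 'ISO-8859-1'; // write the string back out as its original bytes
O.LineSeparator = 10;     // 10 = adLF
O.WriteText(csvStr);
O.Position = 0;           // rewind; Charset can only be changed at position 0
O.Charset = 'UTF-8';      // later reads decode those bytes as UTF-8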

This works fine and takes only a split second, even though the CSV has 35K rows. To put the result back into csvStr, I do

csvStr = O.ReadText();

but this takes ages - is that expected or am I doing something wrong?

For putting it into a 2D array, I split on the LineSeparator and then loop using a regex, which seems to work.

var A = [];
A.push(csvStr[0].match(/"[^"]*"|[^,]+/g)); // csvStr is the array of lines at this point

The long delay on ReadText is bothering me though, especially as WriteText is so quick. Any help is appreciated.

JasperD
  • This is just a comment on your question, but are you perhaps writing JScript on WSH? JavaScript has many variants, and JScript on WSH is one of them. Generally speaking, it is best to state your environment explicitly in your question, since that attracts more suitable answers from more suitable answerers. – anqooqie Apr 09 '19 at 16:33
  • writing jscript - so ecma3 (better known as old-cr*p vanilla javascript :-) - apologies for leaving that out. – JasperD Apr 09 '19 at 18:00

1 Answer


It looks like you are confused about character-encoding terminology, so let's go over it.

A string is just a string. There is no "UTF-16 string" or "UTF-8 string".

A character encoding is a protocol that converts between a string and a byte array. UTF-16 is one character encoding; UTF-8 and ISO-8859-1 are others. In UTF-16 (big-endian), the string '€' is encoded as the byte array 20 AC. In UTF-8, the string '€' is encoded as the byte array E2 82 AC. In ISO-8859-1, the byte array E2 82 AC is decoded to the string 'â¬'.

Now you can see that 'â¬' is not a "UTF-16 string". It is '€' encoded as UTF-8 and mistakenly decoded as ISO-8859-1.
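
To make the byte-level picture concrete, here is a small sketch. The helper names utf8Bytes and latin1Decode are made up for illustration, and the encoder only covers code points up to U+FFFF, which is enough for '€'.

```javascript
// Encode a string as UTF-8 bytes.
// Sketch only: handles code points up to U+FFFF, no surrogate pairs.
function utf8Bytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    } else {
      bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    }
  }
  return bytes;
}

// Decode bytes as ISO-8859-1: each byte becomes the code point of the same value.
function latin1Decode(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

var euroBytes = utf8Bytes('\u20AC');   // the bytes E2 82 AC
var garbled = latin1Decode(euroBytes); // '\u00E2\u0082\u00AC'
```

The middle character, U+0082, is an invisible control character, which is why only 'â¬' shows up on screen.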

a) How can I convert this to UTF-8 in vanilla Javascript?

What you should do is fix the code that retrieves the CSV file. I cannot tell you how to fix it, since I have not seen your code, but I believe it currently decodes the CSV file as ISO-8859-1. You should change that character encoding from ISO-8859-1 to UTF-8.

If the code is not yours and you cannot fix it, you can use a workaround: 1) re-encode the mistakenly decoded string as ISO-8859-1, and 2) re-decode the resulting bytes as UTF-8.

1)

// Note: This code requires ES5 or later.
// Returns the ISO-8859-1 bytes of inputString as an array of numbers.
function reencode(inputString) {
  return Array.apply(null, Array(inputString.length)).map(function (x, i) {
    return inputString.charCodeAt(i);
  });
}

2)

See this answer.
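
In case it helps, both steps can be sketched in one function. This is only an illustration under some assumptions: fixMisdecodedUtf8 is a hypothetical name, the input really was decoded as ISO-8859-1 (not Windows-1252, which maps the bytes 0x80 to 0x9F differently), and the underlying bytes are well-formed UTF-8. It sticks to ES3 constructs, so it should also suit JScript, though that is untested.

```javascript
// Step 1: treat every character of the mis-decoded string as one byte.
// Step 2: decode that byte sequence as UTF-8.
// Sketch only: malformed or truncated sequences are not detected.
function fixMisdecodedUtf8(str) {
  var out = [];
  var i = 0;
  while (i < str.length) {
    var b0 = str.charCodeAt(i++) & 0xFF;
    var cp;
    if (b0 < 0x80) {            // 1-byte sequence (ASCII)
      cp = b0;
    } else if (b0 < 0xE0) {     // 2-byte sequence
      cp = ((b0 & 0x1F) << 6) |
           (str.charCodeAt(i++) & 0x3F);
    } else if (b0 < 0xF0) {     // 3-byte sequence
      cp = ((b0 & 0x0F) << 12) |
           ((str.charCodeAt(i++) & 0x3F) << 6) |
           (str.charCodeAt(i++) & 0x3F);
    } else {                    // 4-byte sequence
      cp = ((b0 & 0x07) << 18) |
           ((str.charCodeAt(i++) & 0x3F) << 12) |
           ((str.charCodeAt(i++) & 0x3F) << 6) |
           (str.charCodeAt(i++) & 0x3F);
    }
    if (cp > 0xFFFF) {
      // Outside the BMP: emit a UTF-16 surrogate pair.
      cp -= 0x10000;
      out.push(String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)));
    } else {
      out.push(String.fromCharCode(cp));
    }
  }
  return out.join('');
}

var fixed = fixMisdecodedUtf8('\u00E2\u0082\u00AC'); // the mis-decoded euro sign
```

Collecting the pieces in an array and joining once at the end avoids repeated concatenation on a long string, which tends to matter on older engines.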

b) How do I transform the multi-line CSV into a 2D array in vanilla Javascript?

See this answer.
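
For reference, a minimal sketch of such a parser follows. parseCsv is a hypothetical name, and the sketch assumes comma-separated fields, double quotes around fields that contain commas or line breaks, and "" as an escaped quote inside a quoted field.

```javascript
// Parse CSV text into an array of rows, each row an array of field strings.
// Sketch only: assumes ',' as separator and '"' as quote character.
function parseCsv(text) {
  var rows = [];
  var row = [];
  var field = '';
  var inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          field += '"';       // doubled quote inside a quoted field
          i++;
        } else {
          inQuotes = false;   // closing quote
        }
      } else {
        field += ch;          // commas and line breaks are literal here
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text.charAt(i + 1) === '\n') {
        i++;                  // treat CRLF as a single line break
      }
      row.push(field);
      field = '';
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

var table = parseCsv('a,b\n"x,y","he said ""hi"""\n');
```

Unlike matching each line against /"[^"]*"|[^,]+/g, this keeps empty fields and strips the surrounding quotes.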

anqooqie
  • First off, THANK YOU for taking the time to answer, it's much appreciated. Contrary to how it may look, I did actually search the site for answers, but couldn't find suitable ones. Can you share what you searched for in order for those answers to come up? I'll try your suggestions and report back. – JasperD Apr 09 '19 at 11:17
  • To break down your question, I did not search because I knew that. To answer a-1), I did not search because I knew that. To answer a-2), I searched for "decode utf-8 javascript" in Google. To answer b), I searched for "csv javascript stack overflow" in Google and found [a duplicate question](https://stackoverflow.com/q/1293147/5688192). – anqooqie Apr 09 '19 at 14:12
  • Based on your reply, I have updated my original question (with a different approach, but couldn't have done it without your help). If you can, please have a look at the one open pointer, as it's breaking my head at the moment :) – JasperD Apr 09 '19 at 15:31
  • Yes, treating built-in text datatypes as just containing text is ideal. But some text processing uses length and position indexes. In such cases it is sometimes unavoidable to rely on the fact that, in JavaScript (as in .NET, Java, ...), a string is a counted sequence of UTF-16 code units. – Tom Blodget Apr 10 '19 at 10:34