
In Python, we can do something like `print("some random string".encode().decode('utf-16'))`, which will output: 潳敭爠湡潤瑳楲杮.

I feel like that is UTF-16, but I'm not really sure, because I can't reproduce it in any other language. My goal is to create a function that does exactly this, but in JavaScript. The problem is that I can't find out what kind of string this is...

Does anyone know what this is called and/or how I could reproduce it in JS?

  • Maybe this can help? https://stackoverflow.com/questions/37596748/how-do-i-encode-a-javascript-string-in-utf-16 – Jonathan Hamel Jan 22 '21 at 20:59
  • So you want two characters to be the representation of one? What if the original characters are not in the first unicode page? JavaScript strings are already encoded as utf-16. Maybe you could explain what your higher level goal is. Why do you need this? – trincot Jan 22 '21 at 21:02
  • Hello, yes, this question is useful. I already saw this page, and some of the functions convert the string to the unicode form \uXXXX; that is not what I want (I want the Asian-looking output) – Clément Guibout Jan 22 '21 at 21:05
  • @trincot Yes, this is exactly what I want to do. I am currently struggling to represent two chars as one in JS, but in Python it's easier – Clément Guibout Jan 22 '21 at 21:06
  • You took a `str` and encoded it with the system's default encoding (which is UTF-8 in most cases). Then you decoded the resulting `bytes` (wrongly) as UTF-16. – Klaus D. Jan 22 '21 at 21:11
  • I hope you will answer the "why" question, because when encoding and decoding do not match, you can get errors; you will have problems with an odd number of characters; and it saves nothing: no memory, no CPU cycles, nothing. – VPfB Jan 22 '21 at 21:14
  • This is not some specific kind of string. In Python, all strings are unicode; they don't have an encoding. You can *encode* a string in a particular encoding to produce `bytes`. If you then *decode* it back into a string with a different encoding, you may or may not get an error, and it may not return what you expect – juanpa.arrivillaga Jan 22 '21 at 21:14
  • @trincot With Python, we can execute this kind of """"encrypted"""" string with `exec(bytes(encoded_string,"u16")[2:])`. My goal is to make a code golfer, so from a starting piece of code that is, let's say, X chars long, we could produce code that is X/2+len(coating_code_needed_to_execute) chars long. The goal is to do this in JS. Is this possible in JS? – Clément Guibout Jan 22 '21 at 21:27
  • Please don't call it encryption, even in quotes. It's not encryption. I've seen people do almost exactly this and think it was encryption. It's obfuscation at best. Python and JavaScript both have access to the same standard set of actual encryption algorithms if that's what you want. If you want a one-way conversion from a human-readable string to something obfuscated, use a hashing function. If you want it to be anything like secure, you'll need to salt it. Encryption is hard to do correctly. Please don't call this encryption. – Adam Azarchs Jan 25 '21 at 19:29

1 Answer


A string is a sequence of runes. Unicode is a standard for assigning numeric values to those runes. UTF-8 or UTF-16 are standards for encoding a sequence of runes, as represented by their unicode numeric values, as a sequence of bytes.
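
To make that distinction concrete, here is a tiny sketch in JavaScript (assuming an environment that provides `TextEncoder`, e.g. a modern browser or recent Node.js); the same rune can take a different number of UTF-8 bytes than UTF-16 code units:

const euro = "€";                              // one rune, assigned the Unicode value U+20AC
console.log(euro.codePointAt(0).toString(16)); // "20ac" — the numeric value Unicode assigns to it
console.log(new TextEncoder().encode(euro));   // Uint8Array [226, 130, 172] — three bytes in UTF-8
console.log(euro.length);                      // 1 — a single 16-bit code unit in a JS (UTF-16) string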

What you did there is use `encode` with the default encoding, which is UTF-8, to get a sequence of bytes, which you then tried to decode back to runes as if the bytes had come from a UTF-16 encoding. Basically (because your input string fits in a 1-byte encoding for UTF-8) you're taking pairs of characters from the input, jamming their bytes together and hoping that the resulting value is a legal UTF-16 encoding of something (which in general you cannot count on being true). You'll also run into issues if the UTF-8 encoding is not an even number of bytes, of course.
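
As a small illustration of that pairing (assuming plain ASCII input, so every character is a single UTF-8 byte), here is what happens to the first two characters of your string when their bytes are read back as one little-endian UTF-16 code unit, which is what Python's `decode('utf-16')` did on a typical little-endian machine:

const pair = "so";
const b1 = pair.charCodeAt(0);              // 0x73, the UTF-8 byte for 's'
const b2 = pair.charCodeAt(1);              // 0x6f, the UTF-8 byte for 'o'
const codeUnit = (b2 << 8) | b1;            // 0x6f73 — two bytes jammed into one 16-bit value
console.log(String.fromCharCode(codeUnit)); // "潳" — the first character of the Python output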

If you really need to do this in JavaScript, you could do something like this:

const str = "some random string";
// Round the buffer down to an even number of bytes so the Uint16Array view
// is valid even when the string length is odd (the trailing character is dropped).
const buf = new ArrayBuffer(2 * Math.floor(str.length / 2));
// Reinterpret the sequence of bytes as a sequence of byte pairs.
const bufView = new Uint16Array(buf);
for (let i = 0, strLen = str.length; i < strLen - 1; i += 2) {
  const c1 = str.charCodeAt(i);
  const c2 = str.charCodeAt(i + 1);
  if (c1 > 127 || c2 > 127) {
    // This will be a problem.  How you handle it is up to you.
  }
  bufView[i/2] = c1 << 8 | c2;
}
console.log(String.fromCharCode.apply(String, bufView));
  • Hello. Thanks for your answer, I see that your code is working fine, but I don't understand something. The line `bufView[i/2] = c1 << 8 | c2;` is a bit obscure to me: why are you shifting c1 by 8? And why | c2? – Clément Guibout Jan 22 '21 at 21:37
  • Edit: this is not working as expected ahah. At first it looks like your code is working because it produces the same type of chars, but in fact it does not produce the same result as Python. I am trying to correct this, but since I don't understand all your code, it is kinda hard – Clément Guibout Jan 23 '21 at 11:43
  • `<<` is the bit shift operator, `|` is bitwise or. You can get the same result you showed with the Python code if you switch the places of `c1` and `c2` in the equation that does the bit-packing. – Adam Azarchs Jan 25 '21 at 19:38
  • This brings up another issue with what you're trying to do, which is byte order. If you have two bytes, say for example the first one is 1 and the second is 2, and you want to interpret them as a single 16-bit number (which is what you're doing here), it could be either 258 or 513 depending on the _byte order_. The default byte order will depend on which operating system and CPU you're using - e.g. Windows x86 will default to little-endian, while Android on ARM may default to big-endian. If you want to guarantee consistency, you'd need to specify `utf-16-le` encoding; see the sketch below. – Adam Azarchs Jan 25 '21 at 19:39
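
Putting the byte-order comments above together, here is a hedged sketch of the little-endian variant (low byte first, i.e. `c1` and `c2` swapped in the packing). On the assumption that the Python side decoded without a BOM on a typical little-endian machine, this should reproduce the same output as the Python one-liner:

const str = "some random string";
const buf = new ArrayBuffer(2 * Math.floor(str.length / 2)); // even number of bytes
const bufView = new Uint16Array(buf);
for (let i = 0; i + 1 < str.length; i += 2) {
  // Low byte first, matching a little-endian UTF-16 decode.
  bufView[i / 2] = (str.charCodeAt(i + 1) << 8) | str.charCodeAt(i);
}
console.log(String.fromCharCode(...bufView)); // should match the Python output: 潳敭爠湡潤瑳楲杮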