125

How can I convert a string to a byte array using JavaScript? The output should be equivalent to that of the C# code below.

UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes(AnyString);

Note that UnicodeEncoding is by default UTF-16 with little-endian byte order.

Edit: I have a requirement to match the byte array generated client-side with the one generated server-side using the above C# code.

Jason Aller
shas
  • 3
    javascript is not exactly best-known for being easy to use with BLOBs - why don't you just send the string in JSON? – Marc Gravell Jun 03 '11 at 10:58
  • 2
    A Javascript string is UTF-16, or did you know this already? – Kevin Jun 03 '11 at 11:02
  • 2
    First of all why you need to convert this in javascript? – BreakHead Jun 03 '11 at 11:07
  • Maybe you can take a look [here](http://stackoverflow.com/questions/3195865/javascript-html-converting-byte-array-to-string) .. – V4Vendetta Jun 03 '11 at 11:02
  • 22
    Strings are not encoded. Yes, internally they are represented as bytes and they have an encoding, but that's essentially meaningless at the scripting level. Strings are logical collections of characters. To encode a character, you must explicitly choose an encoding scheme, which you can use to transform each character code into a sequence of one or more bytes. The answers to this question below are garbage, as they call charCodeAt and stick its value into an array called "bytes". Hello! charCodeAt can return values greater than 255, so it's not a byte! – Triynko Aug 06 '13 at 21:15
  • Try this: https://stackoverflow.com/questions/6226189/how-to-convert-a-string-to-bytearray – jchook Jun 01 '18 at 19:07
  • @jchook, you linked back to this page. Probably not your intent. – Alex Coventry Aug 17 '18 at 21:18
  • @AlexCoventry thanks. Here is the intended link: https://gist.github.com/joni/3760795 – jchook Aug 18 '18 at 01:21
  • @MarcGravell To answer your question for the scenario that brought me here: the text/JSON I preserve in an HTML UI is sometimes too large to send to the back end in an AJAX POST, so I am trying to see if I can convert it first. This is not due to the string length itself but to a JSON length restriction. Just to give an idea of why some may be looking at this. – Casey ScriptFu Pharr May 21 '20 at 04:46

12 Answers

58

Update 2018 - The easiest way in 2018 should be TextEncoder

let utf8Encode = new TextEncoder();
utf8Encode.encode("abc");
// Uint8Array [ 97, 98, 99 ]

Caveats: the result is a Uint8Array rather than a plain array, TextEncoder always encodes to UTF-8 (not the UTF-16 LE that the C# code produces), and not all browsers support it.
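
If you need a plain array of numbers, or want to check the round trip, here is a minimal sketch (assuming TextEncoder/TextDecoder are available in your environment):

// TextEncoder always produces UTF-8 bytes in a Uint8Array
const utf8Bytes = new TextEncoder().encode("abc");
const byteArray = Array.from(utf8Bytes); // plain array: [97, 98, 99]

// TextDecoder turns the bytes back into a string
const roundTrip = new TextDecoder().decode(utf8Bytes); // "abc"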

David Klempfner
code4j
  • This is peculiar. I don't suppose using different variable names as utf8Decode and utf8Encode would work. – Unihedron Mar 23 '19 at 06:28
  • You can use [TextDecoder](https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder) to decode: `new TextDecoder().decode(new TextEncoder().encode(str)) == str`. – Fons Apr 01 '20 at 15:14
  • Clarification on "not all browsers." The only contemporary browser in that table that doesn't support it now is Opera Mini. – General Grievance Nov 17 '22 at 17:06
40

If you are looking for a solution that works in node.js, you can use this:

var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}

console.log(myBuffer);
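
As the comments below note, new Buffer(...) is deprecated in current Node.js releases; a minimal sketch of the same idea using Buffer.from (Node.js only, since Buffer is not defined in browsers):

var str = 'Stack Overflow';
var buffer = Buffer.from(str, 'utf16le'); // UTF-16 little-endian, like C#'s UnicodeEncoding
var myBuffer = Array.from(buffer);        // plain array of byte values

console.log(myBuffer);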
Jin
  • 4
    This is for node.js but I think the question is looking for a solution that works in a browser. Nevertheless it does work correctly, unlike most of the other answers to this question, so +1. – Daniel Cassidy Apr 03 '16 at 19:34
  • This works but much simpler code is function convertString(myString) { var myBuffer = new Buffer(myString, 'utf16le'); console.log(myBuffer); return myBuffer; } – Philip Rutovitz Jan 27 '20 at 21:44
  • 1
    Since new Buffer has been deprecated, so should use from: var buffer = Buffer.from(str, 'utf16le'); – Geoffrey Bourne Aug 26 '21 at 00:57
  • 2
    As of Nov 5, 2021, `new Buffer` fails because `Buffer` is not defined in Chrome browser – PatS Nov 05 '21 at 15:19
  • @PatS This isn't meant for use in the browser. – Unmitigated Jan 29 '23 at 00:26
31

In C# running this

UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes("Hello");

Will create a byte array containing

72,0,101,0,108,0,108,0,111,0

For a character whose code is greater than 255, such as 竜 (U+7ADC), the two bytes of its UTF-16 code unit appear in little-endian order, e.g. 220,122.

If you want very similar behavior in JavaScript you can do this (v2 is the more robust solution, while the original version only works for character codes 0x00 ~ 0xff)

var str = "Hello竜";
var bytes = []; // char codes
var bytesv2 = []; // char codes

for (var i = 0; i < str.length; ++i) {
  var code = str.charCodeAt(i);
  
  bytes = bytes.concat([code]);
  
  bytesv2 = bytesv2.concat([code & 0xff, code / 256 >>> 0]);
}

// 72, 101, 108, 108, 111, 31452
console.log('bytes', bytes.join(', '));

// 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 220, 122
console.log('bytesv2', bytesv2.join(', '));
BrunoLM
  • 1
    I have already tried this but this gives me the different result than the above C# code. Like for this case the C# code output byte array is = 72,0,101,0,108,0,108,0,111,0 I have a requirement to match both so thats not working. – shas Jun 03 '11 at 11:55
  • @shas, seems to be the same, just adding a `0` after each character. The updated answer should do the same as `c#` now. – BrunoLM Jun 03 '11 at 12:12
  • I am getting undefined JS error for str[i]. what you are trying to access. Shouldn't it be str.charCodeAt(i) ? – shas Jun 03 '11 at 12:33
  • 2
    @shas I tested the previous only on Firefox 4. The updated version was tested on Firefox 4, Chrome 13 and IE9. – BrunoLM Jun 03 '11 at 12:37
  • 41
    Note that if the string contains unicode chars, charCodeAt(i) will be > 255, which is probably not what you want. – broofa Aug 11 '12 at 11:33
  • 25
    Yeah, this is incorrect. charCodeAt does not return a byte. It makes no sense to push a value greater than 255 into an array called "bytes"; very misleading. This function does not perform encoding at all, it just sticks the character codes into an array. – Triynko Aug 06 '13 at 21:19
  • 2
    I dont understand why this answer is marked as correct since it does not encode anything. – A.B. Sep 08 '15 at 13:05
  • 1
    This is totally wrong. A character is not a byte. This will fail for any characters outside the range U+0000 - U+00FF. – Daniel Cassidy Apr 03 '16 at 19:26
  • I think `code >>> 8` would be a more reasonable operation than `code / 256 >>> 0` – Patrick Roberts Aug 18 '18 at 01:31
16

I suppose C# and Java produce equal byte arrays. If you have non-ASCII characters, it's not enough to add an additional 0. My example contains a few special characters:

var str = "Hell ö € Ω ";
var bytes = [];
var charCode;

for (var i = 0; i < str.length; ++i)
{
    charCode = str.charCodeAt(i);
    bytes.push((charCode & 0xFF00) >> 8);
    bytes.push(charCode & 0xFF);
}

alert(bytes.join(' '));
// 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30

I don't know whether C# places a BOM (Byte Order Mark), but when using UTF-16, Java's String.getBytes adds the following bytes: 254 255.

String s = "Hell ö € Ω ";
// now add a character outside the BMP (Basic Multilingual Plane)
// we take the violin-symbol (U+1D11E) MUSICAL SYMBOL G CLEF
s += new String(Character.toChars(0x1D11E));
// surrogate codepoints are: d834, dd1e, so one could also write "\ud834\udd1e"

byte[] bytes = s.getBytes("UTF-16");
for (byte aByte : bytes) {
    System.out.print((0xFF & aByte) + " ");
}
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
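
If you want the JavaScript byte array from above to match that Java output exactly, a minimal sketch is to prepend the two BOM bytes:

var bytesWithBom = [254, 255].concat(bytes); // 0xFE 0xFF = UTF-16 BE byte order mark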

Edit:

Added a special character, (U+1D11E) MUSICAL SYMBOL G CLEF, which lies outside the BMP and therefore takes not 2 but 4 bytes in UTF-16.

Current JavaScript versions use "UCS-2" internally, so this symbol takes the space of 2 normal characters.

I'm not sure, but when using charCodeAt it seems we get exactly the surrogate code units also used in UTF-16, so non-BMP characters are handled correctly.
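
You can check this quickly in a console (the two values are the surrogate pair for U+1D11E):

var clef = "\uD834\uDD1E"; // MUSICAL SYMBOL G CLEF
console.log(clef.length);                     // 2 (two UTF-16 code units)
console.log(clef.charCodeAt(0).toString(16)); // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" (low surrogate)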

This problem is absolutely non-trivial. It might depend on the JavaScript version and engine used. So if you want a reliable solution, you should have a look at a dedicated encoding library.

hgoebl
  • 1
    Still not a complete answer. UTF16 is a variable-length encoding that uses 16-bit chunks to represent characters. A single character will either be encoded as 2 bytes or 4 bytes, depending on how big the character code value is. Since this function writes at most 2 bytes, it cannot handle all unicode character code points, and is not a complete implementation of UTF16 encoding, not by a long shot. – Triynko Aug 06 '13 at 21:24
  • @Triynko after my edit and test, do you still think this is not the complete answer? If yes, do you have an answer? – hgoebl Nov 09 '13 at 14:18
  • 2
    @Triynko You are half right, but actually this answer does work correctly. JavaScript strings are not actually sequences of Unicode Code Points, they are sequences of UTF-16 Code Units. Despite the name, `charCodeAt` returns a UTF-16 Code Unit, in the range 0-65535. Characters outside the 2-byte range are represented as surrogate pairs, just like in UTF-16. (By the way, this is true of strings in several other languages, including Java and C#.) – Daniel Cassidy Apr 03 '16 at 19:49
  • By the way, `(charCode & 0xFF00) >> 8` is redundant, you don't need to mask it before shifting. – Patrick Roberts Aug 18 '18 at 01:33
15

UTF-16 Byte Array

JavaScript encodes strings as UTF-16, just like C#'s UnicodeEncoding, so creating a byte array is relatively straightforward.

JavaScript's charCodeAt() returns a 16-bit code unit (aka a 2-byte integer between 0 and 65535). You can split it into distinct bytes using the following:

function strToUtf16Bytes(str) {
  const bytes = [];
  for (let ii = 0; ii < str.length; ii++) {
    const code = str.charCodeAt(ii); // x00-xFFFF
    bytes.push(code & 255, code >> 8); // low, high
  }
  return bytes;
}

For example:

strToUtf16Bytes('🌵'); // U+1F335, a character outside the BMP
// [ 60, 216, 53, 223 ]

This works between C# and JavaScript because they both support UTF-16. However, if you want to get a UTF-8 byte array from JS, you must transcode the bytes.

UTF-8 Byte Array

The solution feels somewhat non-trivial, but I used the code below in production with great success (original source).

Also, for the interested reader, I published my unicode helpers that help me work with string lengths reported by other languages such as PHP.

/**
 * Convert a string to a unicode byte array
 * @param {string} str
 * @return {Array} of bytes
 */
export function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) utf8.push(charCode);
    else if (charCode < 0x800) {
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      ii++;
      // Surrogate pair:
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}
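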
jchook
  • and what is the inverse of this? – simbo1905 Jun 19 '19 at 06:32
  • I would describe the inverse function as "convert a UTF-8 byte array to a native UTF-16 string". I never produced the inverse. In my env, I removed this code by changing the API output to a character range instead of a byte range, then I used [runes](https://github.com/dotcypress/runes) to parse the ranges. – jchook Jun 20 '19 at 02:51
  • I would suggest this should be the accepted answer for this question. – LeaveTheCapital Aug 13 '19 at 14:52
11

Inspired by @hgoebl's answer. His code is for UTF-16 and I needed something for US-ASCII. So here's a more complete answer covering US-ASCII, UTF-16, and UTF-32.

/**@returns {Array} bytes of US-ASCII*/
function stringToAsciiByteArray(str)
{
    var bytes = [];
   for (var i = 0; i < str.length; ++i)
   {
       var charCode = str.charCodeAt(i);
      if (charCode > 0xFF)  // char > 1 byte since charCodeAt returns the UTF-16 value
      {
          throw new Error('Character ' + String.fromCharCode(charCode) + ' can\'t be represented by a US-ASCII byte.');
      }
       bytes.push(charCode);
   }
    return bytes;
}
/**@returns {Array} bytes of UTF-16 Big Endian without BOM*/
function stringToUtf16ByteArray(str)
{
    var bytes = [];
    //currently the function returns without BOM. Uncomment the next line to change that.
    //bytes.push(254, 255);  //Big Endian Byte Order Marks
   for (var i = 0; i < str.length; ++i)
   {
       var charCode = str.charCodeAt(i);
       //char > 2 bytes is impossible since charCodeAt can only return 2 bytes
       bytes.push((charCode & 0xFF00) >>> 8);  //high byte (might be 0)
       bytes.push(charCode & 0xFF);  //low byte
   }
    return bytes;
}
/**@returns {Array} bytes of UTF-32 Big Endian without BOM*/
function stringToUtf32ByteArray(str)
{
    var bytes = [];
    //currently the function returns without BOM. Uncomment the next line to change that.
    //bytes.push(0, 0, 254, 255);  //Big Endian Byte Order Marks
    for (var i = 0; i < str.length; i++)
    {
        var charPoint = str.codePointAt(i);
        //char > 4 bytes is impossible since codePointAt can only return 4 bytes
        if (charPoint > 0xFFFF) i++;  //this code point used a surrogate pair (2 UTF-16 code units), so skip the low surrogate
        bytes.push((charPoint & 0xFF000000) >>> 24);
        bytes.push((charPoint & 0xFF0000) >>> 16);
        bytes.push((charPoint & 0xFF00) >>> 8);
        bytes.push(charPoint & 0xFF);
    }
    return bytes;
}
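
A few example calls (byte values worked out by hand; the functions above produce big-endian output):

stringToAsciiByteArray('Hi');  // [72, 105]
stringToUtf16ByteArray('€');   // [32, 172]  (U+20AC)
stringToUtf32ByteArray('€');   // [0, 0, 32, 172]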

UTF-8 isn't included because it is variable length and I would have to write the encoding myself. UTF-8 and UTF-16 are variable length; UTF-8, UTF-16, and UTF-32 each have a minimum number of bits, as their names indicate. If a UTF-32 character has a code point of 65, that means there are 3 leading zero bytes, whereas the same code in UTF-16 has only 1. US-ASCII, on the other hand, fits in a single fixed-width byte, so it can be translated to bytes directly.

String.prototype.charCodeAt returns at most 2 bytes and matches UTF-16 exactly. For UTF-32, however, String.prototype.codePointAt is needed, which is part of the ECMAScript 6 (Harmony) proposal. Because charCodeAt returns 2 bytes, which covers more possible characters than US-ASCII can represent, the function stringToAsciiByteArray throws in such cases instead of splitting the character in half and taking either or both bytes.

Note that this answer is non-trivial because character encoding is non-trivial. What kind of byte array you want depends on what character encoding you want those bytes to represent.

JavaScript has the option of internally using either UTF-16 or UCS-2, but since it has methods that act as though it were UTF-16, I don't see why any browser would use UCS-2. Also see: https://mathiasbynens.be/notes/javascript-encoding

Yes I know the question is 4 years old but I needed this answer for myself.

SkySpiral7
  • Node's Buffer results for `'02'` are `[ 48, 0, 50, 0 ]` where as your `stringToUtf16ByteArray` function returns `[ 0, 48, 0, 50 ]`. which one is correct? – Philipp Kyeck Nov 21 '18 at 14:41
  • @pkyeck My stringToUtf16ByteArray function above returns UTF-16 BE without BOM. The example you gave from node is UTF-16 LE without BOM. I had thought Big-endian was more normal than little-endian but could be wrong. – SkySpiral7 Nov 25 '18 at 20:04
2

Since I cannot comment on the answer, I'd build on Jin Izzraeel's answer

var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}

console.log(myBuffer);

by saying that you could use this if you want to use a Node.js buffer in your browser.

https://github.com/feross/buffer

Therefore, Tom Stickel's objection is not valid, and the answer is indeed a valid answer.
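
A minimal sketch of that approach in a browser bundle (assuming the buffer npm package from the link above is installed):

// The trailing slash makes the bundler resolve the npm "buffer" package
// rather than Node's built-in module.
var Buffer = require('buffer/').Buffer;

var bytes = Array.from(Buffer.from('Stack Overflow', 'utf16le'));
console.log(bytes); // UTF-16 LE byte values, as in the C# code from the question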

mmdts
1
String.prototype.encodeHex = function () {
    return this.split('').map(e => e.charCodeAt())
};

// encodeHex returns an array of character codes, so the inverse is defined on Array.prototype
Array.prototype.decodeHex = function () {
    return this.map(e => String.fromCharCode(e)).join('')
};
  • 4
    It would be helpful if you provide some text to go along with the code to explain why one might choose this approach rather than one of the other answers. – NightOwl888 Feb 16 '18 at 20:27
  • this approach is simpler than others but do the same, that's the reason i did not wrote anything. – Fabio Maciel Feb 20 '18 at 14:30
  • `encodeHex` will return an array of 16-bit numbers, not bytes. – Pavlo Dec 16 '19 at 09:25
0

The best solution I've come up with on the spot (though most likely crude) would be:

String.prototype.getBytes = function() {
    var bytes = [];
    for (var i = 0; i < this.length; i++) {
        var charCode = this.charCodeAt(i);
        var cLen = Math.ceil(Math.log(charCode)/Math.log(256));
        for (var j = 0; j < cLen; j++) {
            bytes.push((charCode << (j*8)) & 0xFF);
        }
    }
    return bytes;
}

Though I notice this question has been here for over a year.

Whosdr
  • 2
    This does not work correctly. The variable length character logic is incorrect, there are no 8-bit characters in UTF-16. Despite the name, `charCodeAt` returns a 16-bit UTF-16 Code Unit, so you don't need any variable length logic. You can just call charCodeAt, split the result into two 8-bit bytes, and stuff them in the output array (lowest-order byte first since the question asks for UTF-16LE). – Daniel Cassidy Apr 03 '16 at 19:58
0

I know the question is almost 4 years old, but this is what worked smoothly for me:

String.prototype.encodeHex = function () {
  var bytes = [];
  for (var i = 0; i < this.length; ++i) {
    bytes.push(this.charCodeAt(i));
  }
  return bytes;
};

Array.prototype.decodeHex = function () {    
  var str = [];
  var hex = this.toString().split(',');
  for (var i = 0; i < hex.length; i++) {
    str.push(String.fromCharCode(hex[i]));
  }
  return str.toString().replace(/,/g, "");
};

var str = "Hello World!";
var bytes = str.encodeHex();

alert('The Hexa Code is: '+bytes+' The original string is:  '+bytes.decodeHex());

or, if you want to work with strings only, and no Array, you can use:

String.prototype.encodeHex = function () {
  var bytes = [];
  for (var i = 0; i < this.length; ++i) {
    bytes.push(this.charCodeAt(i));
  }
  return bytes.toString();
};

String.prototype.decodeHex = function () {    
  var str = [];
  var hex = this.split(',');
  for (var i = 0; i < hex.length; i++) {
    str.push(String.fromCharCode(hex[i]));
  }
  return str.toString().replace(/,/g, "");
};

var str = "Hello World!";
var bytes = str.encodeHex();

alert('The Hexa Code is: '+bytes+' The original string is:  '+bytes.decodeHex());
Hasan A Yousef
  • 2
    This sort of works, but is extremely misleading. The `bytes` array does not contain 'bytes', it contains 16-bit numbers, which represent the string in UTF-16 code units. This is nearly what the question asked for, but really only by accident. – Daniel Cassidy Apr 03 '16 at 20:07
-2

Here is the same function that @BrunoLM posted converted to a String prototype function:

String.prototype.getBytes = function () {
  var bytes = [];
  for (var i = 0; i < this.length; ++i) {
    bytes.push(this.charCodeAt(i));
  }
  return bytes;
};

If you define the function as such, then you can call the .getBytes() method on any string:

var str = "Hello World!";
var bytes = str.getBytes();
mweaver
  • 32
    This is still incorrect, just like the answer it references. charCodeAt does not return a byte. It makes no sense to push a value greater than 255 into an array called "bytes"; very misleading. This function does not perform encoding at all, it just sticks the character codes into an array. To perform UTF16 encoding, you have to examine the character code, decide whether you will need to represent it with 2 bytes or 4 bytes (since UTF16 is a variable-length encoding), and then write each byte to the array individually. – Triynko Aug 06 '13 at 21:20
  • 9
    Also, it is bad practice to modify the prototype of native data types. – Andrew Lundin Oct 30 '13 at 18:18
  • @AndrewLundin, that's interesting... says who? – Jerther Feb 06 '15 at 19:06
  • 2
    @Jerther: http://stackoverflow.com/questions/14034180/why-is-extending-native-objects-a-bad-practice – Andrew Lundin Feb 07 '15 at 03:11
-3

You don't need underscore, just use built-in map:

var string = 'Hello World!';

document.write(string.split('').map(function(c) { return c.charCodeAt(); }));
DaAwesomeP
  • 1
    This returns an array of 16-bit numbers representing the string as a sequence of UTF-16 code points. That’s not what the OP asked for, but at least it gets you part way there. – Daniel Cassidy Jul 13 '16 at 11:27