
I was wondering why the string 'foo', when converted to a Buffer with different encodings, produces different results:

Buffer.from('foo', 'utf-8') /* <Buffer 66 6f 6f> */

Buffer.from('foo', 'ascii') /* <Buffer 66 6f 6f> */

Buffer.from('foo', 'base64') /* <Buffer 7e 8a> */

Buffer.from('foo', 'utf16le') /* <Buffer 66 00 6f 00 6f 00> */

I probably don't understand buffers enough. Here's what I know about buffers:

A buffer is an area of memory.

It represents a fixed-size chunk of memory (can't be resized)

You can think of a buffer like an array of integers, each of which represents a byte of data (the sketch below shows this).

The way I understand it (in a very simplistic way): the string 'foo' can only be stored as binary, and a character encoding is the rule for converting data from whatever format it is in into binary.
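
For example, here is a minimal sketch of that "array of integers" view, using nothing but the standard Buffer API:

const buf = Buffer.from('foo', 'utf-8');

console.log(buf.length);            // 3, one byte per character in this case
console.log(buf[0]);                // 102 (0x66, the byte for 'f')
console.log(buf[1]);                // 111 (0x6f, the byte for 'o')
console.log(buf.toString('utf-8')); // 'foo', decoding the bytes back into a string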

My question now is, why does the character encoding change the result of the buffer?

nkhil
  • A simple Google search reveals that UTF-8 characters are represented by 1-4 bytes (variable length, 1 byte each in this case), ASCII is 1 byte (fixed length), base64 is not simply explainable, and UTF-16LE characters are represented by 2 or 4 bytes (variable length, 2 bytes each in this case). – Molda Oct 06 '21 at 14:03
  • @Molda Thanks, but this is still not clear to me. Can you explain how a `Buffer` uses the encoding in the process of creating a buffer? – nkhil Oct 06 '21 at 18:15
  • I'm not sure how exactly it works, but it could be as simple as this: it has a map of characters for each of the encodings, for example `var utf8 = { ..., 'f': 0x66, ..., 'o': 0x6f, ... }`, then it allocates some memory and writes the bytes one after another into that memory. So 66 for the 'f' and twice 6f for the 'o', and you get 0x666f6f, which is 011001100110111101101111 in binary. This is a very simple explanation and in reality it's more complex, but I'm sure there's no magic. – Molda Oct 06 '21 at 21:14
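
As a rough illustration of the idea in the comment above (the lookup table here is purely hypothetical; Node's real encoders are implemented natively), writing the bytes by hand produces the same buffer:

// Toy lookup table from characters to byte values; only an illustration.
const toyAscii = { f: 0x66, o: 0x6f };

const bytes = Buffer.alloc(3);   // a fixed-size chunk of memory
'foo'.split('').forEach((ch, i) => {
  bytes[i] = toyAscii[ch];       // write one byte per character
});

console.log(bytes);                                      // <Buffer 66 6f 6f>
console.log(bytes.equals(Buffer.from('foo', 'ascii'))); // true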

1 Answer


Because JavaScript internally stores strings using the (now superseded) UCS-2/UTF-16 Unicode encoding, and a character encoding is a way of mapping Unicode code points to sequences of bytes. Each encoding you pass to Buffer.from() is a different mapping, so the same characters produce different bytes:

  • US-ASCII represents the first 128 ASCII characters (0x00–0x7F) as a single octet (byte). Anything outside that range cannot be represented, because that is all ASCII covers. (There are historical ASCII flavors that use the high-order bit of the octet as a parity bit, some with even parity, others with odd parity.)

  • UTF-8 is a way of encoding all the Unicode code points in 1–4 8-bit "code units". The first 128 code points (US-ASCII, U+0000–U+007F) get a single octet; the next 1,920 (U+0080–U+07FF) require 2 octets to encode, and so on.

  • UTF-16 is similar, but uses 2-octet code units, so every Unicode code point (character, glyph) occupies at least 2 octets (1 code unit). This also introduces the notion of byte order (big-endian or little-endian), so a plain "UTF-16" stream is normally prefixed with the Unicode BOM (byte order mark), making the shortest possible UTF-16 encoding of a single-character string 4 octets (2 for the BOM and 2 for the single ASCII code point). Node's 'utf16le' fixes the byte order as little-endian, so no BOM is written, which is why 'foo' becomes 6 octets above.

  • Base64 is a way of encoding an arbitrary sequence of octets as printable ASCII text for safe transmission over the wire. Passing 'base64' to Buffer.from() therefore goes the other way: the string 'foo' is treated as base64-encoded data and decoded, which is why you get the two unrelated octets 7e 8a instead of the character codes for 'f' and 'o' (see the sketch below).
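
A minimal sketch that makes these differences visible, using only the standard Buffer API; the string 'fé' is just an arbitrary example containing one non-ASCII character (é, U+00E9):

const s = 'fé';

console.log(Buffer.from(s, 'utf-8'));   // <Buffer 66 c3 a9>     é needs 2 octets in UTF-8
console.log(Buffer.from(s, 'utf16le')); // <Buffer 66 00 e9 00>  every character takes 2 octets, no BOM
console.log(Buffer.from(s, 'ascii'));   // <Buffer 66 e9>        'ascii' cannot represent é; Node encodes it like latin1 (0xe9)

// 'base64' does not encode the characters of 'foo'; it decodes them:
// f=31, o=40, o=40 -> 011111 101000 101000 -> 01111110 10001010 -> 0x7e 0x8a
console.log(Buffer.from('foo', 'base64'));                  // <Buffer 7e 8a>
console.log(Buffer.from([0x7e, 0x8a]).toString('base64')); // 'foo=', going the other way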

Nicholas Carey