Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings

Question

I'm using the JavaScript window.atob() function to decode a base64-encoded string (specifically the base64-encoded content from the GitHub API). The problem is that I get mis-decoded characters back (like â¢ instead of ™). How can I properly handle the incoming base64-encoded stream so that it's decoded as UTF-8?

The MDN page you linked has a paragraph starting with the phrase "For use with Unicode or UTF-8 strings,". — Pointy, May 07 '15 at 16:16

brandonscript · Accepted Answer · 2023-06-25T22:44:06.047

499

The Unicode Problem

Though JavaScript (ECMAScript) has matured, the fragility of Base64, ASCII, and Unicode encoding has caused a lot of headaches (much of it is in this question's history).

Consider the following example:

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); //   61: fits in 1 byte

const notOK = "✓";
console.log(notOK.codePointAt(0).toString(16)); // 2713: needs more than 1 byte

console.log(btoa(ok));    // YQ==
console.log(btoa(notOK)); // throws InvalidCharacterError

Why do we encounter this?

Base64, by design, expects binary data as its input. In terms of JavaScript strings, this means strings in which each character occupies only one byte. So if you pass a string into btoa() containing characters that occupy more than one byte, you will get an error, because this is not considered binary data.

Source: MDN (2021)

The original MDN article also covered the broken nature of window.btoa and window.atob, which have since been mended in modern browsers. That article is now dead, but it explained:

The "Unicode Problem" Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a UTF-8 string will cause a Character Out Of Range exception if a character exceeds the range of a 8-bit byte (0x00~0xFF).
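To make the restriction concrete, here is a minimal sketch that checks whether a string is a valid "binary string" before handing it to btoa. The helper names isBinaryString and tryBtoa are hypothetical, introduced only for illustration.

```javascript
// Hypothetical helper: true when every UTF-16 code unit fits in one byte,
// which is the only kind of string btoa() accepts.
function isBinaryString(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

// Hypothetical helper: returns the Base64 text, or null when btoa() rejects the input.
function tryBtoa(str) {
  try {
    return btoa(str);
  } catch (err) {
    return null; // btoa throws an InvalidCharacterError for code units above 0xFF
  }
}

console.log(isBinaryString("a"), tryBtoa("a")); // true "YQ=="
console.log(isBinaryString("é"), tryBtoa("é")); // true "6Q=="
console.log(isBinaryString("✓"), tryBtoa("✓")); // false null
```

Note that "é" (U+00E9) is accepted, but btoa encodes it as the single byte 0xE9 rather than as its two-byte UTF-8 form, so a successful call does not by itself mean the output is UTF-8.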

Solution with binary interoperability

(Keep scrolling for the ASCII base64 solution)

Source: MDN (2021)

The solution recommended by MDN is to actually encode to and from a binary string representation:

Encoding UTF-16 (native JavaScript string) ⇢ binary

// convert a JavaScript (UTF-16) string to a binary string in which
// each 16-bit code unit is split into two one-byte characters,
// then Base64-encode that binary string
function toBinary(string) {
  const codeUnits = new Uint16Array(string.length);
  for (let i = 0; i < codeUnits.length; i++) {
    codeUnits[i] = string.charCodeAt(i);
  }
  return btoa(String.fromCharCode(...new Uint8Array(codeUnits.buffer)));
}

// a string that contains characters occupying > 1 byte
let encoded = toBinary("✓ à la mode"); // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="

Decoding binary ⇢ UTF-16 (native JavaScript string)

function fromBinary(encoded) {
  const binary = atob(encoded);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

// our previous Base64-encoded string
let decoded = fromBinary(encoded) // "✓ à la mode"

One drawback: the encoded string EycgAOAAIABsAGEAIABtAG8AZABlAA== does not match 4pyTIMOgIGxhIG1vZGU=, the string produced by the UTF-8 solution described below. This is because it is the Base64 of a native JavaScript (UTF-16LE) string, not of a UTF-8-encoded string. If this doesn't matter to you (i.e., you aren't exchanging strings with another system, or you are fine with JavaScript's native UTF-16LE encoding), then you're good to go. If, however, you need UTF-8 output, you're better off using the solution described below.
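The two Base64 strings can be reproduced side by side by encoding the same text once as UTF-16LE bytes and once as UTF-8 bytes. This is only a sketch: base64FromBytes and utf16leBytes are hypothetical helper names, and a DataView is used so that the little-endian byte order is explicit rather than dependent on the platform.

```javascript
// Hypothetical helper: Base64-encode raw bytes via a one-byte-per-character string.
function base64FromBytes(bytes) {
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Hypothetical helper: UTF-16LE bytes, two per code unit, low byte first.
function utf16leBytes(str) {
  const view = new DataView(new ArrayBuffer(str.length * 2));
  for (let i = 0; i < str.length; i++) {
    view.setUint16(i * 2, str.charCodeAt(i), true); // true = little-endian
  }
  return new Uint8Array(view.buffer);
}

const text = "✓ à la mode";
console.log(base64FromBytes(utf16leBytes(text)));             // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="
console.log(base64FromBytes(new TextEncoder().encode(text))); // "4pyTIMOgIGxhIG1vZGU="
```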

Solution with ASCII base64 interoperability

The entire history of this question shows just how many different ways we've had to work around broken encoding systems over the years. Though the original MDN article no longer exists, this solution is still arguably a better one, and does a great job of solving "The Unicode Problem" while maintaining plain text base64 strings that you can decode on, say, base64decode.org.

There are two possible methods to solve this problem:

the first one is to escape the whole string (see encodeURIComponent) and then encode it;

the second one is to convert the UTF-16 DOMString to an unsigned 8-bit integer array (Uint8Array) of characters and then encode it.
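The second method in the list above can be sketched with TextEncoder, which produces the UTF-8 bytes directly. The function name encodeBase64Utf8 and the chunk size are assumptions made for this example; the chunking is there only because some engines limit how many arguments String.fromCharCode accepts in one call.

```javascript
// Sketch of the second method: string -> UTF-8 bytes -> binary string -> Base64.
function encodeBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  const CHUNK = 0x8000; // arbitrary chunk size to keep argument lists short
  let binary = "";
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}

console.log(encodeBase64Utf8("✓ à la mode")); // "4pyTIMOgIGxhIG1vZGU="
```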

A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent; this has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize the code.

If you're trying to save yourself some time, you could also consider using a library:

js-base64 (NPM, great for Node.js)
base64-js

Encoding UTF-8 ⇢ base64

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded UTF-8,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            // parse the two hex digits explicitly instead of relying on '0x' + p1 coercion
            return String.fromCharCode(parseInt(p1, 16));
        }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF-8

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"

(Why do we need ('00' + c.charCodeAt(0).toString(16)).slice(-2)? It left-pads single-digit hex values with a 0. For example, when c is "\n", c.charCodeAt(0).toString(16) returns "a", which has to be written as "0a" to form a valid percent escape.)
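A small sketch of why the padding matters: a percent escape needs exactly two hex digits, so an unpadded value makes decodeURIComponent throw. The helper name toPercentByte is hypothetical; its body is the same expression used in b64DecodeUnicode.

```javascript
// Same expression as in b64DecodeUnicode, pulled out for illustration.
function toPercentByte(c) {
  return "%" + ("00" + c.charCodeAt(0).toString(16)).slice(-2);
}

console.log("\n".charCodeAt(0).toString(16));    // "a" (one hex digit only)
console.log(toPercentByte("\n"));                // "%0a"
console.log(decodeURIComponent("%0a") === "\n"); // true

// Without the padding the escape is malformed and decoding throws a URIError.
let threw = false;
try {
  decodeURIComponent("%a");
} catch (err) {
  threw = err instanceof URIError;
}
console.log(threw); // true
```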

TypeScript support

Here's the same solution with some additional TypeScript compatibility (via @MA-Maddin):

// Encoding UTF-8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

// Decoding base64 ⇢ UTF-8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

The first solution (deprecated)

This used escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work on (Mobile) Safari properly, I actually had to strip all white space from the base64 source before I could even decode the source. Whether or not this is still relevant in 2021, I don't know:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');    
    return decodeURIComponent(escape(window.atob( str )));
}
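For completeness, here is a sketch of the same whitespace-stripping idea without escape, using TextDecoder instead. The function name decodeWrappedBase64 and the line-wrapped sample payload are made up for this example; the sample simply assumes the Base64 arrives broken across lines.

```javascript
// Hypothetical helper: strip whitespace, Base64-decode, then interpret the bytes as UTF-8.
function decodeWrappedBase64(wrapped) {
  const clean = wrapped.replace(/\s/g, "");
  const binary = atob(clean);
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return new TextDecoder().decode(bytes); // TextDecoder defaults to UTF-8
}

console.log(decodeWrappedBase64("4pyTIMOg\nIGxhIG1v\nZGU=\n")); // "✓ à la mode"
```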

edited Jun 25 '23 at 22:44

answered May 07 '15 at 16:16

brandonscript


1

http://www.w3schools.com/jsref/jsref_unescape.asp "The unescape() function was deprecated in JavaScript version 1.5. Use decodeURI() or decodeURIComponent() instead." – Tedd Hansen Feb 17 '16 at 06:30
2

**Update:** Solution #1 in MDN's [The "Unicode Problem"](https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem) was fixed, `b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU=');` now correctly output "✓ à la mode" – weeix Jun 13 '16 at 06:57
2

Another way to decode would be `decodeURIComponent(atob('4pyTIMOgIGxhIG1vZGU=').split('').map(x => '%' + x.charCodeAt(0).toString(16)).join(''))` Not the most performant code, but it is what it is. – daniel.gindi Oct 05 '16 at 11:35
5

`return String.fromCharCode(parseInt(p1, 16));` to have TypeScript compatibility. – Martin Schneider Jul 06 '17 at 09:51
1

I have same issue can you please check it https://jsfiddle.net/parthjasani/hz5713b0/2/ – Parth Jasani Jan 03 '18 at 13:01
1

It seems the MDN article has been updated and the explanation has now been moved here: https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/btoa#Unicode_strings – Oliver Joseph Ash Nov 20 '20 at 20:21
Why does this line first prepend '00' and then picks the last two chars? '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2) Why doesn't it simply do this? '%' + c.charCodeAt(0).toString(16) – Milad Mar 19 '21 at 09:36
1

@brandonscript appending '00' and slicing was a bit confusing (for me and @Milad), so I added a comment about that on answer. can you take a look at it? – yaya Mar 19 '21 at 22:33
1

Yeah I’m not sure either. And this keeps getting changed by ECMAScript and Mozilla, so the answer keeps changing. I don’t even see this solution in the link anymore. I’ll look at modernizing this answer since it’s such a high-traffic result. – brandonscript Mar 20 '21 at 01:31
Thank you @yaya, I didn’t realize it was left padding! I wish IE supported str.padStart(2, '0'). Question: would the solution be functionally equivalent with that? – Milad Mar 21 '21 at 07:46
@Milad I think they are same. the only difference is that `.slice(-2)` makes sure that max length is 2 (for example it converts `35f` to `5f`), but `.padStart(2, '0')` doesn't. but I think all utf-8 encoded characters are less than `ff`, otherwise, this solution wasn't totally correct. but to make sure it doesn't throw an error for unsupported outranged characters (if there are any), it's safer to not change it. – yaya Mar 21 '21 at 09:48
@yaya in theory, even if the original characters occupied more than 1 byte (e.g., 'à'), when translated to base64 they become 1-byte character sequences (since base64 produces ascii chars), so it should never be longer than 2 characters - for example: 'à' -[base64]-> 'w6A=' -[atob]-> 'Ã ' -[%encoding]-> '%c3%a0' -[decodeURIComponent]-> 'à' – Milad Mar 21 '21 at 12:03
@brandonscript thanks, but honestly i don't like the edit at all. previously there was a straightforward encode and decode function (2018-2021), but the 2021 solution doesn't have it, and it seems detailed and so long for busy developers. (I didn't understand it also.). – yaya Apr 01 '21 at 15:16
1

@yaya I hate it too, tbh. The trouble is that the information provided was conflicting with MDN, and there are some reasonably good reasons why. I will continue to monitor the sentiment of this answer and update it to make sure it's as helpful as possible. I did just make some minor changes to improve the heading, and hopefully explain better why there are two answers now. – brandonscript Apr 01 '21 at 15:56
@brandonscript thanks. i read it again and i get it now. the confusing part for me is that the first block of code doesn't contain the solution, and the first block of description also doesn't contain the solution. so maybe you can format it like : 1. (code description) first convert it to binary, then decode it. 2. code solution, the btoa(fromBinary(...)) code. 3. describing the problem and the code that describes the problem. or something like this. (it's just a suggestion, please don't apply it if you don't like it.) – yaya Apr 01 '21 at 16:05
See what you think now. Good ideas. – brandonscript Apr 01 '21 at 16:19
@brandonscript Sorry for the late reply. (it's your post, so when you don't mention me with @, I don't get any notifications.). I think that's well-formatted now. the only problem is that `Encoding UTF8 ⇢ binary` part doesn't contain the usage code (`let encoded = btoa(toBinary("✓ à la mode"))`). – yaya Apr 02 '21 at 07:26
@brandonscript and also maybe changing `btoa(toBinary("✓"))` to a single function Is more cool, like : `binaryEncode("✓")`. (just like base64 version) – yaya Apr 02 '21 at 07:30
@Brandonscript thanks, now the only concern is the function name. I'm not sure but shouldn't it be like : `b64BinaryEncode`? (since you combined the toBinary and btoa). I'm not sure about it however. – yaya Apr 03 '21 at 17:48
For 4-byte Unicode character support you can use `Uint32Array` with `String.fromCodePoint`/`codePointAt` instead of `Uint16Array` with `String.fromCharCode`/`charCodeAt`. Also note that in some browsers there's a limit to the number of arguments you can pass to `String.fromCodePoint`, so it might not work for very long strings. – apokryfos Dec 07 '22 at 09:57
The headings "Encoding UTF8 ⇢ binary" and "Decoding binary ⇢ UTF-8" under "binary interoperability" are incorrect. The first of these takes a UTF-16 string (Javascript's native representation), encodes that string as individual UTF-16 code units, takes the binary representation of these code units, encodes each byte as an individual character in a binary string, and converts this binary string to base64 using btoa(). The second example does the reverse. At no point is UTF-8 involved. I corrected this in an edit but it has been reverted by the original author (improperly, I believe). – Adrian Lopez Jun 17 '23 at 21:34
Indeed, it should be easy enough to confirm that the so-called "UTF-8 ⇢ binary" conversion above is actually a UTF-16 to binary conversion by entering the supplied input text "✓ à la mode" into base64encode.org and choosing UTF-16LE as the destination character set. The output will be the same as for the mislabeled example. Likewise, typing "EycgAOAAIABsAGEAIABtAG8AZABlAA==" into base64decode.org and choosing UTF-16LE as the source character set will give back the original string. Therefore, the first example's base64 output is actually a binary encoding of a UTF-16LE string, not UTF-8. – Adrian Lopez Jun 17 '23 at 21:54
As further confirmation, please compare the results produced by the Javascript examples above with a couple of examples in Python: `base64.b64encode('✓ à la mode'.encode('utf-8'))` will produce the base64 string `'4pyTIMOgIGxhIG1vZGU='`, while `base64.b64encode('✓ à la mode'.encode('utf-16le'))` will produce the base64 string '`EycgAOAAIABsAGEAIABtAG8AZABlAA=='`. – Adrian Lopez Jun 18 '23 at 08:00
@AdrianLopez Doesn't that mean that the code encodes UTF-8? Both base64encode.org and Python agree that `'✓ à la mode'` base64 encoded with UTF-8 is `'4pyTIMOgIGxhIG1vZGU='`. That's what OP's code outputs too. – Michael M. Jun 25 '23 at 18:08
@brandonscript, thank you for explaining your rationale. I won't attempt any further edits, but what I'm trying to get across here is that JavaScript strings are UTF-16 to begin with, which is why calling `charCodeAt()` on the characters in the string will give you a series of UTF-16 code units (one for each character) which the `toBinary()` function then converts to a binary string (still in UTF-16). In reverse, it's also why `fromBinary()` needs to reinterpret the `Uint8Array` as an`Uint16Array` before calling `fromCharCode()`, because the values are UTF-16 each split into two bytes. – Adrian Lopez Jun 26 '23 at 23:10
1

@brandonscript, A key observation is the fact that Base64 deals with arbitrary 8-bit values while JavaScript deals with 16-bit code units. You can split a UTF-16 string into 8-bit bytes and encode that as Base64, but it's still a UTF-16 string. To produce something in JavaScript like what you're getting from GitHub you must first convert from UTF-16 to UTF-8 and only then encode the resulting 8-bit code units as Base64. So, for encoding, it's UTF-16 to UTF-8 to Base64, while for decoding it's Base64 to UTF-8 to UTF-16. I've made an attempt to explain this in my own answer to your question. – Adrian Lopez Jun 26 '23 at 23:37
@MichaelM. the function `b64EncodeUnicode()` does indeed encode UTF-8 into Base64. What I'm getting at is that the function `toBinary()` that appears above it encodes UTF-16 into Base64. The string '`✓ à la mode`' Base64 encoded with UTF-16LE is `EycgAOAAIABsAGEAIABtAG8AZABlAA==`, same as what the OP's `toBinary()` function outputs. – Adrian Lopez Jun 26 '23 at 23:50
@AdrianLopez great! That works much better as an answer, upvoted. I'm going to delete my thread of comments here to keep this clean, might be worth it for you to do the same for anything that is already captured in your answer. – brandonscript Jun 28 '23 at 14:17

Tedd Hansen · Answer 2 · 2017-09-04T08:53:03.540

37

Things change. The escape/unescape methods have been deprecated.

You can URI-encode the string before you Base64-encode it. Note that this doesn't produce Base64-encoded UTF-8, but rather Base64-encoded URL-encoded data. Both sides must agree on the same encoding.

See working example here: http://codepen.io/anon/pen/PZgbPW

// encode string
var base64 = window.btoa(encodeURIComponent('€ 你好 æøåÆØÅ'));
// decode string
var str = decodeURIComponent(window.atob(base64));
// str is now === '€ 你好 æøåÆØÅ'

For OP's problem a third party library such as js-base64 should solve the problem.
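The difference between the two encodings can be seen on a single character. This is an illustrative sketch only; the variable names are made up for the example.

```javascript
const input = "à";

// Approach from this answer: Base64 of the percent-encoded text "%C3%A0".
const uriThenBase64 = btoa(encodeURIComponent(input));

// Base64 of the UTF-8 bytes themselves (0xC3 0xA0).
const rawUtf8 = new TextEncoder().encode(input);
const utf8Base64 = btoa(String.fromCharCode(...rawUtf8));

console.log(uriThenBase64); // "JUMzJUEw"
console.log(utf8Base64);    // "w6A="

// A plain Base64 decoder on the other side sees the percent-encoded text,
// so it must also URI-decode to recover the original string.
console.log(atob(uriThenBase64));                     // "%C3%A0"
console.log(decodeURIComponent(atob(uriThenBase64))); // "à"
```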

edited Sep 04 '17 at 08:53

answered Feb 17 '16 at 07:03

Tedd Hansen


1

I'd like to point out that you're not producing the base64 of the input string, but of his encoded component. So if you send it away the other party cannot decode it as "base64" and get the original string – Riccardo Galli Apr 07 '17 at 06:10
3

You are correct, I have updated the text to point that out. Thanks. The alternative seems to be implementing base64 yourself, using a third party library (such as js-base64) or receiving "Error: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range." – Tedd Hansen Sep 04 '17 at 08:50

score 34 · Answer 3 · answered Nov 09 '20 at 13:09

Decoding base64 to UTF8 String

Below is the current most-voted answer, by @brandonscript:

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

The above code works, but it's very slow. If your input is a very large base64 string (for example, 30,000 characters for a base64-encoded HTML document), it needs a lot of computation.

Here is my answer: use the built-in TextDecoder, which is nearly 10x faster than the above code for large input.

function decodeBase64(base64) {
    const text = atob(base64);
    const length = text.length;
    const bytes = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        bytes[i] = text.charCodeAt(i);
    }
    const decoder = new TextDecoder(); // default is utf-8
    return decoder.decode(bytes);
}
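By default TextDecoder silently replaces invalid UTF-8 with U+FFFD. If you would rather detect bad input, the constructor accepts a fatal option. The variant below is a sketch; the function name decodeBase64Strict is hypothetical.

```javascript
// Hypothetical strict variant of decodeBase64 that rejects invalid UTF-8.
function decodeBase64Strict(base64) {
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
  // fatal: true makes decode() throw a TypeError on invalid UTF-8
  // instead of substituting U+FFFD replacement characters.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log(decodeBase64Strict("4pyTIMOgIGxhIG1vZGU=")); // "✓ à la mode"

// "/w==" is the single byte 0xFF, which is never valid in UTF-8.
try {
  decodeBase64Strict("/w==");
} catch (err) {
  console.log(err instanceof TypeError); // true
}
```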

This is actually a pretty cool solution. I think it wouldn't have worked in the past, because atob and btoa were broken, but now they're not. — brandonscript, Apr 01 '21 at 16:20
This is also about 7x faster than the one-liner from the other answer: `new TextDecoder().decode(Uint8Array.from(atob(b64), c => c.charCodeAt(0)))` — geekley, May 12 '23 at 06:25
Thank you this solution also works with ```atob``` from ```react-native-quick-base64``` — Huan Huynh, Aug 17 '23 at 09:04

score 25 · Answer 4 · answered Jun 23 '20 at 14:51


The complete article that works for me: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding

The part where we encode from Unicode/UTF-8 is

function utf8_to_b64( str ) {
   return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
   return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

This is one of the most used methods nowadays.


Enrike


2

Works for me as I am trying to decode Github API response which contains German umlaut. Thank you!! – Khanh Hua Sep 22 '20 at 20:01
unescape seems about to become deprecated https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape – ZalemCitizen Dec 29 '20 at 16:48

score 23 · Answer 5 · answered Apr 07 '17 at 06:28


If treating strings as bytes is more your thing, you can use the following functions

function u_atob(ascii) {
    return Uint8Array.from(atob(ascii), c => c.charCodeAt(0));
}

function u_btoa(buffer) {
    var binary = [];
    var bytes = new Uint8Array(buffer);
    for (var i = 0, il = bytes.byteLength; i < il; i++) {
        binary.push(String.fromCharCode(bytes[i]));
    }
    return btoa(binary.join(''));
}


// example; it also works with astral-plane characters (code points above U+FFFF)
var encodedString = new TextEncoder().encode('✓');
var base64String = u_btoa(encodedString);
console.log('✓' === new TextDecoder().decode(u_atob(base64String)))


Riccardo Galli


1

Thanks. Your answer was crucial in helping me get this working, which took me many hours over multiple days. +1. https://stackoverflow.com/a/51814273/470749 – Ryan Aug 13 '18 at 01:52
For a much faster and more cross-browser solution (but essentially the same output), please see https://stackoverflow.com/a/53433503/5601591 – Jack G Apr 15 '20 at 19:17
u_atob and u_btoa use functions available in every browser since IE10 (2012), looks solid to me (if you refer to TextEncoder, that's just an example) – Riccardo Galli Apr 18 '20 at 21:30
Exactly what I needed. My base64 encoded UTF-8 strings come from a Python script (`base64.b64encode`) and this makes it work with UTF-8 characters without changing anything on the Python side. Works like a charm! – Elpy Jan 16 '22 at 14:08

score 6 · Answer 6 · answered Oct 04 '18 at 13:02

Here is the 2018 updated solution, as described in the Mozilla Development Resources

TO ENCODE FROM UNICODE TO B64

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded UTF-8,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            return String.fromCharCode('0x' + p1);
    }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

TO DECODE FROM B64 TO UNICODE

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"

if i use b64EncodeUnicode(str) function in Javascript. How to Decode it in PHP? Can you convert function b64DecodeUnicode(str) to PHP function ? — Duc Manh Nguyen, Aug 04 '21 at 09:43

Jack G · Answer 7 · 2019-04-14T20:42:24.963

I would assume that one might want a solution that produces a widely useable base64 URI. Please visit data:text/plain;charset=utf-8;base64,4pi44pi54pi64pi74pi84pi+4pi/ to see a demonstration (copy the data uri, open a new tab, paste the data URI into the address bar, then press enter to go to the page). Despite the fact that this URI is base64-encoded, the browser is still able to recognize the high code points and decode them properly. The minified encoder+decoder is 1058 bytes (+Gzip→589 bytes)

!function(e){"use strict";function h(b){var a=b.charCodeAt(0);if(55296<=a&&56319>=a)if(b=b.charCodeAt(1),b===b&&56320<=b&&57343>=b){if(a=1024*(a-55296)+b-56320+65536,65535<a)return d(240|a>>>18,128|a>>>12&63,128|a>>>6&63,128|a&63)}else return d(239,191,189);return 127>=a?b:2047>=a?d(192|a>>>6,128|a&63):d(224|a>>>12,128|a>>>6&63,128|a&63)}function k(b){var a=b.charCodeAt(0)<<24,f=l(~a),c=0,e=b.length,g="";if(5>f&&e>=f){a=a<<f>>>24+f;for(c=1;c<f;++c)a=a<<6|b.charCodeAt(c)&63;65535>=a?g+=d(a):1114111>=a?(a-=65536,g+=d((a>>10)+55296,(a&1023)+56320)):c=0}for(;c<e;++c)g+="\ufffd";return g}var m=Math.log,n=Math.LN2,l=Math.clz32||function(b){return 31-m(b>>>0)/n|0},d=String.fromCharCode,p=atob,q=btoa;e.btoaUTF8=function(b,a){return q((a?"\u00ef\u00bb\u00bf":"")+b.replace(/[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g,h))};e.atobUTF8=function(b,a){a||"\u00ef\u00bb\u00bf"!==b.substring(0,3)||(b=b.substring(3));return p(b).replace(/[\xc0-\xff][\x80-\xbf]*/g,k)}}(""+void 0==typeof global?""+void 0==typeof self?this:self:global)

Below is the source code used to generate it.

var fromCharCode = String.fromCharCode;
var btoaUTF8 = (function(btoa, replacer){"use strict";
    return function(inputString, BOMit){
        return btoa((BOMit ? "\xEF\xBB\xBF" : "") + inputString.replace(
            /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, replacer
        ));
    }
})(btoa, function(nonAsciiChars){"use strict";
    // make the UTF string into a binary UTF-8 encoded string
    var point = nonAsciiChars.charCodeAt(0);
    if (point >= 0xD800 && point <= 0xDBFF) {
        var nextcode = nonAsciiChars.charCodeAt(1);
        if (nextcode !== nextcode) // NaN because string is 1 code point long
            return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/);
        // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
        if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) {
            point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000;
            if (point > 0xffff)
                return fromCharCode(
                    (0x1e/*0b11110*/<<3) | (point>>>18),
                    (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/),
                    (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
                    (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
                );
        } else return fromCharCode(0xef, 0xbf, 0xbd);
    }
    if (point <= 0x007f) return nonAsciiChars;
    else if (point <= 0x07ff) {
        return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f));
    } else return fromCharCode(
        (0xe/*0b1110*/<<4) | (point>>>12),
        (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
        (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
    );
});
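The bit arithmetic in the replacer above can be easier to follow when split into two small steps: combining a surrogate pair into a code point, and then choosing the UTF-8 byte layout from the size of that code point. The following is a simplified sketch with hypothetical function names, not the author's code; unlike the replacer, it does not handle lone surrogates.

```javascript
// Combine a surrogate pair into a code point (same formula as in the replacer above).
function fromSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

// Encode one code point as UTF-8 bytes; the byte count depends on its magnitude.
// Assumes cp is a valid Unicode scalar value (no lone surrogates).
function utf8FromCodePoint(cp) {
  if (cp <= 0x7f) return [cp];
  if (cp <= 0x7ff) return [0xc0 | (cp >>> 6), 0x80 | (cp & 0x3f)];
  if (cp <= 0xffff) {
    return [0xe0 | (cp >>> 12), 0x80 | ((cp >>> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >>> 18),
    0x80 | ((cp >>> 12) & 0x3f),
    0x80 | ((cp >>> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const smiley = "\u{1F600}"; // stored as the surrogate pair D83D DE00
const combined = fromSurrogates(smiley.charCodeAt(0), smiley.charCodeAt(1));
console.log(combined === smiley.codePointAt(0)); // true
console.log(utf8FromCodePoint(0x2713).map((b) => b.toString(16)).join(" ")); // "e2 9c 93"
```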

Then, to decode the base64 data, either HTTP get the data as a data URI or use the function below.

var clz32 = Math.clz32 || (function(log, LN2){"use strict";
    return function(x) {return 31 - log(x >>> 0) / LN2 | 0};
})(Math.log, Math.LN2);
var fromCharCode = String.fromCharCode;
var atobUTF8 = (function(atob, replacer){"use strict";
    return function(inputString, keepBOM){
        inputString = atob(inputString);
        if (!keepBOM && inputString.substring(0,3) === "\xEF\xBB\xBF")
            inputString = inputString.substring(3); // eradicate UTF-8 BOM
        // 0xc0 => 0b11000000; 0xff => 0b11111111; 0xc0-0xff => 0b11xxxxxx
        // 0x80 => 0b10000000; 0xbf => 0b10111111; 0x80-0xbf => 0b10xxxxxx
        return inputString.replace(/[\xc0-\xff][\x80-\xbf]*/g, replacer);
    }
})(atob, function(encoded){"use strict";
    var codePoint = encoded.charCodeAt(0) << 24;
    var leadingOnes = clz32(~codePoint);
    var endPos = 0, stringLen = encoded.length;
    var result = "";
    if (leadingOnes < 5 && stringLen >= leadingOnes) {
        codePoint = (codePoint<<leadingOnes)>>>(24+leadingOnes);
        for (endPos = 1; endPos < leadingOnes; ++endPos)
            codePoint = (codePoint<<6) | (encoded.charCodeAt(endPos)&0x3f/*0b00111111*/);
        if (codePoint <= 0xFFFF) { // BMP code point
          result += fromCharCode(codePoint);
        } else if (codePoint <= 0x10FFFF) {
          // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          codePoint -= 0x10000;
          result += fromCharCode(
            (codePoint >> 10) + 0xD800,  // highSurrogate
            (codePoint & 0x3ff) + 0xDC00 // lowSurrogate
          );
        } else endPos = 0; // to fill it in with INVALIDs
    }
    for (; endPos < stringLen; ++endPos) result += "\ufffd"; // replacement character
    return result;
});

The advantage of following the standard is that this encoder and decoder are more widely applicable: their output can be placed in a valid data URL that displays correctly. Observe.

(function(window){
    "use strict";
    var sourceEle = document.getElementById("source");
    var urlBarEle = document.getElementById("urlBar");
    var mainFrameEle = document.getElementById("mainframe");
    var gotoButton = document.getElementById("gotoButton");
    var parseInt = window.parseInt;
    var fromCodePoint = String.fromCodePoint;
    var parse = JSON.parse;
    
    function unescape(str){
        return str.replace(/\\u[\da-f]{0,4}|\\x[\da-f]{0,2}|\\u{[^}]*}|\\[bfnrtv"'\\]|\\0[0-7]{1,3}|\\\d{1,3}/g, function(match){
          try{
            if (match.startsWith("\\u{"))
              return fromCodePoint(parseInt(match.slice(2,-1),16));
            if (match.startsWith("\\u") || match.startsWith("\\x"))
              return fromCodePoint(parseInt(match.substring(2),16));
            if (match.startsWith("\\0") && match.length > 2)
              return fromCodePoint(parseInt(match.substring(2),8));
            if (/^\\\d/.test(match)) return fromCodePoint(+match.slice(1));
          }catch(e){return "\ufffd".repeat(match.length)}
          return parse('"' + match + '"');
        });
    }
    
    function whenChange(){
      try{ urlBarEle.value = "data:text/plain;charset=UTF-8;base64," + btoaUTF8(unescape(sourceEle.value), true);
      } finally{ gotoURL(); }
    }
    sourceEle.addEventListener("change",whenChange,{passive:1});
    sourceEle.addEventListener("input",whenChange,{passive:1});
    
    // IFrame Setup:
    function gotoURL(){mainFrameEle.src = urlBarEle.value}
    gotoButton.addEventListener("click", gotoURL, {passive: 1});
    function urlChanged(){urlBarEle.value = mainFrameEle.src}
    mainFrameEle.addEventListener("load", urlChanged, {passive: 1});
    urlBarEle.addEventListener("keypress", function(evt){
      if (evt.key === "Enter") evt.preventDefault(), urlChanged();
    }, {passive: 1});
    
        
    var fromCharCode = String.fromCharCode;
    var btoaUTF8 = (function(btoa, replacer){
      "use strict";
        return function(inputString, BOMit){
         return btoa((BOMit?"\xEF\xBB\xBF":"") + inputString.replace(
          /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, replacer
      ));
     }
    })(btoa, function(nonAsciiChars){
  "use strict";
     // make the UTF string into a binary UTF-8 encoded string
     var point = nonAsciiChars.charCodeAt(0);
     if (point >= 0xD800 && point <= 0xDBFF) {
      var nextcode = nonAsciiChars.charCodeAt(1);
      if (nextcode !== nextcode) { // NaN because string is 1code point long
       return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/);
      }
      // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
      if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) {
       point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000;
       if (point > 0xffff) {
        return fromCharCode(
         (0x1e/*0b11110*/<<3) | (point>>>18),
         (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/),
         (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
         (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
        );
       }
      } else {
       return fromCharCode(0xef, 0xbf, 0xbd);
      }
     }
     if (point <= 0x007f) { return nonAsciiChars; }
     else if (point <= 0x07ff) {
      return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f/*00111111*/));
     } else {
      return fromCharCode(
       (0xe/*0b1110*/<<4) | (point>>>12),
       (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
       (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
      );
     }
    });
    setTimeout(whenChange, 0);
})(window);

img:active{opacity:0.8}

<center>
<textarea id="source" style="width:66.7vw">Hello \u1234 W\186\0256ld!
Enter text into the top box. Then the URL will update automatically.
</textarea><br />
<div style="width:66.7vw;display:inline-block;height:calc(25vw + 1em + 6px);border:2px solid;text-align:left;line-height:1em">
<input id="urlBar" style="width:calc(100% - 1em - 13px)" /><img id="gotoButton" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABsAAAAeCAMAAADqx5XUAAAAclBMVEX///9NczZ8e32ko6fDxsU/fBoSQgdFtwA5pAHVxt+7vLzq5ex23y4SXABLiiTm0+/c2N6DhoQ6WSxSyweVlZVvdG/Uz9aF5kYlbwElkwAggACxs7Jl3hX07/cQbQCar5SU9lRntEWGum+C9zIDHwCGnH5IvZAOAAABmUlEQVQoz7WS25acIBBFkRLkIgKKtOCttbv//xdDmTGZzHv2S63ltuBQQP4rdRiRUP8UK4wh6nVddQwj/NtDQTvac8577zTQb72zj65/876qqt7wykU6/1U6vFEgjE1mt/5LRqrpu7oVsn0sjZejMfxR3W/yLikqAFcUx93YxLmZGOtElmEu6Ufd9xV3ZDTGcEvGLbMk0mHHlUSvS5svCwS+hVL8loQQyfpI1Ay8RF/xlNxcsTchGjGDIuBG3Ik7TMyNxn8m0TSnBAK6Z8UZfp3IbAonmJvmsEACum6aNv7B0CnvpezDcNhw9XWsuAr7qnRg6dABmeM4dTgn/DZdXWs3LMspZ1KDMt1kcPJ6S1icWNp2qaEmjq6myx7jbQK3VKItLJaW5FR+cuYlRhYNKzGa9vF4vM5roLW3OSVjkmiGJrPhUq301/16pVKZRGFYWjTP50spTxBN5Z4EKnSonruk+n4tUokv1aJSEl/MLZU90S3L6/U6o0J142iQVp3HcZxKSo8LfkNRCtJaKYFSRX7iaoAAUDty8wvWYR6HJEepdwAAAABJRU5ErkJggg==" style="width:calc(1em + 4px);line-height:1em;vertical-align:-40%;cursor:pointer" />
<iframe id="mainframe" style="width:66.7vw;height:25vw" frameBorder="0"></iframe>
</div>
</center>

In addition to being very standardized, the above code snippets are also very fast. Instead of an indirect chain in which the data is converted several times between various forms (as in Riccardo Galli's response), the above code is about as direct as it can be. It uses a single fast String.prototype.replace call to process the data when encoding, and a single one when decoding. Another plus is that, especially for big strings, String.prototype.replace lets the browser handle the underlying memory management of resizing the string, leading to a significant performance boost, especially in evergreen browsers like Chrome and Firefox that heavily optimize String.prototype.replace. Finally, the icing on the cake is that for you Latin script exclūsīvō users, strings that contain no code points above 0x7f are extra fast to process because the replacement algorithm leaves them unmodified.

I have created a GitHub repository for this solution at https://github.com/anonyco/BestBase64EncoderDecoder/

Can you elaborate on what you mean by "user-created way" vs. "interpretable by the browser"? What is the value-add of using this solution over, say, what Mozilla recommends? — brandonscript, Nov 22 '18 at 16:59
@brandonscript Mozilla is different from MDN. MDN is user-created content. The page on MDN that recommends your solution was user-created content, not browser vendor created content. — Jack G, Nov 22 '18 at 18:12
Is your solution vendor created? If so, I’d suggest giving credit to the origin. If not, then it is also user-created, and no different than MDN’s answer? — brandonscript, Nov 22 '18 at 18:54
@brandonscript Good point. You are correct. I removed that piece of text. Also, check out the demo I added. — Jack G, Nov 22 '18 at 20:38

score 2 · Answer 8 · edited Jan 16 '23 at 10:05

This is my one-liner solution combining Jackie Han's answer and some code from another question:

const decoded_text = new TextDecoder().decode(Uint8Array.from(window.atob(base64_encoded_text).split("").map(x => x.charCodeAt(0))));
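
The one-liner above only covers decoding. A matching encoder can be sketched the same way with TextEncoder. This is a minimal sketch, not part of the original answer: the helper names `utf8ToBase64` and `base64ToUtf8` are made up for illustration, and it assumes `btoa`, `atob`, `TextEncoder` and `TextDecoder` are available as globals (browsers and recent Node.js releases).

```javascript
// Hypothetical helper names, chosen for this sketch only.
function utf8ToBase64(text) {
    // Encode the native string to UTF-8 bytes.
    const bytes = new TextEncoder().encode(text);
    // Build a binary string: one code unit per byte, as btoa() expects.
    let binary = '';
    for (const byte of bytes) {
        binary += String.fromCharCode(byte);
    }
    return btoa(binary);
}

function base64ToUtf8(base64) {
    // atob() returns a binary string; copy each code unit into a real byte.
    const bytes = Uint8Array.from(atob(base64), ch => ch.charCodeAt(0));
    // Interpret those bytes as UTF-8.
    return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64('™'));    // 4oSi
console.log(base64ToUtf8('4oSi')); // ™
```

Building the binary string in a loop avoids passing a very large array to String.fromCharCode in a single call, which can exceed argument limits for big inputs.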

edited Jan 16 '23 at 10:05

swen

answered Nov 15 '22 at 16:43

Stephan Richter

score 1 · Answer 9 · answered Oct 31 '22 at 20:35

If you are decoding a Base64 representation of UTF-8 encoded data in Node.js, you can use the native Buffer class:

Buffer.from("4pyTIMOgIGxhIG1vZGU=", "base64").toString(); // '✓ à la mode'

The toString method of Buffer defaults to utf8, but you can specify any desired encoding. For example, the reverse operation would look like this:

Buffer.from('✓ à la mode', "utf8").toString("base64"); // "4pyTIMOgIGxhIG1vZGU="

answered Oct 31 '22 at 20:35

jbmilgrom

Buffer is a part of Node.js. Question is about JavaScript in general – Mikhail Yevchenko Mar 15 '23 at 08:14

score 1 · Answer 10 · answered Mar 25 '23 at 07:49

2023: There is still no built-in support in browsers for encoding and decoding base64 to UTF-8.

Unless you are really into reinventing the wheel and testing edge cases, for both browsers and Node use https://github.com/dankogai/js-base64.

answered Mar 25 '23 at 07:49

IvanD

Adrian Lopez · Answer 11 · 2023-06-19T22:07:18.277

The Binary String Concept

A problem with the functions btoa() and atob() is that they both operate on string values but the contents of these strings are different from what strings are normally expected to contain. Strings received by btoa(), for instance, are expected to be formatted as binary strings, which are array-like sequences in which each 16-bit character represents an 8-bit value. Every element in the string is expected to contain a value between 0 - 255, and character values outside that range are considered invalid. Values returned by atob() are formatted the same way. It would make more sense if these functions worked with byte arrays instead, but they both use strings.

Unicode strings in Javascript, by contrast, are stored as a series of UTF-16 code units where each code unit has a value between 0 - 65,535. Passing a Unicode string to btoa() will work correctly if the characters contained in the string all lie in the Latin1 range (0 - 255), but the call will fail otherwise. Its counterpart atob(), on the other hand, will take a Base64 formatted string and return a binary string without any regard to whether the contents represent a Latin1 string, a UTF-8 string, a UTF-16 string, or arbitrary binary data. This is by design.
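
The two cases described above can be tried directly. This is a small illustrative sketch (the variable names are mine) that assumes `btoa` is available as a global, as it is in browsers and recent Node.js releases:

```javascript
// Every code unit here is <= 0xFF, so btoa() accepts it as a binary string.
// Note that é (U+00E9) is encoded as the single Latin1 byte 0xE9, not as UTF-8.
const latin1Result = btoa('caf\u00E9');
console.log(latin1Result); // Y2Fm6Q==

// U+2122 (™) is a single code unit, but its value is above 0xFF, so btoa() throws.
let outOfRangeError = null;
try {
    btoa('\u2122');
} catch (err) {
    outOfRangeError = err;
}
console.log(outOfRangeError !== null); // true
```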

Applying this to the specific example presented in the question, consider the UTF-8 and UTF-16 representations of the Unicode "Trade Mark Sign" character, ™. That character's UTF-8 representation is 0xE2 0x84 0xA2. The Base64 representation of this sequence is '4oSi'. Feeding '4oSi' to atob() will return a string consisting of three 16-bit values each representing one byte: 0x00E2, 0x0084, and 0x00A2. Interpreted as a binary string these values represent the UTF-8 sequence 0xE2, 0x84, 0xA2 (the original ™ character, as expected). Interpreted as an ordinary UTF-16 string, however, the sequence represents the string 'â\x84¢', which is what you're getting.
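
The walk-through above can be reproduced in a few lines. This is a sketch under the assumption that `atob` and `TextDecoder` exist as globals; the variable names are illustrative only:

```javascript
// '4oSi' is the Base64 form of the UTF-8 bytes E2 84 A2 (the ™ character).
const binary = atob('4oSi');

// Each code unit of the binary string carries exactly one byte.
const codeUnits = Array.from(binary, ch => ch.charCodeAt(0).toString(16));
console.log(codeUnits); // [ 'e2', '84', 'a2' ]

// Read as an ordinary string, those code units are the mojibake from the question.
console.log(binary === '\u00E2\u0084\u00A2'); // true

// Copy the code units into real bytes and decode them as UTF-8 instead.
const bytes = Uint8Array.from(binary, ch => ch.charCodeAt(0));
const text = new TextDecoder('utf-8').decode(bytes);
console.log(text); // ™
```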

Encoding and Decoding Native Strings

Binary Encoding

Before we can convert a Unicode string to Base64 we need to decide on a binary encoding for that string. This can be UTF-8, UTF-16, or any other encoding that's able to represent the original string. We can write some functions to convert from native strings to binary strings for particular encodings:

Native String to UTF-8

function encodeAsUTF8(str) {
    const encoder = new TextEncoder();

    const utf8 = encoder.encode(str);

    var binaryString = '';
    for (let b = 0; b < utf8.length; ++b) {
        binaryString += String.fromCharCode(utf8[b]);
    }

    return binaryString;
}

Native String to UTF-16

function encodeAsUTF16(str) {
    var utf16 = new Uint16Array(str.length);

    for (let p = 0; p < utf16.length; ++p) {
        utf16[p] = str.charCodeAt(p);
    }

    const bytes = new Uint8Array(utf16.buffer);

    var binaryString = '';
    for (let b = 0; b < bytes.length; ++b) {
        binaryString += String.fromCharCode(bytes[b]);
    }

    return binaryString;        
}

Other encodings are possible, but the two above should suffice to illustrate the concept.

Decoding

Converting from a binary encoding to a native string requires knowing the source encoding so the binary values are correctly interpreted. Taking UTF-8 and UTF-16 as examples again, we can write functions to convert from UTF-8 and UTF-16 binary strings to native strings:

UTF-8 to Native String

function decodeUTF8(binary) {
    const bytes = new Uint8Array(binary.length);
    for (let b = 0; b < bytes.length; ++b) {
        bytes[b] = binary.charCodeAt(b);
    }

    const decoder = new TextDecoder('utf-8');

    return decoder.decode(bytes);
}

UTF-16 to Native String

function decodeUTF16(binary) {
    const utf16 = new Uint8Array(binary.length);
    for (let b = 0; b < utf16.length; ++b) {
        utf16[b] = binary.charCodeAt(b);
    }

    const decoder = new TextDecoder('utf-16');

    return decoder.decode(utf16);
}

Native String to Base64

With the various string encoding functions in place we can encode a string as UTF-8 and convert this in turn to Base64 by calling:

base64string = btoa(encodeAsUTF8('™'));

We can also encode a string as UTF-16 and convert this to Base64 by calling:

base64string = btoa(encodeAsUTF16('™'));

Base64 to Native String

To convert a UTF-8 encoded string from Base64 to a native string, call:

decodeUTF8(atob(base64string));

To convert a UTF-16 encoded string from Base64 to a native string, call:

decodeUTF16(atob(base64string));
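
To check that both paths round-trip, the pieces can be put together as below. This is a self-contained sketch rather than the answer's exact code: the UTF-8 helpers are condensed restatements of the functions above, while `encodeAsUTF16LE` and `decodeUTF16LE` are hypothetical variants that fix the byte order to little-endian with a DataView. The original `encodeAsUTF16` uses the platform's byte order, which is little-endian on most hardware, and the `'utf-16'` label of TextDecoder also denotes little-endian.

```javascript
// Condensed restatements of the UTF-8 helpers, so this block runs on its own.
function encodeAsUTF8(str) {
    return Array.from(new TextEncoder().encode(str), b => String.fromCharCode(b)).join('');
}
function decodeUTF8(binary) {
    return new TextDecoder('utf-8').decode(Uint8Array.from(binary, ch => ch.charCodeAt(0)));
}

// Hypothetical little-endian variants of the UTF-16 helpers.
function encodeAsUTF16LE(str) {
    const view = new DataView(new ArrayBuffer(str.length * 2));
    for (let i = 0; i < str.length; ++i) {
        // The final argument "true" selects little-endian byte order.
        view.setUint16(i * 2, str.charCodeAt(i), true);
    }
    return Array.from(new Uint8Array(view.buffer), b => String.fromCharCode(b)).join('');
}
function decodeUTF16LE(binary) {
    return new TextDecoder('utf-16le').decode(Uint8Array.from(binary, ch => ch.charCodeAt(0)));
}

const original = '™ ✓ à la mode';
const viaUTF8 = btoa(encodeAsUTF8(original));
const viaUTF16 = btoa(encodeAsUTF16LE(original));

console.log(decodeUTF8(atob(viaUTF8)) === original);     // true
console.log(decodeUTF16LE(atob(viaUTF16)) === original); // true
console.log(viaUTF8 === viaUTF16);                       // false
```

The last line shows why the decoder has to know the source encoding: the same text yields different Base64 output depending on the binary encoding chosen.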

score 0 · Answer 12 · answered Jan 18 '17 at 06:46

Here's some future-proof code for browsers that may lack escape/unescape(). Note that IE 9 and older don't support atob/btoa(), so you'd need to use custom base64 functions for them.

// Polyfill for escape/unescape (only handles code units 0x00-0xFF, which is all the helpers below need)
if( !window.unescape ){
    window.unescape = function( s ){
        return s.replace( /%([0-9A-F]{2})/g, function( m, p ) {
            return String.fromCharCode( '0x' + p );
        } );
    };
}
if( !window.escape ){
    window.escape = function( s ){
        var chr, hex, i = 0, l = s.length, out = '';
        for( ; i < l; i ++ ){
            chr = s.charAt( i );
            if( chr.search( /[A-Za-z0-9\@\*\_\+\-\.\/]/ ) > -1 ){
                out += chr; continue; }
            hex = s.charCodeAt( i ).toString( 16 );
            out += '%' + ( hex.length % 2 != 0 ? '0' : '' ) + hex;
        }
        return out;
    };
}

// Base64 encoding of UTF-8 strings
var utf8ToB64 = function( s ){
    return btoa( unescape( encodeURIComponent( s ) ) );
};
var b64ToUtf8 = function( s ){
    return decodeURIComponent( escape( atob( s ) ) );
};

A more comprehensive example for UTF-8 encoding and decoding can be found here: http://jsfiddle.net/47zwb41o/

score -1 · Answer 13 · answered Dec 08 '15 at 11:32

Small correction, unescape and escape are deprecated, so:

function utf8_to_b64( str ) {
    return window.btoa(decodeURIComponent(encodeURIComponent(str)));
}

function b64_to_utf8( str ) {
     return decodeURIComponent(encodeURIComponent(window.atob(str)));
}


function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');    
    return decodeURIComponent(encodeURIComponent(window.atob(str)));
}

answered Dec 08 '15 at 11:32

Darkves

Looks like the doc link is even different from this now, suggesting a regex solution to manage it. – brandonscript Dec 09 '15 at 04:19
This will not work, because `encodeURIComponent` is the inverse of `decodeURIComponent`, i.e. it will just undo the conversion. See http://stackoverflow.com/a/31412163/1534459 for a great explanation of what is happening with `escape` and `unescape`. – bodo Feb 01 '16 at 14:50
@canaaerus I don't understand your comment? escape and unescape are deprecated, I just swap them with [decode|encode]URIComponent function :-) Everything is work just fine. Read the question first – Darkves Feb 01 '16 at 17:21
@Darkves: The reason why `encodeURIComponent` is used, is to correctly handle (the whole range of) unicode strings. So e.g. `window.btoa(decodeURIComponent(encodeURIComponent('€')))` gives `Error: String contains an invalid character` because it’s the same as `window.btoa('€')` and `btoa` can not encode `€`. – bodo Feb 02 '16 at 13:47
No point in arguing this: http://codepen.io/anon/pen/NxmRmj gives "Uncaught InvalidCharacterError: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range." – Tedd Hansen Feb 17 '16 at 06:22
@Darkves: Yes, that's correct. But you can't swap escape with EncodeURIComponent and unescape with DecodeURIComponent, because the Encode and the escape methods don't do the same thing. Same with decode&unescape. I originally made the same mistake, btw. You should notice that if you take a string, UriEncode it, then UriDecode it, you get the same string back that you inputted. So doing that would be nonsense. When you unescape a string encoded with encodeURIComponent, you don't get the same string back that you inputted, so that's why with escape/unescape it works, but not with yours. – Stefan Steiger Jul 19 '16 at 18:55
@Stefan Steiger look at Tedd Hansen comment. I made a mistake, and I'm sry. HF commenting around :) – Darkves Jul 28 '16 at 12:55
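
The point made in the comments above can be checked with a short script. This is an illustrative sketch (variable names are mine), assuming the deprecated `escape` and `unescape` globals are still present, as they are in current browsers and Node.js:

```javascript
const input = '€'; // U+20AC, outside the Latin1 range

// encodeURIComponent followed by decodeURIComponent returns the input unchanged,
// so btoa() still receives a character it cannot encode.
const roundTripped = decodeURIComponent(encodeURIComponent(input));
console.log(roundTripped === input); // true

// unescape() instead turns each %XX into one code unit, yielding a binary string.
const binaryString = unescape(encodeURIComponent(input));
console.log(Array.from(binaryString, ch => ch.charCodeAt(0).toString(16))); // [ 'e2', '82', 'ac' ]
console.log(btoa(binaryString)); // 4oKs

// In the other direction, escape() rebuilds the %XX form that decodeURIComponent reads as UTF-8.
const decoded = decodeURIComponent(escape(atob('4oKs')));
console.log(decoded === input); // true
```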

score -1 · Answer 14 · answered Aug 08 '18 at 09:47

In addition to the solutions above, if you are still facing issues, try the following. It covers the case where escape is not supported in TypeScript.

const blob = new Blob(["\ufeff", csv_content]); // the leading BOM helps Excel display the symbols correctly

For csv_content, you can try something like this:

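function b64DecodeUnicode(str: string): string {
    // Percent-encode every byte returned by atob() so decodeURIComponent can read the bytes as UTF-8.
    return decodeURIComponent(atob(str).split('').map((c: string) => {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}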