
I am trying to understand memory used by Strings & Arrays. As per this helpful question: How many bytes in a JavaScript string?

Blob is a great way of checking byte size of Strings: new Blob(['a']).size -> 1 byte

But JavaScript strings are UTF-16 encoded, which uses a minimum of 2 bytes per character. How can Blob return 1?

Furthermore:

const x = 200;
const y = 200;

const changes = []

for (let i=0;i<y;i++) {
    let subArr = []
    for (let j=0;j<x;j++) {
        subArr[j]= new Uint8Array(1)
    }
    changes[i]=subArr
}
console.log(new Blob(changes).size)

The array above consumes 79800 bytes instead of the 40000 I expect (200*200 instances of Uint8Array(1)).

  1. Why does it consume roughly double what I expect? Also, why is the first index (0) interpreted as 1 byte while the following ones are 2 bytes?


for (let i=0;i<y;i++) {
   changes[i] = new Array(x).fill(new Uint8Array(1))
}

  2. If I fill the array as shown above, it still consumes 79800 bytes. Why is that? As pointed out in the comments, it is the same Uint8Array object that gets filled x times.
Vishal
  • `Blob` stores strings in UTF-8, as demonstrated by `for await (const chunk of new Blob(['aß']).stream()) console.log(chunk);` which logs `Uint8Array(3) [ 97, 195, 159 ]`. – Heiko Theißen Aug 19 '23 at 09:11
  • The title has a different question than the body of your question. Blob and String are two different things. – trincot Aug 19 '23 at 12:00
  • @trincot Added clarity. – Vishal Aug 19 '23 at 12:24
  • But the body of your question is all about why you get certain outputs when creating blobs... The title is about S(t)rings and Arrays, not about Blobs. Also, you should ask 1 question only, not 4. – trincot Aug 19 '23 at 12:30
  • Note that `new Array(x).fill(new Uint8Array(1))` will create one single `Uint8Array` and fill a reference to that one-and-the-same array into all `x` elements of the new array. It doesn't create `x` `Uint8Array`s. – CherryDT Aug 19 '23 at 12:48
  • @CherryDT Good looking out, I changed that but it makes it even more interesting why it consumed 79800 bytes with 1 obj reference. How can I learn more about this issue? – Vishal Aug 19 '23 at 12:55
  • @Vishal see my updated answer for an explanation on the 79800 vs 40000 bytes issue. – Sergiu Paraschiv Aug 19 '23 at 15:34

1 Answer


Blob encodes strings as UTF-8. The minimum size of a UTF-8 character is 1 byte, and 'a' fits in a single byte. A two-byte UTF-8 character ('Ђ', for example) returns 2, and something longer still, such as a typical emoji, returns 4.
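To make the byte counts concrete, here is a minimal sketch that compares what Blob reports with a nominal UTF-16 size. The sample strings (including the emoji written as the escape '\u{1F600}') and the `report` variable are my own choices for illustration. The "2 bytes per code unit" figure is the nominal UTF-16 size only; engines may store strings more compactly internally.

```javascript
// Compare UTF-8 byte counts (what Blob stores) with the nominal UTF-16 size
// (2 bytes per code unit, where .length counts code units).
const samples = ['a', 'ß', 'Ђ', '€', '\u{1F600}']; // illustrative samples, not from the question
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

const report = samples.map((s) => ({
  text: s,
  blobBytes: new Blob([s]).size,        // bytes stored in the Blob
  utf8Bytes: encoder.encode(s).length,  // same number, computed without a Blob
  utf16Bytes: s.length * 2,             // nominal UTF-16 size
}));

console.table(report);
// blobBytes per sample: 1, 2, 2, 3, 4
// utf16Bytes per sample: 2, 2, 2, 2, 4
```

The Blob size always matches the TextEncoder result, which is why `new Blob(['a']).size` is 1 even though the string itself is UTF-16 in JavaScript.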

Regarding the 79800 vs 40000 bytes example: you are not building an array of 40000 bytes and passing it to Blob. You are building an array of arrays of bytes. The "leaf" nodes of these arrays of arrays are indeed 40000 bytes, but that's not what you use to build the Blob...

The documentation is a bit vague, but helpful after you do some experimenting.

"The content of the blob consists of the concatenation of the values given in the parameter array."

The concatenation of the values. What does that mean? "Concatenation" can describe an operation on arrays, but the term is mostly used to mean "join two strings". Well, let's do some experimenting:

await new Blob(['a']).text() resolves with 'a'. await new Blob([new Uint8Array(1)]).text() resolves with '\x00'. await new Blob([[new Uint8Array(1)]]).text(), which is closer to your example, resolves with '0'. Huh... that makes perfect sense, since new Uint8Array(1).toString() is '0', and so is [new Uint8Array(1)].toString().

await new Blob([[new Uint8Array(1),new Uint8Array(1)]]).text() resolves to '0,0', which also makes sense because [new Uint8Array(1),new Uint8Array(1)].toString() is '0,0' too.

This last one is basically the explanation. A Blob part that is an ArrayBuffer, a typed array (or DataView), or another Blob is used as raw bytes; anything else, including a plain array, is automatically converted to a string "for you".

And arrays turned to strings take up more than just the string representations of their elements because we also get commas between them.

Going back to your example again, you are passing 200 arrays of 200 Uint8Array(1) instances to Blob. Each one of the "inner" arrays is turned to a String, meaning it's 200 '0' characters plus 199 ',' characters. And (200 + 199) * 200 is, you guessed it, 79800!
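The arithmetic above can be checked directly. This is a minimal sketch that reuses `x` and `y` from the question; the names `row`, `nested`, and `flat` are mine. It reuses one row array for every outer index, which does not change the size, because each outer element is stringified independently. That is also why the `fill` variant in the question produces the same number.

```javascript
const x = 200;
const y = 200;

// One inner array of 200 single-byte typed arrays.
const row = Array.from({ length: x }, () => new Uint8Array(1));

// A plain array is not binary data, so Blob converts it with String(row):
// 200 '0' characters joined by 199 commas.
const rowAsString = String(row);
console.log(rowAsString.length); // 399

// 200 such rows, each stringified separately: 399 * 200 bytes.
const nested = Array.from({ length: y }, () => row);
console.log(new Blob(nested).size); // 79800

// Flattening hands the typed arrays to Blob directly, so they stay raw bytes.
const flat = nested.flat();
console.log(new Blob(flat).size); // 40000

// A single typed array of the same total length gives the same size.
console.log(new Blob([new Uint8Array(x * y)]).size); // 40000
```

This also covers the "first index is 1 byte, the rest are 2 bytes" observation: within each row, the first element contributes just '0', while every later element contributes ',0'.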

The main lesson here is: anything you pass to Blob that is not binary data or another Blob is "stringified" first.

Sergiu Paraschiv
  • Should I be storing all strings in Blobs if my goal is to reduce memory consumption? Or is there even a better way than Blob? – Vishal Aug 19 '23 at 09:28
  • @Vishal that would make for very cumbersome code. How much text do you really need in memory at any given time? What is your application and why the concern? – Pointy Aug 19 '23 at 09:31
  • Do you only need to pass around this data? Will you ever need to do conversions? – Sergiu Paraschiv Aug 19 '23 at 09:32
  • @Pointy small hardware with only a few MB of memory – Vishal Aug 19 '23 at 09:33
  • @SergiuParaschiv Yes the data will be passed back and forth between client, servers & DB. No conversions and no chars over 1 byte. – Vishal Aug 19 '23 at 09:34
  • Well it's a trade-off. Whenever you want to do normal string operations, you're going to be converting from the UTF-8 Blob content into ordinary strings and then back again. – Pointy Aug 19 '23 at 09:38
  • I would fetch the data from a server, UTF-8 encoded, maybe store it as a Uint8Array, and then use something like `TextDecoder` to turn it into Strings – Sergiu Paraschiv Aug 19 '23 at 09:38
  • But you would still need to know the character boundaries in your byte arrays. Say you have some algorithm that splits a piece of text in two, you'd have to take special care not to split somewhere in the middle of a multiple-bytes character. – Sergiu Paraschiv Aug 19 '23 at 09:40
  • Just read your last comment: no chars over 1 byte makes things a lot easier. – Sergiu Paraschiv Aug 19 '23 at 09:41
  • @SergiuParaschiv Added a few more questions. Would appreciate your help. – Vishal Aug 19 '23 at 12:25
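
The approach suggested in the comments above (keeping text as UTF-8 bytes in a Uint8Array and decoding it only when a string is needed) could be sketched roughly as follows. The sample text and variable names are hypothetical, and the sketch assumes, as stated in the comments, that the data contains only single-byte characters; with multi-byte characters, slicing at arbitrary byte offsets could split a character.

```javascript
const encoder = new TextEncoder();          // encodes strings to UTF-8 bytes
const decoder = new TextDecoder('utf-8');   // decodes UTF-8 bytes back to strings

// Hypothetical payload; in practice the bytes might come from a server response.
const stored = encoder.encode('hello world');
console.log(stored.byteLength); // 11, one byte per ASCII character

// Decode the whole buffer only when a string is actually needed.
const text = decoder.decode(stored);
console.log(text); // 'hello world'

// With single-byte characters only, any byte offset is a character boundary,
// so a view over part of the buffer can be decoded safely.
const firstWord = decoder.decode(stored.subarray(0, 5));
console.log(firstWord); // 'hello'
```

Note that `subarray` creates a view over the same underlying buffer rather than a copy, which matters when memory is tight.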