
I'm using a modified version of this code (update: that answer has since been updated to use correct code, but this question still carries value since it contains relevant test cases and discussion of this problem) to store a single object, after stringification, in chunked keys inside sync storage. Note that sync storage has a maximum quota size per item, hence the maxBytesPerItem and maxValueBytes variables.

function lengthInUtf8Bytes(str) {
    // by: https://stackoverflow.com/a/5515960/2675672
    // Matches only the continuation bytes (binary 10xxxxxx) that appear as non-initial bytes of a multi-byte UTF-8 sequence.
    var m = encodeURIComponent(str).match(/%[89ABab]/g);
    return str.length + (m ? m.length : 0);
}
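
A quick sanity check of that helper (the example values here are mine, not from the original answer):

// Continuation bytes of a multi-byte UTF-8 character encode as %8x-%Bx,
// so each regex match adds one byte on top of str.length.
console.log(lengthInUtf8Bytes("abc")); // 3 (pure ASCII, no matches)
console.log(lengthInUtf8Bytes("é"));   // 2 (encodes as "%C3%A9", one match)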

function syncStore(key, objectToStore, callback) {
    var jsonstr = JSON.stringify(objectToStore), i = 0, storageObj = {},
        // (note: QUOTA_BYTES_PER_ITEM exists only for sync storage)
        // subtract NUMBER as a safety margin: two bytes for the quotes
        // added by stringification, plus extra to err on the safe side
        maxBytesPerItem = chrome.storage.sync.QUOTA_BYTES_PER_ITEM - NUMBER,
        // since the key uses up some per-item quota, use
        // "maxValueBytes" to see how much is left for the value
        maxValueBytes, index, segment, counter; 

    console.log("jsonstr length is " + lengthInUtf8Bytes(jsonstr));

    // split jsonstr into chunks and store them in an object indexed by `key_i`
    while(jsonstr.length > 0) {
        index = key + "_" + i++;
        maxValueBytes = maxBytesPerItem - lengthInUtf8Bytes(index);

        counter = maxValueBytes;
        segment = jsonstr.substr(0, counter);           
        while(lengthInUtf8Bytes(segment) > maxValueBytes)
            segment = jsonstr.substr(0, --counter);

        storageObj[index] = segment;
        jsonstr = jsonstr.substr(counter);
    }
    // later used by retriever function
    storageObj[key] = i;
    console.log((i + 1) + " keys used (= key + key_i)");
    // Example: in case I the user saves up to chunk 20; in case II the user
    // deletes several snippets, bringing the total number of "required"
    // chunks down to 15. Chunks 16-20 would remain in storage unless
    // "clear"ed, so clear everything before writing the new set.
    chrome.storage.sync.clear(function(){                       
        console.log(storageObj);
        console.log(chrome.storage.sync);
        chrome.storage.sync.set(storageObj, callback);          
    });
}
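
For reference, the "retriever function" mentioned in the comment above could look something like this (a minimal sketch under my own naming; it is not part of the question's code):

function syncRetrieve(key, callback) {
    chrome.storage.sync.get(null, function(data) {
        // data[key] holds the chunk count written by syncStore
        var count = data[key], jsonstr = "", i;
        for (i = 0; i < count; i++)
            jsonstr += data[key + "_" + i];
        callback(JSON.parse(jsonstr));
    });
}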

The problem is in this line:

maxBytesPerItem = chrome.storage.sync.QUOTA_BYTES_PER_ITEM - NUMBER,

The problem is that 5 is the minimum NUMBER for which there's no error. Here's the sample code you can use to test my theory:

var len = 102000,
    string = [...new Array(len)].map(x => 1).join(""),
    Data = {
        "my_text": string
    },
    key = "key";

syncStore(key, Data, function(){
    console.log(chrome.runtime.lastError && chrome.runtime.lastError.message);
});

Using 4 yields a QUOTA_BYTES_PER_ITEM quota exceeded error. You can adjust the value of len yourself (to 20000, 60000, or other values below 102000) to check my theory.

Question:

Why does the current method require exactly 5 as the minimum value? I know two bytes go to the quotes added by stringification, but where do the other 3 characters come from?

Additionally, I've noticed that with textual data like this one, even 5 does not work; in that specific case, the minimum NUMBER required is 6.

Clarification:

The point of my question is not what other means exist to store data in sync storage.

The point of my question is why the current method requires exactly 5 (and why that textual data requires 6). Imho, my question is very specific and surely does not deserve a close vote.

Update: I've added new code that sizes the chunks by their length in UTF-8 bytes, but it still does not produce the desired results. I've also added code to make it easier to test my theory.

  • Try compressing the data with [LZ-string](https://github.com/pieroxy/lz-string). – wOxxOm Jan 23 '17 at 12:17
  • @wOxxOm Hmmm. I checked it out. It seems like it's about compressing very large strings. My problem, however, is that my function above, which stores stringified data in separate items since the sync storage has a max quota per item, is giving an error erratically. – Gaurang Tandon Jan 23 '17 at 14:19
  • I think that what @wOxxOm was, more or less, trying to get at is that you should `JSON.stringify()` your data, then compress the JSON'ed data using LZ-string, verify that the length is small enough (re-chunk if not), then store it in `storage.sync`; or, better, compress it, then chunk it (chunking the zipped data) and store. Another alternative is to just set your chunk length small enough that you (almost) never have a problem: just use more chunks. – Makyen Jan 23 '17 at 18:38
  • @Makyen I agree with your comment. However, the point of my question is not what are the _other means_ to store data in sync. The problem is why is **this method _not_** working. – Gaurang Tandon Jan 24 '17 at 09:29
  • I would imagine it's because you're not counting the number of bytes in each string: a string's length isn't equivalent to the number of bytes it takes to store. If your Unicode string uses a different number of bytes per character, or even stores null characters and the like, that might well explain your problem. It would also explain why it works sometimes and not others, even for strings with the same number of characters. – Daniel Lane Jan 31 '17 at 09:48
  • @wOxxOm I added bounty + easier method to run my question's problem. Please have a look. Thanks! – Gaurang Tandon Jan 31 '17 at 11:41
  • I removed my answer because it doesn't seem to address whatever the cause of your problem is. – Daniel Lane Jan 31 '17 at 13:20
  • While interesting from an esoteric point of view, why do you care? You are already automatically chopping your data into chunks. There is a max of 102,400 total bytes in `storage.sync`. If perfectly filled, you have to use 13 keys minimum. This leaves 499 other items/keys available. If exactly over by 1 byte, that increases the keys used to 26. You would then only have 486 keys remaining. The only reason to really care about this is if you *needed* those extra 13 keys for something. That is the only effective difference of selecting the *exact* `NUMBER` to exactly fill vs. `NUMBER = 315`. – Makyen Jan 31 '17 at 19:21
  • @Makyen Having solved the issue, I'll note that the `NUMBER` actually varies entirely by the number of escapable characters you have in the input. So, `315` would be an insufficiently high value if each chunk had a total of more than 315 quotes, newlines, tabs, and slashes. – apsillers Jan 31 '17 at 19:28
  • @apsillers, Have you tried multiple test cases to verify that what you propose is the case (rather than just the two provided in the question)? I am in the process of looking through the Chrome source code, but have not (yet) found the double JSON stringification which you suggest is the cause. I'm still looking though. I'm happy if you want to be the one to look through the [code](https://cs.chromium.org/chromium/src/extensions/browser/api/storage/), as that will provide the definitive answer. – Makyen Jan 31 '17 at 19:41
  • @apsillers, if you are looking for a way to determine the *exact* amount that will be used by a particular value, it would be more appropriate to just set the value in `storage.local` and determine *exactly* how much space is *actually* used rather than trying to develop some methodology which could become outdated if Chrome changes, or not be the same on Firefox, or whatever browser is being used. The extension *should include code that handles a failure anyway* (even if you get the current algorithm), so why not just handle failures in appropriate code, which reduces the size being stored? (A sketch of this measure-in-`storage.local` approach follows these comments.) – Makyen Jan 31 '17 at 19:46
  • @Makyen Found the relevant function (`Allocate` in [settings_storage_quota_enforcer.cc](https://cs.chromium.org/chromium/src/extensions/browser/api/storage/settings_storage_quota_enforcer.cc?q=set+file:%5Esrc/extensions/browser/api/storage/+package:%5Echromium$&dr=CSs&l=5)) and edited it into my answer. – apsillers Jan 31 '17 at 19:49
  • @Makyen A reasonable criticism; the behavior to save in a JSON representation is not specified anywhere and could very easily change in the future. You're suggesting loading it into `storage.local` and reading `getBytesInUse`, which seems like a reasonable idea, since assuming that `local` and `sync` will use the same number of bytes is certainly a safer assumption than that the storage representation of `chrome.storage` will never change. It's much *faster* to assume a JSON representation, but much *safer* to actually use `storage.local` as a testing ground. – apsillers Jan 31 '17 at 19:57
  • @apsillers, Yes, that is what I am suggesting. I consider it *much* more likely that in a particular browser (& time), the format will be similar for `storage.sync` and `storage.local` rather than that the format will be the same over time (format change is unlikely and would have to account for old values), or that it will be the same across browsers (not asked, but certainly a consideration given current directions for browser extensions on multiple browsers). – Makyen Jan 31 '17 at 20:11
  • @apsillers As to the code you linked, yes, that is where the quota is checked. I am not, yet, seeing that the value that is passed to `Allocate()` is already a JSON string (and thus checking the length of a double JSON stringified string). – Makyen Jan 31 '17 at 20:17
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/134522/discussion-between-apsillers-and-makyen). – apsillers Jan 31 '17 at 20:17
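
The measurement idea from the comments, as a sketch (the function name and flow are mine; it relies only on the documented chrome.storage.local set/getBytesInUse/remove calls):

// Write a candidate chunk to storage.local and ask the browser how many
// bytes it actually consumed (key size + JSON-serialized value size),
// instead of trying to predict the serialization ourselves.
function measureBytes(index, segment, callback) {
    var probe = {};
    probe[index] = segment;
    chrome.storage.local.set(probe, function() {
        chrome.storage.local.getBytesInUse(index, function(bytes) {
            chrome.storage.local.remove(index, function() {
                callback(bytes);
            });
        });
    });
}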

3 Answers


The problem is that Chrome applies JSON.stringify to each string chunk before storing it, which adds three \ characters to the first string (which, added to the known 2 for the outer quotes, makes the full 5). This behavior is noted in the Chromium source code ("Calculate the setting size based on its JSON serialization size"), and the implementation does indeed compute size based on key.size() + value_as_json.size().

That is, the value in key_0 is the string

{"my_text":"11111111...

But it is stored as

"{\"my_text\":\"11111111..."

The reason you need to account for the two outer quotes is the same reason you need to account for added slashes. Both are indicative of the output of JSON.stringify operating on a string input.
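
You can see this directly in a console (illustrative, using the start of the chunk above):

var chunk = '{"my_text":"111';    // first 15 characters of the stored value
chunk.length;                     // 15
JSON.stringify(chunk).length;     // 20: 2 outer quotes + 3 escaping slashes
JSON.stringify(chunk);            // '"{\"my_text\":\"111"'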

You can confirm that escape-slashes are the issue by doing

var jsonstr = JSON.stringify(objectToStore).replace(/"/g,"Z")

and observing that the required NUMBER offset is 2 instead of 5, because {Zmy_textZ:Z11111... does not have extra slashes.

I haven't looked closely, but the Lorem text contains a newline and a tab (see: id faucibus diam.\), which your JSON.stringify (correctly) turns into \n\t but then Chrome's additional stringify further expands to \\n\\t, for an extra 2 bytes you do not account for. If that gets chunked with two other quotes or other escapable characters, it could cause a chunk with 4 unaccounted-for bytes.
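
For example (my own illustration of that expansion):

// One stringify turns the literal newline/tab into the two-character
// escapes \n and \t; a second stringify then doubles each backslash
// and escapes the quotes, inflating the chunk further.
JSON.stringify("\n\t").length;                  // 6:  "\n\t" incl. quotes
JSON.stringify(JSON.stringify("\n\t")).length;  // 12: "\"\\n\\t\"" incl. quotes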

The solution here is to account for the escaping that Chrome will do upon storage. I'd suggest applying JSON.stringify to each segment when evaluating whether it's too big, so that the correct number of bytes will be consumed by the chunking algorithm. Then, once you decide on a size that will not cause problems, even after being double-stringified, consume that many bytes from the regular string. Something like:

while(lengthInUtf8Bytes(JSON.stringify(segment)) > maxValueBytes)
    ...

Note that this will automatically account for the two bytes from outer quotes, so there's no need to even have a QUOTA_BYTES_PER_ITEM - NUMBER computation. In the terms you've presented it, with this approach, the NUMBER is 0.
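
Put together, the question's chunking loop would become something like this (my sketch of the suggestion, not a tested drop-in):

// Size each candidate segment by the length of its JSON-escaped form,
// since that is what Chrome actually stores and counts against the quota.
maxBytesPerItem = chrome.storage.sync.QUOTA_BYTES_PER_ITEM; // no "- NUMBER"

while (jsonstr.length > 0) {
    index = key + "_" + i++;
    maxValueBytes = maxBytesPerItem - lengthInUtf8Bytes(index);

    counter = maxValueBytes;
    segment = jsonstr.substr(0, counter);
    // JSON.stringify(segment) already includes the outer quotes and every
    // escape character Chrome will add, so no fudge factor is needed.
    while (lengthInUtf8Bytes(JSON.stringify(segment)) > maxValueBytes)
        segment = jsonstr.substr(0, --counter);

    storageObj[index] = segment;
    jsonstr = jsonstr.substr(counter);
}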

apsillers
  • Wow! I am really surprised that no one has ever done so much research on this particular topic before. I'd never have thought JSON stringification would be so tricky and difficult. Thanks anyway! **The only question before us is how to optimize that while loop condition.** I honestly feel that it's taking a lot more time than it should. Do you have any suggestions for that? – Gaurang Tandon Feb 01 '17 at 11:35
  • The expansion function that chrome internally uses is `JSON.stringify(JSON.stringify(str))` and not just `JSON.stringify(str)`. Reasons: (1) I copied over the first table on this page - http://www.tamasoft.co.jp/en/general-info/unicode.html - into the existing lorem ipsum dolor text and it gave an error when I used `JSON.stringify` **only once**. (2) **SIMPLER APPROACH:** I just checked at exactly which size of text does it give a QUOTA_BYTES_EXCEED error. I noticed that double stringification stops at the correct size while your current approach exceeds the max value allowed. – Gaurang Tandon Feb 04 '17 at 11:23
  • I've edited your answer to reflect the same and would be glad to hear some comments as to WHY this is working (even when the docs are saying that stringification is done **only once**) – Gaurang Tandon Feb 04 '17 at 11:24

For some reason, the technique only works when we do this:

while(lengthInUtf8Bytes(JSON.stringify(JSON.stringify(segment))) > maxValueBytes)

Here's a paste containing data that you can use to compare this approach and @apsillers' original one (and to verify that only the above approach works).

Here's the code I used to test all this stuff.

I am not accepting either answer yet since neither of them provides an acceptable explanation of why only the above approach works.

Gaurang Tandon
  • While you've certainly shown there are some extra bytes not being accounted for, I don't see any evidence of a double-stringify. For example, when I store your unicode-plus-lorem object into key `hello` in `storage.local` and do `chrome.storage.local.getBytesInUse("hello_0", console.log)` I see `8197`, which is indeed too big, but nowhere near the size of a double-stringify, which for the first chunk `hello_0` would be `9486`. It seems that there are probably just a few wide characters that still aren't being accounted for. – apsillers Feb 04 '17 at 16:42
  • @apsillers I think you're right there. Will investigate more. What could those wide characters that we're missing possibly be? – Gaurang Tandon Feb 05 '17 at 08:06

After carefully reading through this thread I finally understood where the extra bytes come from. apsillers actually referenced the part of the Chromium code that holds the answer: key.size() + value_as_json.size()

You have to account for the size of the key as well. So the working, accurate check is: while((lengthInUtf8Bytes(JSON.stringify(segment)) + key.length) > maxValueBytes)
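
In other words, the per-item quota check that Chromium performs can be estimated like this (a sketch based on the formula above; the helper name is mine):

// Per Chromium's Allocate(): an item consumes key.size() plus the size
// of the value's JSON serialization.
function bytesUsedBy(index, segment) {
    return lengthInUtf8Bytes(index) +
           lengthInUtf8Bytes(JSON.stringify(segment));
}
// A chunk fits if:
// bytesUsedBy(index, segment) <= chrome.storage.sync.QUOTA_BYTES_PER_ITEM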

Yonatan Naor